Infrared and Visible Image Homography Estimation Based on Feature Correlation Transformers for Enhanced 6G Space–Air–Ground Integrated Network Perception

Abstract: The homography estimation of infrared and visible images, a key technique for assisting perception, is an integral element within the 6G Space–Air–Ground Integrated Network (6G SAGIN) framework. It is widely applied in the registration of these two image types, leading to enhanced environmental perception and improved efficiency in perception computation. However, traditional estimation methods are frequently challenged by insufficient feature points and low feature similarity when dealing with these images, which results in poor performance. Deep-learning-based methods have attempted to address these issues by leveraging strong deep feature extraction capabilities but often overlook the importance of precisely guided feature matching in regression networks. Consequently, accurately acquiring feature correlations between multi-modal images remains a complex task. In this study, we propose a feature correlation transformer method, devised to offer explicit guidance for feature matching in the task of homography estimation between infrared and visible images. First, we propose a feature patch, which is used as a basic unit for correlation computation, thus effectively coping with modal differences between infrared and visible images. Additionally, we propose a novel cross-image attention mechanism to identify correlations between varied modal images, thus transforming the multi-source image homography estimation problem into a single-source image problem by achieving source-to-target image mapping in the feature dimension. Lastly, we propose a feature correlation loss (FCL) to induce the network into learning a distinctive target feature map, further enhancing source-to-target image mapping. To validate the effectiveness of the newly proposed components, we conducted extensive experiments that demonstrate the superiority of our method compared with existing methods in both quantitative and qualitative aspects.


Introduction
With the development of 6G Space-Air-Ground Integrated Network (6G SAGIN) [1] technology, distributed intelligent-assisted sensing, communication, and computing have become important aspects of future communication networks. This enables more extensive perception, real-time transmission, and the real-time computation and analysis of data. Smart sensors capture information from various modalities, such as visible images and infrared images, and then transmit this information in real time to edge computing [2][3][4] devices for perceptual computation. The registration of infrared and visible images can provide highly accurate perceptual images, which support more effective perceptual computations and applications, such as image fusion [5,6], target tracking [7,8], semantic segmentation [9], and surveillance [10], among others.

Contribution
To solve the problems of difficult feature correspondence capture, difficult feature matching, and poor interpretability in regression networks, we propose a new feature correlation transformer, called FCTrans, for the homography estimation of infrared and visible images. Inspired by the Swin Transformer [48], we employed a similar structure to explicitly guide feature matching. We achieved explicit feature matching by computing the correlation between infrared and visible images (one is the source image; the other is the target image) with the feature patch, rather than the pixel, as the unit within the window, and then derived a homography matrix, as shown in Figure 1. Specifically, we first propose a feature patch, a basic unit for computing correlations, to better cope with the modal differences between infrared and visible images. Second, we propose a cross-image attention mechanism to calculate the correlation between source and target images to effectively establish feature correspondence between different modal images. The method finds the correlation between the source and target images within a window in units of feature patches, thus projecting the source image to the target image in the feature dimension. However, infrared and visible images have significant pixel grayscale differences and weak image correlation. This may result in very small attention weights during the training process, which makes it difficult to effectively capture the relationship between features. To address this problem, we propose a method called feature correlation loss (FCL). This approach aims to encourage the network to learn a discriminative target feature mapping, which we call the projected target feature map. Then, we use the projected target feature map and the unprojected target feature map to obtain the homography matrix, thus converting the homography estimation problem between multi-source images into a problem between single-source images. Compared with previous methods, FCTrans explicitly guides feature matching by computing the correlation between infrared and visible images with a feature patch as the basic unit; additionally, it is more interpretable.

The contributions of this paper are summarized as follows:

•
We propose a new transformer structure: the feature correlation transformer (FCTrans). The FCTrans can explicitly guide feature matching, thus further improving feature matching performance and interpretability.

•
We propose a new feature patch to reduce the errors introduced by imaging differences in the multi-source images themselves for homography estimation.

•
We propose a new cross-image attention mechanism to efficiently establish feature correspondence between different modal images, thus projecting the source images into the target images in the feature dimensions.

•
We propose a new feature correlation loss (FCL) to encourage the network to learn a discriminative target feature map, which can better realize mapping from the source image to the target image.
The rest of the paper is organized as follows. In Section 2, we detail the overall architecture of the FCTrans and its components and introduce the loss function of the network. In Section 3, we present some experimental results and evaluations from an ablation study performed to demonstrate the effectiveness of the proposed components. In Section 4, the proposed method is discussed. Finally, some conclusions are presented in Section 5.

Methods
In this section, we first provide an overview of the overall architecture of the network. Second, we further give an overview of the proposed FCTrans and introduce the architecture of cross-image attention and the feature patch in the FCTrans. Finally, we show some details of the loss function, where the proposed FCL is described in detail.


Overview
Given a pair of visible and infrared grayscale image patches, I_v and I_r, of size H × W × 1 as the input to the network, we produced a homography matrix from I_v to I_r, denoted by H_vr. Similarly, we obtained the homography matrix H_rv by exchanging the order of the image patches I_v and I_r. The proposed model consisted of four modules: two shallow feature extraction networks (an infrared shallow feature extraction network, f_r(·), and a visible shallow feature extraction network, f_v(·)), an FCTrans generator, and a discriminator, as shown in Figure 2.


Figure 2. Overall architecture of the deep homography estimation network. The network consists of four modules: two shallow feature extraction networks (an infrared shallow feature extraction network, f_r(·), and a visible shallow feature extraction network, f_v(·)), an FCTrans generator, and a discriminator. Two consecutive FCTrans blocks, used to output the different feature maps (F_v^{l+1}, F_r^{l+1}, and F_c^{l+1}), are shown at the top of the figure. W-CIA and SW-CIA are cross-image attention modules with regular and shifted window configurations, respectively.

First, we converted images I_v and I_r into shallow feature maps F_v and F_r using the shallow feature extraction networks f_v(·) and f_r(·), respectively, which did not share weights. The purpose of the shallow feature extraction networks is to extract fine features that are meaningful for homography estimation from both the channel and spatial dimensions. Next, we employed the FCTrans (generator) to continuously query the correlation between feature patches of the target feature map and the source feature map to explicitly guide feature matching, thus achieving mapping from the source image to the target image in the feature dimension. Then, we utilized the projected target feature map and the unprojected target feature map to obtain the homography matrix, thus converting the homography estimation problem between multi-source images into that between single-source images. Finally, we applied the homography matrix to the source image to generate the warped image and distinguished the warped image from the target image with a discriminator to further optimize the homography estimation performance. We adopted the Spatial Transformer Network (STN) [50] to implement the warping operation.
The core innovation of our method is to design a new transformer structure for homography estimation: FCTrans. By taking the feature patch as the computing unit, FCTrans constantly queries the feature correlation between infrared and visible images to explicitly guide feature matching, thus realizing mapping from the source image to the target image. We employed a method to output the homography matrix by converting the homography estimation problem of multi-source images to that of single-source images. Compared with the previous HomoMGAN [47], we deeply optimized the generator to effectively improve the performance of homography estimation.
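To make the data flow concrete, the following is a minimal, runnable PyTorch sketch of the four-module pipeline. ShallowFeatureNet and ToyGenerator are simplified, hypothetical stand-ins for f_v(·)/f_r(·) and the FCTrans generator, not the authors' implementation; the DLT solve and STN warp are covered separately below.

```python
import torch
import torch.nn as nn

class ShallowFeatureNet(nn.Module):
    # Stand-in for f_v(.)/f_r(.); two separate instances are used, so the
    # visible and infrared branches do not share weights.
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class ToyGenerator(nn.Module):
    # Stand-in for the FCTrans generator: maps a source/target feature-map
    # pair to 4 corner offset vectors (8 values) for the DLT solver.
    def __init__(self, ch=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(2 * ch, 8))

    def forward(self, feat_src, feat_tgt):
        return self.head(torch.cat([feat_src, feat_tgt], dim=1))

f_v, f_r, gen = ShallowFeatureNet(), ShallowFeatureNet(), ToyGenerator()
I_v, I_r = torch.rand(4, 1, 128, 128), torch.rand(4, 1, 128, 128)
F_v, F_r = f_v(I_v), f_r(I_r)         # shallow feature maps
offsets_vr = gen(F_v, F_r)            # source: visible, target: infrared
offsets_rv = gen(F_r, F_v)            # swapped pair yields H_rv instead
print(offsets_vr.shape)               # torch.Size([4, 8])
```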

FCTrans Structure
Previous approaches [43][44][45][46][47] usually input the features of image pairs into a regression network by channel cascading, thus implicitly learning the association between image pairs without directly comparing their feature similarity. However, considering the significant imaging differences between infrared and visible images, this implicit feature-matching method may not accurately capture the feature correspondence between the two images, thus affecting the performance of homography estimation. To solve this problem, we propose a new transformer structure (FCTrans). This structure continuously queries the correlation between a feature patch in the source feature map and all feature patches in the corresponding window of the target feature map to achieve explicit feature matching, thus projecting the source image into the target image in the feature dimension. Then, we use the projected target feature map and the unprojected target feature map to obtain the homography matrix, thus converting the homography estimation problem between multi-source images into that between single-source images. The structure of the FCTrans network is shown in Figure 3.

Figure 3. The overall architecture of the FCTrans. In the l-th FCTrans block, we consider F_v^l as the query feature map (source feature map), F_c^l as the key/value feature map (projected target feature map), and F_r^l as the reference feature map (unprojected target feature map).


Assuming that the source and target images are the visible image, I_v, and the infrared image, I_r, respectively, the corresponding source shallow feature map and target shallow feature map are F_v and F_r, respectively. The same assumptions are applied in the rest of this paper. First, we input F_v and F_r into the patch partition module and linear embedding module, respectively, to obtain the feature maps F_v^0 and F_r^0 of size H/2 × W/2. Meanwhile, we made a deep copy of F_r^0 to obtain F_c^0, subsequently distinguishing the projected target feature map from the unprojected target feature map.
Then, we applied two FCTrans blocks with cross-image attention to F_v^0, F_r^0, and F_c^0. In the l-th FCTrans block, we regard F_v^l as the query feature map (source feature map), F_c^l as the key/value feature map (projected target feature map), and F_r^l as the reference feature map (unprojected target feature map). In addition, the cross-image attention operation in each FCTrans block requires F_v^l and F_c^l as inputs to obtain the projected target feature map F_c^{l+1}, as shown at the top of Figure 2. F_v^l and F_r^l are regarded as the query image and the reference image, respectively, and do not need to be projected; therefore, F_v^{l+1} and F_r^{l+1} are obtained through the FCTrans block without cross-image attention. The computations in two consecutive FCTrans blocks take the form

\hat{F}_k^l = \text{W-CIA}(\text{LN}(F_k^{l-1})) + F_k^{l-1}, \quad F_k^l = \text{MLP}(\text{LN}(\hat{F}_k^l)) + \hat{F}_k^l,
\hat{F}_k^{l+1} = \text{SW-CIA}(\text{LN}(F_k^l)) + F_k^l, \quad F_k^{l+1} = \text{MLP}(\text{LN}(\hat{F}_k^{l+1})) + \hat{F}_k^{l+1},

where LN(·) denotes the operation of the LayerNorm layer; MLP(·) denotes the operation of the MLP; and F_k^l indicates the feature map output by the l-th FCTrans block, where k ∈ {v, c, r} and F_v^l, F_c^l, and F_r^l denote the source feature map, the projected target feature map, and the unprojected target feature map, respectively (for k = c, the (S)W-CIA additionally takes the normalized source feature map as its query; for k = v and k = r, the block is applied without cross-image attention). To generate a hierarchical representation, we halved the feature map size and doubled the number of channels using the patch merging module. The two FCTrans blocks, together with a patch merging module, are called "Stage 1". "Stage 2" and "Stage 3" adopt a similar scheme; however, their FCTrans block numbers are 2 and 6, respectively, and "Stage 3" does not have a patch merging module. After three stages, each feature patch in F_c^{10} implies a correlation with all the feature patches in the corresponding window of the source feature map at different scales, thus achieving the goal of projecting feature information from the source image into the target image.
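A minimal PyTorch sketch of one such pre-norm block follows, under the reconstruction above. FCTransBlockSketch and the placeholder cia callable are hypothetical simplifications: window partitioning and the v/r branches are omitted.

```python
import torch
import torch.nn as nn

class FCTransBlockSketch(nn.Module):
    # One pre-norm block in the style of the equations above: (shifted-)window
    # cross-image attention with a residual, then an MLP with a residual.
    # `cia` is any callable implementing (S)W-CIA over token sequences.
    def __init__(self, dim, cia, mlp_ratio=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cia = cia
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, tok_v, tok_c):
        # tok_v: source (query) tokens; tok_c: projected target (key/value) tokens
        tok_c = tok_c + self.cia(self.norm_q(tok_v), self.norm_kv(tok_c))
        tok_c = tok_c + self.mlp(self.norm2(tok_c))
        return tok_c

blk = FCTransBlockSketch(dim=4, cia=lambda q, kv: kv)  # placeholder attention
tokens = torch.rand(2, 64, 4)
print(blk(tokens, tokens.clone()).shape)               # torch.Size([2, 64, 4])
```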
Finally, we concatenated F_r^{10} and F_c^{10} to build [F_r^{10}, F_c^{10}] and then input it to the homography prediction layer (comprising a LayerNorm layer, a global pooling layer, and a fully connected layer) to output 4 offset vectors (8 values). With the 4 offset vectors, we obtained the homography matrix H_vr by solving the DLT [19]. We use h(·) to represent the whole process, i.e.,

H_vr = h([F_r^{10}, F_c^{10}]),

where F_r^{10} represents the unprojected target feature map output by the 10th FCTrans block and F_c^{10} indicates the projected target feature map output by the 10th FCTrans block. In this way, we converted the homography estimation problem for multi-source images into the homography estimation problem for single-source images, simplifying network training. Similarly, assuming that the source and target images are the infrared image I_r and the visible image I_v, respectively, the homography matrix H_rv can be obtained based on F_v^{10} and F_c^{10}. Algorithm 1 shows some training details of the FCTrans.
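For concreteness, here is a self-contained sketch of the 4-point DLT step, assuming the 8 output values are (dx, dy) displacements of the four patch corners. This is a generic DLT solver, not the authors' exact implementation of [19].

```python
import torch

def dlt_from_offsets(offsets, h=128, w=128):
    # offsets: (B, 8) tensor of (dx, dy) displacements of the 4 patch corners.
    B = offsets.shape[0]
    src = torch.tensor([[0., 0.], [w - 1., 0.], [w - 1., h - 1.], [0., h - 1.]])
    src = src.unsqueeze(0).expand(B, 4, 2)
    dst = src + offsets.view(B, 4, 2)
    x, y = src[..., 0], src[..., 1]
    u, v = dst[..., 0], dst[..., 1]
    zero, one = torch.zeros_like(x), torch.ones_like(x)
    # Each correspondence contributes two rows of the 8x8 DLT system A h = b.
    row_u = torch.stack([x, y, one, zero, zero, zero, -u * x, -u * y], dim=-1)
    row_v = torch.stack([zero, zero, zero, x, y, one, -v * x, -v * y], dim=-1)
    A = torch.cat([row_u, row_v], dim=1)            # (B, 8, 8)
    b = torch.cat([u, v], dim=1).unsqueeze(-1)      # (B, 8, 1)
    h8 = torch.linalg.solve(A, b).squeeze(-1)       # 8 unknowns, H[2,2] fixed to 1
    return torch.cat([h8, torch.ones(B, 1)], dim=1).view(B, 3, 3)

H = dlt_from_offsets(torch.zeros(1, 8))             # zero offsets -> identity
print(torch.allclose(H, torch.eye(3).expand(1, 3, 3), atol=1e-4))  # True
```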

Feature Patch
In infrared and visible image scenes, feature-based methods show greater robustness and descriptive power than pixel-based methods in coping with modal differences, establishing correspondence, and handling occlusion and noise, resulting in more stable and accurate performance. In this study, we followed a similar idea, using a 2 × 2 feature patch as an image feature to participate in the attention computation instead of relying on pixels as the computational unit. Specifically, we further evenly partitioned the window of size M × M (set to 16 by default) in a non-overlapping manner and then obtained (M/2) × (M/2) feature patches of size 2 × 2, as shown in Figure 4. In Figure 4, we assume that the size of the window is 4 × 4, which results in 2 × 2 feature patches. By involving the feature patch as the basic computational unit in the attention calculation, we can capture the structural information in the image effectively while reducing the effect of modal differences on the homography estimation. The reshaping behind this partitioning is sketched below.
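A minimal sketch, assuming the window is stored as a dense tensor; window_to_feature_patches is a hypothetical helper illustrating how an M × M window becomes (M/2) × (M/2) flattened 2 × 2 feature-patch tokens.

```python
import torch

def window_to_feature_patches(win, p=2):
    # Split an M x M window into (M/p) x (M/p) feature patches of size p x p
    # and flatten each patch, so attention can use the patch (not the pixel)
    # as its basic unit. win: (B, M, M) tensor.
    B, M, _ = win.shape
    n = M // p
    patches = win.reshape(B, n, p, n, p).permute(0, 1, 3, 2, 4)  # (B, n, n, p, p)
    return patches.reshape(B, n * n, p * p)  # (B, N, D): N patches, D pixels each

win = torch.rand(1, 16, 16)          # default window size M = 16
tokens = window_to_feature_patches(win)
print(tokens.shape)                  # torch.Size([1, 64, 4]): 8 x 8 patches of 2 x 2
```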

Figure 4. An illustration of the feature patch in the proposed FCTrans architecture. In layer l (illustrated on the left), we employ a regular window partitioning scheme to partition the image into multiple windows and then further evenly partition them into feature patches inside each window. In the next layer, l + 1 (illustrated on the right), we apply a shifted window partitioning scheme to generate new windows and similarly evenly partition them into feature patches inside these new windows.

Algorithm 1: The training process of the FCTrans
Input: F_v and F_r
Output: FCL L_{fc}(F_v, F_r) and homography matrix H_vr
Select F_v input to the patch partition layer and linear embedding layer: F_v^0;
Select F_r input to the patch partition layer and linear embedding layer: F_r^0;
Select F_r^0 for deep copy: F_c^0;
for n < number_of_stages do
    Select F_v^l input to LayerNorm layer and MLP;
    Select F_r^l input to LayerNorm layer and MLP;
    Select F_c^l input to LayerNorm layer and MLP;
    Select F_v^l input to patch merging layer;
    Select F_r^l input to patch merging layer;
    Select F_c^l input to patch merging layer;
end
Calculate the homography matrix: H_vr = h([F_r^{10}, F_c^{10}]);
Return: L_{fc}(F_v, F_r) and H_vr;

Cross-Image Attention
In image processing, the cross-attention mechanism [51] can help models capture dependencies and correlations between different images or images and other modal data, thus enabling effective information exchange and fusion. In this study, we borrowed a similar idea and designed a cross-image attention mechanism for the homography estimation task, as shown in Figure 5. Cross-image attention takes the feature patch as the unit and finds the correlation between a feature patch in the source feature map and all feature patches in the target feature map within the window, thus projecting the source image into the target image in the feature dimension. The dimensionality of the feature patch is small; therefore, we use single-headed attention to compute cross-image attention.
First, we take F_v^{l−1} and F_c^{l−1} of size H/2^k × W/2^k (where k denotes the stage number), processed by the LayerNorm layer, as the query feature map and the key/value feature map. We adopt a (shifted) window partitioning scheme and a feature patch partitioning scheme to partition them into windows of size M × M containing (M/2) × (M/2) feature patches. Next, we flatten these windows in the feature patch dimension, thus reshaping the window size to N × D, where N denotes the number of feature patches ((M/2) × (M/2)) and D represents the number of pixels in a feature patch (2 × 2). Then, the window of F_v^{l−1} passes through a fully connected layer to obtain the query matrix, and the window of F_c^{l−1} passes through two different fully connected layers to obtain the key matrix and the value matrix, respectively. We compute the similarity between the query matrix and all key matrices to assign weights to each value matrix. The similarity matrix is computed using the dot product and then normalized to a probability distribution via the softmax function. In this way, we can query the similarity between each feature in F_v^{l−1} (represented by a feature patch) and all features in F_c^{l−1} within the corresponding windows of F_v^{l−1} and F_c^{l−1}, thus achieving the effect of explicit feature matching. Finally, after obtaining the weighted similarity matrix, we multiply the value matrix by the similarity matrix to obtain the final output matrix, y_c^{l−1}. Each feature patch in this output matrix, y_c^{l−1}, implies the correlation with all the feature patches in the window corresponding to the source feature map, thus achieving a mapping from the source image to the target image in the feature dimension. This implementation process can be described as follows:

\text{Attention}(Q, K, V) = \text{SoftMax}\left(QK^{T}/\sqrt{d} + B\right)V,

where Q, K, and V represent the query, key, and value matrices, respectively; d stands for the Q/K dimension, which is 2 × 2 in the experiment; and B represents the relative position bias. We used a feature patch as the unit of computation; therefore, the relative positions along each axis lie in the range [−M/2 + 1, M/2 − 1]. We parameterized a bias matrix, \hat{B} ∈ R^{(M−1)×(M−1)}, and the values in B were taken from \hat{B}. We rescaled the output matrix y_c^{l−1} of size N × D to match the size of the original feature map, i.e., H/2^k × W/2^k. This adjustment facilitates subsequent convolution operations or other image processing steps. In addition, we performed residual concatenation by adding the output feature map to the original feature map, F_c^{l−1}, to obtain the feature map \hat{F}_c^l, thus alleviating gradient vanishing.
In particular, there may be multiple non-adjacent sub-windows in the shifted window, so the Swin Transformer [48] employs a masking mechanism to restrict attention to each window. However, we adopt the feature patch as the basic unit of attention calculation instead of the pixel level, which makes the mask mechanism in the Swin Transformer [48] no longer applicable to our method. Considering that the size of the feature patch is 2 × 2 and the size of the window is set to be a multiple of 2, we generate the mask adapted to our method in steps of 2 based on the mask in the Swin Transformer.
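The following is a minimal, runnable sketch of this windowed, single-head cross-image attention with a learned relative position bias, under the defaults above (M = 16, so 8 × 8 feature patches and d = 2 × 2 = 4). The class is a hypothetical simplification and omits the shifted-window mask just described.

```python
import torch
import torch.nn as nn

class CrossImageAttentionSketch(nn.Module):
    # Single-head attention over feature-patch tokens within one window pair:
    # queries from the source window, keys/values from the target window,
    # plus a learned relative position bias B (shifted-window mask omitted).
    def __init__(self, n_side=8, d=4):       # 16x16 window -> 8x8 patches of 2x2
        super().__init__()
        self.d = d
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        # One bias per relative offset along each axis: (2*n_side - 1)^2 entries.
        self.bias_table = nn.Parameter(torch.zeros((2 * n_side - 1) ** 2))
        coords = torch.stack(torch.meshgrid(
            torch.arange(n_side), torch.arange(n_side), indexing="ij"), dim=-1)
        rel = coords.reshape(-1, 1, 2) - coords.reshape(1, -1, 2) + n_side - 1
        self.register_buffer("bias_idx", rel[..., 0] * (2 * n_side - 1) + rel[..., 1])

    def forward(self, x_src, x_tgt):
        # x_src, x_tgt: (B, N, D) feature-patch tokens of corresponding windows.
        Q, K, V = self.q(x_src), self.k(x_tgt), self.v(x_tgt)
        attn = Q @ K.transpose(-2, -1) / self.d ** 0.5
        attn = torch.softmax(attn + self.bias_table[self.bias_idx], dim=-1)
        return attn @ V                       # y_c: projected target tokens

cia = CrossImageAttentionSketch()
out = cia(torch.rand(1, 64, 4), torch.rand(1, 64, 4))
print(out.shape)                              # torch.Size([1, 64, 4])
```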

Loss Function
In this study, the generative adversarial network architecture was used to train the network, which consists of two parts: a generator (FCTrans) and a discriminator (D). The generator is responsible for generating the homography matrix to obtain the warped image. The discriminator aims to distinguish the shallow feature maps of the warped image and the target image. To train the network, we define the generator loss function and the discriminator loss function. In particular, we introduce the proposed FCL in detail in the generator loss function.

Loss Function of the Generator
To address the difficulty the network has in adequately capturing the feature relationship between infrared and visible images, we propose a constraint called the feature correlation loss (FCL). The FCL aims to minimize the distance between the projected target feature map, F_c^l, and the source feature map, F_v^l, while maintaining a large distance between the unprojected target feature map, F_r^l, and the source feature map, F_v^l. This scheme encourages the network to continuously learn the feature correlation between the projected target feature map (F_c^l) and the source feature map (F_v^l) within the window and then to continuously weight the projected target feature map across multiple stages to achieve better feature matching with the source feature map. Our FCL constraint is the sum of the losses generated by all FCTrans blocks:

L_{fc}(F_v, F_r) = \sum_{l} L_{fc}^{l}(F_v^l, F_c^l, F_r^l),

where F_v^l, F_c^l, and F_r^l represent the source feature map, the projected target feature map, and the unprojected target feature map output by the l-th FCTrans block, respectively; L_{fc}^{l}(F_v^l, F_c^l, F_r^l) denotes the loss generated by the l-th FCTrans block; and F_v and F_r stand for the visible shallow feature map and the infrared shallow feature map, respectively.
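A sketch of the FCL idea only: the per-block term is not spelled out here, so the triplet-style margin form below (with the hypothetical fcl_block_term helper and margin parameter) is an assumption that merely realizes the pull-close/push-away behavior described above, not the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def fcl_block_term(F_v_l, F_c_l, F_r_l, margin=1.0):
    # Pull the projected target map toward the source map while keeping the
    # unprojected target map distant; the margin form is an assumption.
    d_pos = F.l1_loss(F_c_l, F_v_l)   # projected target vs. source
    d_neg = F.l1_loss(F_r_l, F_v_l)   # unprojected target vs. source
    return F.relu(d_pos - d_neg + margin)

def feature_correlation_loss(per_block_maps, margin=1.0):
    # Sum the per-block terms over all FCTrans blocks, as in the equation above.
    return sum(fcl_block_term(fv, fc, fr, margin) for fv, fc, fr in per_block_maps)

maps = [(torch.rand(1, 64, 4), torch.rand(1, 64, 4), torch.rand(1, 64, 4))
        for _ in range(10)]           # ten FCTrans blocks
print(feature_correlation_loss(maps))
```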
To perform unsupervised learning, we minimize three other losses in addition to the FCL constraint on FCTrans network training. The first is the feature loss, L_f(I_v, I_r), which encourages the feature maps of the warped and target images to have similar data distributions [47]. Here, I_v and I_r represent the visible image patch and the infrared image patch, respectively; F_v and F_r indicate the visible shallow feature map and the infrared shallow feature map, respectively; and F'_r denotes the warped infrared shallow feature map obtained by warping F_r with the homography matrix H_rv. The second term is the homography loss, L_h(H_vr, H_rv), which forces H_vr and H_rv to be mutually inverse matrices [47] by penalizing the deviation of their product from the third-order identity matrix E. Here, H_vr represents the homography matrix from I_v to I_r, and H_rv denotes the homography matrix from I_r to I_v.
The third term is the adversarial loss, L_adv(F'_r), which forces the feature map of the warped image to be closer to that of the target image [47]. Here, log D_{θ_D}(·) indicates the probability that the warped shallow feature map resembles a target shallow feature map, N represents the batch size, and F'_r stands for the warped infrared shallow feature map.
In practice, we can derive the losses L_f(I_r, I_v), L_adv(F'_v), and L_{fc}(F_r, F_v) by exchanging the order of the image patches I_v and I_r. Thus, the total loss function of the generator combines the feature, homography, adversarial, and feature correlation losses in both directions. Here, I_v and I_r stand for the visible image patch and the infrared image patch, respectively; F_v and F_r indicate the visible shallow feature map and the infrared shallow feature map, respectively; F'_v and F'_r represent the warped visible shallow feature map and the warped infrared shallow feature map, respectively; and λ, µ, and ξ are the weights of the homography, adversarial, and FCL terms, set as 0.01, 0.005, and 0.05, respectively. We provide an analysis of parameter ξ in Appendix A.
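A sketch of one plausible combination of these terms. The grouping below (feature loss unweighted; λ, µ, ξ on the homography, adversarial, and FCL terms; both directions summed) is an assumption consistent with the text, not the paper's confirmed equation.

```python
import torch

def generator_loss(losses, lam=0.01, mu=0.005, xi=0.05):
    # losses: dict of the bidirectional terms named in the text. The grouping
    # here is an assumption consistent with the stated weights.
    return (losses["f_vr"] + losses["f_rv"]
            + lam * losses["h"]
            + mu * (losses["adv_r"] + losses["adv_v"])
            + xi * (losses["fc_vr"] + losses["fc_rv"]))

terms = {k: torch.tensor(1.0) for k in
         ["f_vr", "f_rv", "h", "adv_r", "adv_v", "fc_vr", "fc_rv"]}
print(generator_loss(terms))   # tensor(2.1200)
```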

Loss Function of the Discriminator
The discriminator aims to distinguish the feature maps of the warped image and the target image. Following [47], the loss L_D(F_r, F'_v) between the feature map of the infrared image and the warped feature map of the visible image is computed over a batch of size N. Here, F_r indicates the infrared shallow feature map; F'_v represents the warped visible shallow feature map; a and b represent the labels of the shallow feature maps F_r and F'_v, which are set as random numbers from 0.95 to 1 and from 0 to 0.05, respectively; and log D_{θ_D}(·) indicates the probability that the warped shallow feature map is similar to the target shallow feature map. In practice, we can obtain the loss L_D(F_v, F'_r) by swapping the order of I_v and I_r. Thus, the total loss function of the discriminator is the sum of both directions, L_D(F_r, F'_v) + L_D(F_v, F'_r), where F_v and F_r indicate the visible and infrared shallow feature maps, respectively, and F'_v and F'_r represent the warped visible and warped infrared shallow feature maps, respectively.
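A sketch assuming a least-squares objective over the soft labels described above; the exact functional form is not reproduced here, so treat this as an illustrative assumption.

```python
import torch

def discriminator_loss(d_real, d_fake):
    # Soft labels as described: a ~ U[0.95, 1] for target feature maps,
    # b ~ U[0, 0.05] for warped ones; the least-squares form is an assumption.
    a = 0.95 + 0.05 * torch.rand_like(d_real)
    b = 0.05 * torch.rand_like(d_fake)
    return ((d_real - a) ** 2).mean() + ((d_fake - b) ** 2).mean()

print(discriminator_loss(torch.rand(8, 1), torch.rand(8, 1)))
```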

Experimental Results
In this section, we first briefly introduce the synthetic benchmark dataset and the real-world dataset and then describe some implementation details of the proposed method. Next, we briefly present the evaluation metrics used for the synthetic benchmark dataset and the real-world dataset. We then compare our method with existing methods on the synthetic benchmark dataset and the real-world dataset to demonstrate its performance. We compare our method with traditional feature-based methods and deep-learning-based methods. The traditional feature-based methods comprise eight combinations of four feature descriptors (SIFT [20], ORB [22], BRISK [23], and AKAZE [24]) with two outlier rejection algorithms (RANSAC [36] and MAGSAC++ [38]). The deep-learning-based methods include three methods (CADHN [43], DADHN [46], and HomoMGAN [47]). Finally, we performed ablation experiments to demonstrate the effectiveness of all the newly proposed components.

Dataset
We used the same synthetic benchmark dataset as Luo et al. [47] to evaluate our method. The dataset consists of unregistered infrared and visible image pairs of size 150 × 150, which include 49,738 training pairs and 42 test pairs. In particular, the test set also includes the corresponding infrared ground-truth image, I_GT, for each image pair, thus facilitating the presentation of channel mixing results in qualitative comparisons.
Meanwhile, the test set provides four pairs of ground-truth matching corner coordinates for each pair of test images for evaluation calculation.
Furthermore, we utilized the CVC Multimodal Stereo Dataset [52] as our real-world dataset. This collection includes 100 pairs of long-wave infrared and visible images, primarily taken on city streets, each with a resolution of 506 × 408. Figure 6 displays four representative image pairs from the dataset.


Implementation Details
Our experimental environment parameters are shown in Table 1. During data preprocessing, we resized the image pairs to a uniform size of 150 × 150 and then randomly cropped them into image patches of size 128 × 128 to increase the amount of data. In addition, we normalized and grayscaled the images to obtain the patches I_v and I_r as the input of the model. Our network was trained under the PyTorch framework. To optimize the network, we employed the adaptive moment estimation (Adam) [53] optimizer with the initial learning rate set to 0.0001, adjusted by a decay strategy during the training process. All parameters of the proposed method are shown in Table 2. In each iteration of model training, we first updated the discriminator (D) parameters and then the generator (FCTrans); each loss function is optimized by backpropagation in every iteration step. Specifically, we first utilized the generator to generate a homography matrix through which the source image is warped to a warped image. We then trained the discriminator using the warped and target images: we calculated the loss function of the discriminator using Equation (8) and updated the discriminator's parameters by backpropagation. Next, we trained the generator: we computed the loss function of the generator using Equation (10) and updated the generator's parameters by backpropagation. We made the network continuously tune the homography matrix through the adversarial game between the generator and the discriminator. Meanwhile, we periodically saved the model state during training for subsequent analysis and evaluation.
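A minimal sketch of this alternating schedule; train_epoch and all callables passed into it (warp_with_generator, generator_full_loss, and the loss helpers) are hypothetical stand-ins, not the authors' training code.

```python
import torch

def train_epoch(gen, disc, loader, discriminator_loss, generator_full_loss,
                warp_with_generator, lr=1e-4):
    # The discriminator is updated before the generator in every iteration,
    # per the schedule described above.
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr)
    opt_g = torch.optim.Adam(gen.parameters(), lr=lr)
    for I_v, I_r in loader:
        # 1) Discriminator step (Equation (8)): warp with the current generator,
        #    then learn to separate warped from target shallow feature maps.
        with torch.no_grad():
            F_v_feat, F_r_warped = warp_with_generator(gen, I_v, I_r)
        loss_d = discriminator_loss(disc(F_v_feat), disc(F_r_warped))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # 2) Generator step (Equation (10)): update FCTrans through the full loss.
        loss_g = generator_full_loss(gen, disc, I_v, I_r)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```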

Evaluation Metrics
The real-world dataset lacks ground-truth matching point pairs; therefore, we employed two distinct evaluation metrics: the point matching error [43,44] for the real-world dataset and the corner error [40,41,47] for the synthetic benchmark dataset. The corner error [40,41,47] is calculated as the average ℓ2 distance between the corner points transformed by the estimated homography and those transformed by the ground-truth homography; a smaller value signifies superior homography estimation performance. It is computed as follows:

\text{Corner Error} = \frac{1}{4}\sum_{i=1}^{4}\|x_i - y_i\|_2,

where x_i and y_i are the corner point i transformed by the estimated homography and the ground-truth homography, respectively. The point matching error [43,44] measures the average ℓ2 distance between pairs of manually labeled matching points; lower values indicate superior homography estimation performance. It is calculated as follows:

\text{Point Matching Error} = \frac{1}{N}\sum_{i=1}^{N}\|x_i - y_i\|_2,

where x_i denotes point i transformed by the estimated homography, y_i denotes the matching point corresponding to point i, and N represents the number of manually labeled matching point pairs.
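Both metrics reduce to an average pairwise ℓ2 distance and can be computed directly, as in this short sketch (the helper names are illustrative):

```python
import numpy as np

def corner_error(pred_corners, gt_corners):
    # Average L2 distance between the 4 corners transformed by the estimated
    # and the ground-truth homography (synthetic benchmark metric).
    return float(np.linalg.norm(pred_corners - gt_corners, axis=-1).mean())

def point_matching_error(pred_pts, gt_pts):
    # Average L2 distance between transformed points and their manually
    # labeled matches (real-world metric).
    return float(np.linalg.norm(pred_pts - gt_pts, axis=-1).mean())

print(corner_error(np.zeros((4, 2)), np.ones((4, 2))))  # sqrt(2) ~ 1.414
```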

Comparison on Synthetic Benchmark Datasets
We conducted qualitative and quantitative comparisons between our method and all the comparative methods on the synthetic benchmark dataset to demonstrate its performance.

Qualitative Comparison
First, we compared our method with eight traditional feature-based methods, as shown in Figure 7. The traditional feature-based methods had difficulty obtaining stable feature matching in infrared and visible image scenes, which led to severe distortions in the warped image. More specifically, SIFT [20] and AKAZE [24] demonstrate algorithm failures in both examples, as shown in (2) and (3). However, our method shows better adaptability in infrared and visible image scenes, and its performance is significantly better than that of the traditional feature-based methods. Although SIFT [20] + RANSAC [36] in the first example is the best performer among the feature-based methods and does not exhibit severe image distortion, it still shows a large number of yellow ghosts in the ground region. These yellow ghosts indicate that the corresponding regions between the warped and ground-truth images are not aligned. However, our method shows significantly fewer ghosts in the ground region compared with the SIFT [20] + RANSAC [36] method, showing superior results. This indicates that our method has higher accuracy in processing infrared and visible image scenes.
Figure 7. Qualitative comparison with the traditional feature-based methods. We mixed the blue and green channels of the warped infrared image with the red channel of the ground-truth infrared image to obtain the above visualization and the remaining visualizations in this paper using this method. The unaligned pixels are presented as yellow, blue, red, or green ghosts.
Secondly, we compared our method with three deep-learning-based methods, as shown in Figure 8. Our method exhibited higher accuracy in image alignment compared with the other methods. In addition, CADHN [43], DADHN [46], and HomoMGAN [47] showed different extents of green ghosting when processing the door frame edges and door surface textures in (1). However, these ghosts were significantly reduced by our method, which fully illustrates its superiority. Similarly, our method achieves superior results on the alignment of the cars and people in (2) compared with the other deep-learning-based methods.
Figure 8. Qualitative comparison with the deep-learning-based methods on two examples, (1) and (2). From left to right: (a) visible image; (b) infrared image; (c) ground-truth infrared image; (d) CADHN [43]; (e) DADHN [46]; (f) HomoMGAN [47]; and (g) the proposed algorithm. Error-prone regions are highlighted using red and yellow boxes, and the corresponding regions are zoomed in.


Quantitative Comparison
To demonstrate the performance of the proposed method, we performed a quantitative comparison with all the other methods. We classify the testing results into three levels based on performance: easy (top 0-30%), moderate (top 30-60%), and hard (top 60-100%). We report the corner error, the overall average corner error, and the failure rate of each algorithm for the three levels in Table 3, where rows 3-10 are for the traditional feature-based methods and rows 11-13 are for the deep-learning-based methods. In particular, the failure rate in the last column of Table 3 indicates the ratio of the number of test images on which the algorithm failed to the total number of test images. I_{3×3} in row 2 denotes the identity transformation, whose error reflects the original distance between point pairs. The "Nan" in Table 3 indicates that the corner error is not present at this level; this usually means that the method has a large number of failures in the test set, so no test results can be classified into this level. As can be seen in Table 3, our method achieved the best performance at all three levels. In particular, the average corner error of our method decreased from 5.06 to 4.92 compared with the suboptimal algorithm, HomoMGAN [47]. Specifically, the performance of the feature-based methods is significantly lower than that of the deep-learning-based methods at all three levels, and all of them show algorithm failures. Meanwhile, although the average corner error of SIFT [20] + RANSAC [36] is 50.87, the average corner errors of the other feature-based methods are above 100. This illustrates the generally worse performance of the traditional feature-based methods. Although SIFT [20] + RANSAC [36] has the best performance among all the feature-based methods, it fails on most of the test images. As a result, most traditional feature-based methods in infrared and visible image scenes usually fail to extract or match enough key points, which leads to algorithm failure or poor performance and makes them difficult to apply in practice.
In contrast, deep-learning-based methods can easily avoid this problem. They not only avoid algorithm failure but also significantly improve performance. CADHN [43], DADHN [46], and HomoMGAN [47] achieved excellent performance in the test images with average corner errors of 5.25, 5.08, and 5.06, respectively. However, they are guided implicitly in the regression network for feature matching, which leads to limited performance in homography estimation. In contrast, our method converts the homography estimation problem for multi-source images into a problem for single-source images by explicitly guiding feature matching, thus significantly reducing the difficulties incurred due to the large imaging differences of multi-source images for network training. As shown in Table 3, our method significantly outperforms existing deep-learning-based methods in terms of error at all three levels and overall average corner error, and the average corner error can be reduced to 4.91. This sufficiently demonstrates the superiority of explicit feature matching in our method.

Comparison on the Real-World Dataset
We performed a quantitative comparison with 11 methods on the real-world dataset to demonstrate the effectiveness of our method, as shown in Table 4. The evaluation results of the feature-based methods on the real-world dataset are similar to those on the synthetic benchmark dataset, showing varying degrees of algorithm failure and poor performance. In contrast, the deep-learning-based methods performed significantly better than the feature-based methods, and no algorithm failures were observed. The proposed algorithm achieves the best performance among the deep-learning-based methods; the performance of CADHN [43] and DADHN [46] is comparable, with average point matching errors of 3.46 and 3.47, respectively. Notably, compared with HomoMGAN [47], our algorithm significantly improves the performance by explicitly guiding feature matching in the regression network, and the average point matching error is significantly reduced from 3.36 to 2.79. This fully illustrates the superiority of explicitly guided feature matching over implicitly guided feature matching.

Ablation Studies
In this section, we present the results of the ablation experiments performed on the FCTrans, feature patch, cross-image attention, and FCL and combine some visualization results to demonstrate the effectiveness of the proposed method and its components.

FCTrans
The proposed FCTrans is an architecture similar to the Swin Transformer [48]. To evaluate the effectiveness of FCTrans, we replaced it with the Swin Transformer [48] to serve as the backbone network of the generator; the results are shown in row 2 of Table 5. In this process, we channel-cascade the shallow features of the infrared and visible images and feed them into the Swin Transformer [48] to generate four 2D offset vectors (eight values), which, in turn, are solved by DLT [19] to obtain the homography matrix. By comparing the data in rows 2 and 6 of Table 5, we observe a significant decrease in the average corner error from 5.13 to 4.91. This result demonstrates that the proposed FCTrans can effectively improve the homography estimation performance compared with the Swin Transformer [48].

Feature Patch
To verify the validity of the feature patch, we removed all operations related to the feature patch from our network; the results are shown in row 3 of Table 5. Due to the removal of the feature patch, we performed the attention calculation at the pixel level within the window. By comparing the data in rows 3 and 6 of Table 5, we see that the average corner error is reduced from 5.02 to 4.91. This result shows that the feature patch is more adept at capturing structural information in images, thus reducing the effect of modal differences on homography estimation.

Cross-Image Attention
To verify the effectiveness of cross-image attention, we used self-attention [48] to replace cross-image attention in our experiments; the results are shown in row 4 of Table 5. In this process, we channel-concatenated the shallow features of the infrared image and the visible image as the input of self-attention [48] to obtain the homography matrix. The replaced network no longer applies the FCL; therefore, we removed the operations associated with the FCL. By comparing rows 4 and 6 in Table 5, we found that the average corner error significantly decreases from 5.03 to 4.91. This is a sufficient indication that cross-image attention can effectively capture the correlation between different modal images, thus improving the homography estimation performance.

FCL
We removed the term of Equation (4) from Equation (8) to verify the validity of the FCL; the results are shown in row 5 of Table 5. By comparing the data in rows 5 and 6 of Table 5, we found that the average corner error was significantly reduced from 5.10 to 4.91. In addition, we visualized the attention weights of the window to further verify the validity of the FCL; the results are shown in Figure 9. As shown in the comparison of (a) and (c), the FCL allows the network to better adapt to the modal differences between infrared and visible images, thus achieving better performance in capturing inter-feature correlations.
Additionally, the performance of the proposed method in (b) and (d) is slightly superior to that of the "w/o FCL" variant, with the average corner error reduced from 5.17 to 4.71.

Discussion
In this study, we proposed a feature correlation transformer method which significantly improves the accuracy of homography estimation in infrared and visible images. By introducing feature patch and cross-image attention mechanisms, our method dramatically improves the precision of feature matching. It tackles the challenges induced by the insufficient quantity and low similarity of feature points in traditional methods. Extensive experimental data demonstrate that our method significantly outperforms existing techniques in terms of both quantitative and qualitative results. However, our method also has some limitations. Firstly, although our method performs well in dealing with modality differences in infrared and visible images, it might need further optimization and adjustment when processing images in large-baseline scenarios. In future research, we aim to further improve the robustness of our method to cope with challenges in large-baseline scenarios. Moreover, we will further explore combining our method with other perception computing tasks to enhance the perception capability of 6G SAGINs.


Conclusions
In this study, we have proposed a feature correlation transformer method for the homography estimation of infrared and visible images, aiming to provide a higher-accuracy environment-assisted perception technique for 6G SAGINs. Compared with previous methods, our approach explicitly guides feature matching in a regression network, thus enabling the mapping of source-to-target images in the feature dimension. With this strategy, we converted the homography estimation problem between multi-source images into that of single-source images, which significantly improved the homography estimation performance. Specifically, we innovatively designed a feature patch as the basic unit for correlation queries to better handle modal differences. Moreover, we designed a cross-image attention mechanism that enabled mapping the source-to-target images in the feature dimensions. In addition, we have proposed a feature correlation loss (FCL) constraint that further optimizes the mapping from source to target images. Extensive experimental results demonstrated the effectiveness of all the newly proposed components; our performance is significantly superior to that of existing methods. Nevertheless, the performance of our method may be limited in large-baseline infrared and visible image scenarios. Therefore, we intend to further explore the problem of homography estimation in large-baseline situations in future studies in order to further enhance the scene perception capability of the 6G SAGIN.