Article

A Dual-Branch Network of Strip Convolution and Swin Transformer for Multimodal Remote Sensing Image Registration

1 School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
2 Shaanxi Key Laboratory of Complex System Control and Intelligent Information Processing, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1071; https://doi.org/10.3390/rs17061071
Submission received: 18 February 2025 / Revised: 16 March 2025 / Accepted: 17 March 2025 / Published: 18 March 2025

Abstract: Multimodal remote sensing image registration aims to achieve effective fusion and analysis of information by accurately aligning image data obtained by different sensors, thereby improving the accuracy and application value of remote sensing data in engineering. However, current advanced registration frameworks cannot accurately register multi-source remote sensing images that undergo large-scale rigid distortions such as rotation or scaling. This paper presents a stable, high-precision, end-to-end registration network with dual-branch feature extraction to address the stringent registration requirements encountered in practical engineering applications. The network consists of three parts: dual-branch feature extraction, affine parameter regression, and a spatial transformation network. In the upper branch of the dual-branch feature extraction module, we combine multi-scale convolution with the Swin Transformer to fully extract features of remote sensing images at different scales and levels and to better capture global structure and context information. In the lower branch, we incorporate strip convolution blocks to capture long-range contextual information along various directions in multimodal images. Additionally, we introduce an efficient and lightweight ResNet-like module to enhance global features. We also parallelize convolution kernels of different sizes in the affine parameter regression network to improve the accuracy of the transformation parameters and the robustness of the model. We conducted experiments on panchromatic–multispectral, infrared–optical, and SAR–optical image pairs with large-scale rigid transformations. The experimental results show that our method achieves the best registration performance.

1. Introduction

Multimodal remote sensing image registration is a critical research task in the field of remote sensing image processing and interpretation. This process involves the precise alignment of image data acquired from various sensors, enhancing the accuracy and effectiveness of information extraction [1]. Since different sensors possess distinct imaging mechanisms and characteristics, multimodal images allow us to fully leverage the advantages of various sensors while compensating for the limitations of any single sensor. For instance, optical sensors can deliver high-resolution color and texture information, making them highly effective for identifying the types and details of objects. In contrast, synthetic aperture radar (SAR) sensors can penetrate clouds and fog, operate in adverse weather conditions, and are more sensitive to changes in terrain [2]. Additionally, infrared images offer distinct advantages in detecting concealed targets at night and in poor weather. In practice, however, multimodal image pairs are frequently affected by the sensor’s viewing angle and environmental factors, so perfectly aligned images cannot be obtained directly. Such misalignment adversely impacts subsequent remote sensing applications, such as image fusion [3], change detection [4], and agricultural monitoring [5]. Therefore, multimodal remote sensing image registration is a prerequisite for image processing in the field of remote sensing and plays a vital role in subsequent remote sensing image interpretation.
There are numerous challenges associated with multimodal remote sensing image registration [6]. Remote sensing images captured by different sensors exhibit variations in resolution and radiation characteristics. Additionally, multimodal remote sensing images are often obtained from different satellites or sensors at varying times and spatial locations, resulting in differences in image viewing angles, shooting heights, azimuths, and other factors. These changes may result in geometric deformations in the image, including rotation and translation [7]. To address these challenges, researchers have developed a range of methods for multimodal remote sensing image registration.
Traditional multimodal remote sensing image registration methods are generally categorized into two types: region-based methods and feature-based methods. Region-based methods primarily determine the registration parameters by comparing regional features in images and maximizing a similarity measure between them. Common techniques include mutual information (MI) [8] and normalized cross-correlation (NCC) [9]. In contrast, feature-based methods establish correspondences between two images by extracting significant features from each image. The process is segmented into several stages, including feature extraction, feature matching, parameter estimation, and spatial transformation. SIFT [10], ORB [11], and SURF [12] are representative traditional feature-based methods. While these traditional methods are effective for single-modal remote sensing image registration, they exhibit certain limitations when applied to multimodal remote sensing image registration. There are significant nonlinear differences between multimodal image pairs, which can hinder region-based methods from accurately measuring regional similarity between the two images. In feature-based methods, each step influences the registration accuracy of subsequent steps. In particular, when confronted with complex nonlinear deformations and radiometric variations between multimodal image pairs, these methods struggle to extract a sufficient number of feature points, leading to unsatisfactory registration outcomes. In addition, both categories of traditional methods are susceptible to noise.
With the rapid advancement of deep learning in recent years, an increasing number of deep learning-based methods have been applied in the field of remote sensing image registration [13,14,15,16,17]. Deep learning-based methods can automatically learn the features between multimodal image pairs, uncover implicit, more representative, and robust features across different modalities, and effectively address the feature discrepancies between multimodal images. They also possess a strong anti-noise capability, effectively identifying and filtering out noise, extracting genuinely relevant feature information, and minimizing the impact of noise on registration results. However, most current deep learning-based methods tend to employ Siamese (twin) convolutional neural networks (CNNs) as feature extractors or descriptors for remote sensing image matching tasks [18,19].
Because these methods do not perform registration in an end-to-end manner, they lack the subsequent step of warping the moving image, which reduces registration efficiency. Some researchers have therefore proposed utilizing a neural network to directly learn the feature representation of an image and the affine parameters between image pairs from the original input images [20,21]. This approach employs end-to-end training, integrating processes such as feature extraction and parameter learning into a single model. It eliminates the complex intermediate steps typically involved in learned feature-matching methods, including feature point detection and feature descriptor calculation. It can also more effectively handle various transformations that may occur in an image, including rotation, scaling, translation, and shearing, demonstrating greater versatility and adaptability. However, most of these methods currently require a substantial number of image pairs to effectively learn feature representations and affine parameters. When attempting to learn alignment directly from the original image, these methods may struggle to accurately extract features and determine affine parameters, particularly in scenarios involving complex textures, blurry boundaries, or occlusions, as well as in cases with large-scale rigid transformations within the image.
To address the aforementioned challenges, we propose an end-to-end dual-branch feature extraction network for multimodal remote sensing image registration. The network comprises three modules: dual-branch feature extraction, affine parameter regression, and a spatial transformer network (STN) [22]. Initially, we employ a dual-branch feature extraction network that operates without shared parameters to comprehensively extract features from remote sensing images. This module consists of multi-scale and Swin Transformer [23] self-attention modules in the upper branch, along with multi-directional convolution, attention-extracted features, and efficient model (EMO) in the lower branch, to extract deep features. The deep features extracted from the two images are then concatenated and fed into the affine parameter regression module to predict rigid transformation parameters, such as rotation and translation. Finally, the predicted affine transformation parameters are input into the spatial transformation network, which directly applies these parameters to the moving image to be registered, thereby facilitating multimodal remote sensing image registration. The overall registration process is shown in Figure 1. In summary, the main contributions of this paper are as follows:
(1)
We propose an end-to-end multimodal remote sensing image registration network that incorporates dual-branch feature extraction. This network consists of three components: feature extraction, affine parameter regression, and a spatial transformation module.
(2)
In our dual-branch feature extraction module, the upper branch is designed for multi-scale feature extraction, allowing it to account for information across various scales and levels. It employs the Swin Transformer self-attention mechanism to model long-range dependencies within the image. In the lower branch, we introduce a module that integrates strip convolution blocks, batch normalization (BN), and multilayer perceptron (MLP). We define this module as the SBM module, which aims to capture remote contextual information from four different directions. Additionally, we combine channel and spatial attention modules to minimize irrelevant feature interference.
(3)
We design convolutional kernels of varying sizes that operate in parallel within the affine parameter regression network to enhance the adaptability of the network to a diverse range of features. This approach increases the flexibility of the model and its generalizability across different input images.
(4)
We conduct extensive experiments on three datasets, panchromatic–multispectral, SAR–optical, and infrared–optical, with different scales of rigid transformations. Our experiments demonstrate strong performance compared with the most advanced multimodal remote sensing image registration methods, validating the effectiveness of our network in image registration.
The rest of this paper is organized as follows. Section 2 introduces the related work in the field of multimodal remote sensing image registration. Section 3 details the proposed method. Section 4 experimentally demonstrates the feasibility of the proposed method. Section 5 discusses the limitations of the proposed method and provides prospects for the future. Section 6 presents the concluding remarks.

2. Related Works

2.1. Based on Traditional Registration Method

Traditional multimodal image registration methods can be categorized into region-based methods and feature-based methods. The most widely used region-based method is mutual information (MI) [8]. This method achieves image registration by continuously adjusting the spatial transformation parameters between images so as to maximize the MI. M. Shadaydeh et al. enhanced MI-based registration by addressing the issues of numerous local maxima and the complete loss of spatial information in the calculation of the joint intensity probability distribution. They integrated additional image features and spatial information into the estimation of the joint intensity histogram, resulting in improved registration outcomes [24]. However, this method has shortcomings such as improper feature selection, susceptibility to noise interference, and limited adaptability to complex image scenes. Feature-based methods identify point features, regional features, and edge features in multimodal image pairs, subsequently establishing correspondences among these features to achieve image registration. SIFT [10] is a well-established technique utilized for feature registration. SUFT introduces an enhanced SIFT algorithm that selects SIFT features across the complete distribution of position and scale. By incorporating the extracted features into the initial cross-matching process, it achieves effective feature matching [25]. SAR-SIFT introduces a SIFT-like algorithm tailored specifically for SAR registration. This algorithm modifies several steps of the original SIFT algorithm to accommodate the unique characteristics of SAR images [26]. However, these improved SIFT-based registration methods often produce poor multimodal remote sensing image registration results, especially in the presence of nonlinear radiometric variations. In order to develop a robust algorithm specifically for multimodal remote sensing image registration, Ye et al. proposed a novel feature descriptor based on the structural properties of images, known as the Histogram of Orientated Phase Congruency (HOPC) [27]. This method utilizes phase congruency rather than intensity or gradient to construct an oriented histogram representation for registration. Ref. [28] proposed a multi-scale histogram of local main orientation (MS-HLMO) method. This approach builds a fundamental feature map known as the partial main orientation map and employs a generalized GLOH-like feature descriptor for local feature extraction. Li et al. proposed a radiation-variation insensitive feature transform (RIFT) to address the sensitivity of multimodal images to nonlinear radiation distortion [29]. This method employs phase congruency for feature point detection and introduces a maximum index map for feature description and rotation invariance.
However, the traditional methods mentioned above have limited feature extraction capabilities and adapt poorly to complex remote sensing image scenes. When facing large-scale rigid deformation, they struggle to achieve good registration results because of their inability to extract sufficient feature points, feature point mismatches, and other issues.

2.2. Based on Learning Registration Method

Deep learning-based methods can effectively solve problems such as insufficient feature point matching in traditional registration methods. These methods have strong robustness and adaptability and can learn feature patterns in different scenarios through training on large amounts of data, making them more adaptable to complex scenes. Currently, multimodal remote sensing image registration methods utilizing deep learning are primarily categorized into two types: end-to-end direct learning of the mapping relationship between multimodal image pairs, and the conversion of multimodal image pairs into mono-modality image pairs prior to registration. Wang et al. proposed an end-to-end mapping between patch pairs of moving and fixed images along with their corresponding matching labels. They also applied transfer learning to reduce the computational cost during the training phase [30]. Ye et al. proposed an unsupervised registration framework that utilizes multiple deep neural networks operating at different scales. The network directly learns the mapping from image pairs to transformation parameters, progressing from coarse to fine [21]. Xiao et al. proposed the ADRNet method, which for the first time combined rigid registration with flow field prediction for multimodal remote sensing image registration [31]. With the advancement of image conversion networks such as Pix2Pix [32] and CycleGAN [33], an increasing number of researchers have begun to apply these techniques to multimodal remote sensing image registration. Du et al. proposed a semi-supervised image-to-image translation framework for SAR and optical images. This framework first converts SAR images into pseudo-optical images and then employs traditional methods to align the optical and pseudo-optical images [34]. Wang et al. proposed a cross-modality-aware style transfer network (CPSTN) to generate a pseudo-infrared image using an optical image as input. This approach transforms the multimodal registration problem into a single-modal registration problem. They subsequently introduced a multi-level refinement registration network (MRRN) to predict the displacement vector field between the distorted and pseudo-infrared images [35].
In addition, several studies have integrated the multimodal remote sensing image registration task with other research areas. Zheng et al. highlighted the importance of registration in the context of hyperspectral super-resolution and proposed a novel unsupervised spectral unmixing and image registration network, termed NonRegSRNet [36]. Tang et al. unified image registration, image fusion, and high-level semantic requirements into a single framework. They developed a symmetrical bidirectional image registration module to effectively achieve multimodal image alignment [37]. Zhou et al. proposed a unified image registration and change detection network (URCNet), which performs image registration and detects change information using a single network [38].
However, most of the aforementioned deep learning-based methods concentrate primarily on image-to-image translation followed by registration, or on utilizing flow field prediction to address nonrigid registration challenges. Some studies also attempt to integrate registration with other tasks, which can leave the registration process itself insufficiently studied. Consequently, we propose a multimodal remote sensing image registration method that specifically targets rigid transformations. For a detailed introduction, please refer to Section 3.

3. Method

3.1. The Overview of Network Framework

The overall network structure of the proposed multimodal remote sensing image registration method is shown in Figure 2. It consists of three parts: dual-branch feature extraction, affine parameter regression, and a spatial transformation network. The two remote sensing images of different modalities pass through the same dual-branch feature extraction network to derive their respective features. The final extracted features are obtained by combining the features, containing both global and local information, extracted by the upper and lower branches. The deep features of the two modalities are concatenated and input into the affine parameter regression network to derive the affine transformation parameters between the two remote sensing images. The predicted affine transformation parameters are input into the spatial transformation network, which applies them to the moving image to be registered, thereby aligning the distorted images.

3.2. Feature Extraction Module

3.2.1. The Upper Branch Feature Extraction

Our upper branch feature extraction module comprises a multi-scale feature extraction (MSFE) [39] component and two Swin Transformers arranged in series. The MSFE conducts varying degrees of downsampling and upsampling operations on the fundamental features of the input image. It then performs operations such as concatenation, element-wise multiplication, and addition of the processed features at different levels, ultimately yielding deep features that integrate multi-scale information.
Specifically, the input image I_M ∈ ℝ^{H×W} first undergoes a basic 3 × 3 convolution for preliminary feature extraction, yielding the primary feature M_1 ∈ ℝ^{H×W} and providing a relatively standardized input for the subsequent multi-scale module. Here, H × W is the size of the image. Then, M_1 is passed through the upper, middle, and lower branches for convolution at different scales. The upper branch has only one convolutional layer and extracts the shallow feature M_2 ∈ ℝ^{H×W} at unchanged scale. The process of the upper branch can be expressed as follows:
M_2 = Conv(M_1).
The middle branch first downsamples M_1 to obtain M_1′ ∈ ℝ^{H/2×W/2}, passes it through two convolutional layers, and applies a residual connection with M_1′ to obtain M_3 ∈ ℝ^{H/2×W/2}. The process of the middle branch can be expressed as follows:
M_1′ = Down(M_1), M_3 = Conv(Conv(M_1′)) + M_1′,
where Down refers to downsampling.
The lower branch downsamples further to obtain M_1″ ∈ ℝ^{H/4×W/4}. This feature then passes through three convolutional layers, and a residual connection with M_1″ yields M_4 ∈ ℝ^{H/4×W/4}. The process of the lower branch can be expressed as follows:
M_1″ = Down(M_1′), M_4 = Conv(Conv(Conv(M_1″))) + M_1″.
At this point, M_3 and M_4 are first upsampled to obtain M_3↑ ∈ ℝ^{H×W} and M_4↑ ∈ ℝ^{H×W}, then concatenated and convolved to obtain M_5. The process is represented as:
M_5 = Conv(Cat(M_3↑, M_4↑)).
This operation fully integrates the features of M_3 and M_4 from two different levels and enriches the subsequent feature representation. Next, M_5, which carries multi-scale information, is element-wise multiplied with M_2. This operation introduces interaction and association between low-level and high-level features, enhancing the semantic information of the features while retaining feature details. Afterwards, in order to make full use of features at different levels, we add the upsampled M_3↑ and M_4↑ to the result of the element-wise multiplication of M_5 and M_2, and then add the original feature M_1 to obtain the feature M_6. M_6 contains information from each intermediate stage, thereby strengthening the overall feature expression. The process is represented as:
M_6 = M_3↑ + M_4↑ + M_5 ⊙ M_2 + M_1.
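For concreteness, the following is a minimal PyTorch sketch of this multi-scale branch structure. It assumes single-channel inputs, 3 × 3 convolutions, average-pool downsampling, and bilinear upsampling; the class and channel choices are illustrative rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class MSFE(nn.Module):
    """Multi-scale feature extraction sketch: branches at full, 1/2 and 1/4 resolution."""
    def __init__(self, c):
        super().__init__()
        self.stem = conv3x3(1, c)                                    # primary feature M1
        self.upper = conv3x3(c, c)                                   # M2, full resolution
        self.mid = nn.Sequential(conv3x3(c, c), conv3x3(c, c))       # M3 branch (1/2 scale)
        self.low = nn.Sequential(conv3x3(c, c), conv3x3(c, c), conv3x3(c, c))  # M4 branch (1/4 scale)
        self.fuse = conv3x3(2 * c, c)                                # Conv(Cat(M3, M4)) -> M5

    def forward(self, img):                                          # img: (B, 1, H, W)
        m1 = self.stem(img)
        m2 = self.upper(m1)
        m1_d2 = F.avg_pool2d(m1, 2)                                  # downsample to H/2 x W/2
        m3 = self.mid(m1_d2) + m1_d2                                 # residual connection
        m1_d4 = F.avg_pool2d(m1_d2, 2)                               # downsample to H/4 x W/4
        m4 = self.low(m1_d4) + m1_d4
        size = m1.shape[-2:]
        m3u = F.interpolate(m3, size=size, mode="bilinear", align_corners=False)
        m4u = F.interpolate(m4, size=size, mode="bilinear", align_corners=False)
        m5 = self.fuse(torch.cat([m3u, m4u], dim=1))
        return m3u + m4u + m5 * m2 + m1                              # M6
```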
However, due to the limited size of the convolutional kernel in convolutional neural networks, each neuron can only perceive local information from the input data. This limitation makes it challenging for the network to capture long-range dependencies and global semantic information when processing large-scale remote sensing image data, thereby affecting subsequent global registration. In recent years, attention mechanisms, particularly those driven by Transformers, have addressed the aforementioned challenges. For complex remote sensing data, Transformers directly compute the relationships between any two positions using the self-attention mechanism. This approach effectively models long-range dependencies, thereby enhancing the understanding of global structure and contextual information. However, the computational complexity of Transformers is quadratic in relation to image size, resulting in a significant computational burden when processing high-resolution images. Therefore, we utilize Swin Transformer in the network. The structure of the Swin Transformer network is illustrated in Figure 3. This network module is a lightweight transformer that relies on local self-attention calculations within a defined window. It restricts the self-attention calculation to a localized window, which more effectively models feature dependencies within the immediate area and enhances the capacity of the model to capture local details. Swin Transformer achieves efficient processing of input features and learning of feature representations by alternately using Windows Multi-head Self-Attention (W-MSA) and Shifted Windows Multi-Head Self-Attention (SW-MSA), combined with layer normalization (LN) and multilayer perceptron (MLP). This is essential for extracting image texture and edge features, which are critical in remote sensing image registration tasks.
Specifically, we input M_6, produced by the multi-scale feature extraction module, into the Swin Transformer module to obtain M_7. We then concatenate the M_7 and M_6 features to fuse feature information at different levels and feed the result into a second Swin Transformer module to obtain the final upper-branch feature M_up. This process removes noise and irrelevant information and highlights the features related to the registration task. The entire process can be expressed as follows:
M_7 = Swin(M_6), M_up = Swin(Cat(M_6, M_7)).
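The core operation of these modules, window-restricted multi-head self-attention (W-MSA), can be sketched as follows in PyTorch. The sketch assumes H and W are divisible by the window size and omits the shifted-window step (SW-MSA) and the MLP sub-block, so it illustrates only the windowing idea, not the full Swin Transformer used in the network.

```python
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    # inverse of window_partition
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowSelfAttention(nn.Module):
    """W-MSA sketch: multi-head self-attention restricted to ws x ws windows."""
    def __init__(self, dim, ws=8, heads=4):
        super().__init__()
        self.ws = ws
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W), H and W divisible by ws
        B, C, H, W = x.shape
        t = x.permute(0, 2, 3, 1)                  # (B, H, W, C)
        win = self.norm(window_partition(t, self.ws))
        out, _ = self.attn(win, win, win)          # attention only inside each window
        out = window_reverse(out, self.ws, H, W)
        return x + out.permute(0, 3, 1, 2)         # residual connection
```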

3.2.2. The Lower Branch Feature Extraction

The lower branch feature extraction module comprises three components: a primary feature extraction module (SBM) that combines strip convolution, BN, and MLP [40,41]; a channel–spatial attention module; and the EMO [42] module. This branch compensates for the limited attention that the multi-scale feature extraction module in the upper branch pays to directional information and extracts edge and texture information of remote sensing images along different directions.
This module enables the network to observe input features from multiple perspectives. The horizontal direction aids in capturing the continuity of texture, the vertical direction emphasizes the edges and contours of objects, and the diagonal directions reveal specific structural patterns. Consequently, a more comprehensive feature representation can be achieved, preventing the loss of information that may occur when extracting features from only a single perspective. In contrast, traditional convolutional networks are typically confined to extracting features in a single or a limited number of directions. Unlike the SBM, they struggle to concurrently capture texture, edge, and structural information from diverse directions such as horizontal, vertical, and diagonal orientations. This limitation frequently results in the omission of crucial features, thereby undermining the accuracy of registration. In addition, this module possesses a comprehensive global understanding of the features, enabling it to better grasp the overall structure and semantics, which is particularly well-suited for rigid registration tasks that require thorough consideration of global information. After extracting features along the four directions from the lower-branch input feature F_1, we concatenate them and feed the result into the BN and MLP layers to obtain F_2. The comprehensive process is expressed as:
F_2 = MLP(BN(SCB(F_1))),
where SCB refers to the strip convolution block with four-directional convolutions.
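A possible PyTorch sketch of the SBM is given below. The horizontal and vertical directions use 1 × k and k × 1 strip convolutions; the two diagonal directions are approximated here by k × k convolutions whose weights are masked to the main and anti-diagonals, which is an assumption since the paper does not specify how the diagonal strips are implemented. Channel counts and the kernel length k are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripConvBlock(nn.Module):
    """SCB sketch: strip convolutions along horizontal, vertical and two diagonal directions."""
    def __init__(self, c, k=7):
        super().__init__()
        self.h = nn.Conv2d(c, c, (1, k), padding=(0, k // 2))   # horizontal strip
        self.v = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0))   # vertical strip
        self.d1 = nn.Conv2d(c, c, k, padding=k // 2)            # weights masked to the main diagonal
        self.d2 = nn.Conv2d(c, c, k, padding=k // 2)            # weights masked to the anti-diagonal
        eye = torch.eye(k)
        self.register_buffer("mask1", eye.reshape(1, 1, k, k))
        self.register_buffer("mask2", torch.flip(eye, dims=[1]).reshape(1, 1, k, k))

    def forward(self, x):
        d1 = F.conv2d(x, self.d1.weight * self.mask1, self.d1.bias, padding=self.d1.padding)
        d2 = F.conv2d(x, self.d2.weight * self.mask2, self.d2.bias, padding=self.d2.padding)
        # concatenate the four directional responses -> 4c channels
        return torch.cat([self.h(x), self.v(x), d1, d2], dim=1)

class SBM(nn.Module):
    """F2 = MLP(BN(SCB(F1))): strip convolutions followed by batch norm and a channel MLP."""
    def __init__(self, c):
        super().__init__()
        self.scb = StripConvBlock(c)
        self.bn = nn.BatchNorm2d(4 * c)
        self.mlp = nn.Sequential(nn.Conv2d(4 * c, 2 * c, 1), nn.GELU(), nn.Conv2d(2 * c, c, 1))

    def forward(self, f1):
        return self.mlp(self.bn(self.scb(f1)))
```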
The feature F_2 has a more stable distribution and further abstracts and transforms the extracted features, learning more advanced and abstract feature representations from the basic features. Then, we input F_2 into the channel attention module and the spatial attention module, respectively. Channel attention performs adaptive feature selection along the channel dimension of F_2, dynamically adjusts the importance of each channel, and improves the expression of channel features. Spatial attention dynamically adjusts the importance of pixels and enhances the feature representation of F_2 by using operations such as dimensionality reduction, nonlinear activation, and dimensionality compression. Next, the outputs of the channel attention and spatial attention modules are concatenated and passed through a final convolutional layer to obtain F_3. The process is represented as follows:
F_3 = Conv(Cat(CA(F_2), SA(F_2))),
where CA and SA represent the channel attention and spatial attention modules, respectively.
Finally, adding F_1 and F_3 yields F_4, which contains complete feature information from the different directions:
F_4 = F_1 + F_3.
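The attention stage can be sketched as follows, assuming a squeeze-and-excitation style channel attention and a CBAM-style 7 × 7 spatial attention; the exact attention designs used by the authors may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA sketch: squeeze-and-excitation style channel re-weighting."""
    def __init__(self, c, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c // r, c, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3), keepdim=True))   # global average pooling per channel
        return x * w

class SpatialAttention(nn.Module):
    """SA sketch: per-pixel weights from pooled channel statistics."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(2, 1, k, padding=k // 2), nn.Sigmoid())

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.conv(s)

class DirectionalAttention(nn.Module):
    """F3 = Conv(Cat(CA(F2), SA(F2)));  F4 = F1 + F3."""
    def __init__(self, c):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c), SpatialAttention()
        self.out = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, f1, f2):
        f3 = self.out(torch.cat([self.ca(f2), self.sa(f2)], dim=1))
        return f1 + f3
```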
However, both the SBM and the attention modules rely on CNNs to extract features from remote sensing images, so they cannot fully extract representative features owing to the inductive bias of static CNNs. This limitation reduces their adaptability to images undergoing transformations such as rotation and scaling. To address this issue, we employ an efficient EMO module that integrates the Inverted Residual Mobile Block (iRMB) with the Transformer architecture, in a ResNet-like design, to further improve feature extraction. The schematic diagram illustrating the structure of the EMO is presented in Figure 4. This module stacks iRMBs over multiple levels; each iRMB consists solely of Depth-Wise Convolution (DW-Conv) and Enhanced Extended Window Multi-Head Self-Attention (EW-MHSA), without any additional complex operators. DW-Conv performs downsampling directly through stride adaptation, eliminating the need for positional embeddings to introduce inductive bias into the MHSA. EW-MHSA is a multi-head self-attention mechanism that effectively models long-range feature interactions.
Overall, this EMO module combines the local modeling capabilities of CNN with the global context modeling capabilities of Transformer while maintaining characteristics of being lightweight and highly efficient. This effectively enhances the remote sensing image features extracted by the lower branch. The features extracted by the final lower branch are represented by the following process:
F_low = EMO(F_4).
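A heavily simplified sketch of the iRMB idea is shown below: token self-attention followed by an inverted residual with a depth-wise convolution. EW-MHSA is replaced here by plain multi-head self-attention over all tokens for brevity, so this should be read only as an outline of the block, not as the EMO implementation used in the network.

```python
import torch.nn as nn

class SimplifiediRMB(nn.Module):
    """Inverted residual mobile block sketch: MHSA + depth-wise conv with residual connections.
    Assumes the channel count c is divisible by the number of heads."""
    def __init__(self, c, expand=2, heads=4):
        super().__init__()
        ce = c * expand
        self.norm = nn.LayerNorm(c)
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)  # stand-in for EW-MHSA
        self.expand = nn.Conv2d(c, ce, 1)                              # point-wise expansion
        self.dw = nn.Conv2d(ce, ce, 3, padding=1, groups=ce)           # depth-wise convolution
        self.project = nn.Conv2d(ce, c, 1)                             # point-wise projection
        self.act = nn.GELU()

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)           # tokens (B, H*W, C)
        a, _ = self.attn(self.norm(t), self.norm(t), self.norm(t))
        x = x + a.transpose(1, 2).reshape(B, C, H, W)
        return x + self.project(self.act(self.dw(self.expand(x))))
```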
At this stage, we have the final features M_up, extracted by the upper branch, and F_low, extracted by the lower branch. We combine these features to obtain the final feature F_M^dual, which incorporates advanced semantic and texture information. This process can be expressed as follows:
F_M^dual = F_low + M_up.
In summary, we conduct dual-branch feature extraction on two distinct modalities of remote sensing images separately. We then concatenate the extracted features into the subsequent affine transformation module to predict the affine transformation parameters.

3.3. Affine Parameter Regression Module

We concatenate the feature maps obtained from the dual-branch feature extraction of both the moving and fixed images. These concatenated feature maps are then input into the affine parameter regression network to learn the parameters related to the rigid registration task, including rotation, translation, and scaling.
Specifically, we pass the fixed image I_F and the moving image I_M through the dual-branch feature extraction network to obtain their respective deep features Z_F and Z_M. For the convenience of subsequent representation, we define the following equation:
Z_F, Z_M = DBFE(I_F), DBFE(I_M),
where DBFE is our proposed dual-branch feature extraction module.
The concatenated features are further optimized through the ResNet34 [43] network. The final output is generated using a fully connected layer to derive six affine transformation parameters. The overall process of the regression network is represented as follows:
[θ_11, θ_12, θ_13, θ_21, θ_22, θ_23] = FC(ResNet(Cat(Z_F, Z_M))),
where FC represents a fully connected layer.
At this point, we obtain six predicted affine transformation parameters, namely translation, scaling, and rotation parameters in two directions. Since affine transformation is a linear transformation in a two-dimensional plane, it manifests as a combination of translation, rotation, and scaling. Therefore, we reshape the output of the fully connected layer into a 2 × 3 affine transformation matrix to complete the subsequent affine transformation of the image. This overall process can be represented by the following equation:
ϕ = reshape([θ_11, θ_12, θ_13, θ_21, θ_22, θ_23]),
where reshape represents the reshaping operation. The main function of the affine transformation matrix ϕ is to generate a sampling grid, which is then input into the subsequent spatial transformation network.
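The regression head can be sketched as follows in PyTorch, assuming a torchvision ResNet34 trunk whose first convolution is adapted to the concatenated feature channels and whose classifier is replaced by a 6-way fully connected layer. Initializing the head to the identity transform is a common STN practice added here for illustration; it is not stated in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class AffineRegression(nn.Module):
    """Sketch: ResNet34 trunk over concatenated features, FC -> 6 affine parameters, reshape to 2x3."""
    def __init__(self, feat_channels):
        super().__init__()
        trunk = resnet34(weights=None)
        # adapt the stem to the concatenated feature maps (2 * feat_channels input channels)
        trunk.conv1 = nn.Conv2d(2 * feat_channels, 64, 7, stride=2, padding=3, bias=False)
        trunk.fc = nn.Linear(trunk.fc.in_features, 6)
        # start from the identity transform (common STN initialization, assumed here)
        nn.init.zeros_(trunk.fc.weight)
        trunk.fc.bias.data = torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
        self.trunk = trunk

    def forward(self, z_fixed, z_moving):
        theta = self.trunk(torch.cat([z_fixed, z_moving], dim=1))   # (B, 6)
        return theta.view(-1, 2, 3)                                 # 2x3 affine matrix phi
```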
In the affine parameter regression network, we design convolution kernels of varying sizes to operate in parallel. This module is called a parallel convolution block (PCB). This approach combines the fine texture features extracted by smaller convolution kernels with the macroscopic structural features obtained from larger kernels, resulting in richer and more comprehensive feature representations. Consequently, this enhances the accuracy of affine parameter regression. Additionally, by fusing features of different scales and types, the parallel convolution blocks increase the robustness of the regression network against noise, deformation, and changes in illumination within the image. The features extracted by different convolution kernels can complement one another to a certain extent, thereby mitigating the limitations associated with relying on a single feature type.
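A minimal sketch of such a parallel convolution block (PCB) is shown below, using the 1 × 1, 3 × 3, and 5 × 5 kernels mentioned in the ablation study; the fusion by concatenation and the channel split are assumptions.

```python
import torch
import torch.nn as nn

class ParallelConvBlock(nn.Module):
    """PCB sketch: 1x1, 3x3 and 5x5 convolutions applied in parallel and fused by concatenation."""
    def __init__(self, c_in, c_out):
        super().__init__()
        branch = c_out // 3
        self.b1 = nn.Conv2d(c_in, branch, 1)                          # channel-selective features
        self.b3 = nn.Conv2d(c_in, branch, 3, padding=1)               # local spatial structure
        self.b5 = nn.Conv2d(c_in, c_out - 2 * branch, 5, padding=2)   # wider spatial context
        self.fuse = nn.Sequential(nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.fuse(torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1))
```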

3.4. Spatial Transformer Network (STN)

The STN [22] is a learnable module integrated into traditional convolutional neural networks. It explicitly performs spatial transformations on the input image, enhancing the network’s adaptability to geometric deformations. The process of the STN distorting moving images is illustrated in Figure 5. It can automatically correct images using learned parameters and is particularly suitable for remote sensing image registration involving large-scale rigid transformations.
After obtaining the affine transformation matrix from the affine parameter regression network, we compute the sampling coordinates of each pixel in the moving image based on ϕ. A sampling grid of the same size as the output image is then generated, recording the corresponding coordinates of each output pixel in the moving image. This process can be expressed as follows:
Grid = Γ_ϕ(G),
where Γ_ϕ(G) denotes converting the affine transformation matrix ϕ into a sampling grid. Grid records the corresponding coordinate position in the moving image of each pixel of the output image.
Afterwards, the STN applies this sampling grid to resample the moving image to be registered, thus completing the registration. This process can be expressed as follows:
I_mov^reg = Sampler(I_mov, Grid),
where Grid is the sampling grid used for resampling, I_mov is the moving image, and Sampler denotes the resampling operation.
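In PyTorch, this sampling step corresponds directly to the affine_grid and grid_sample functions; a minimal sketch (assuming bilinear resampling and zero padding) is:

```python
import torch.nn.functional as F

def warp_moving_image(moving, phi):
    """Apply the predicted 2x3 affine matrix phi to the moving image via an STN.

    moving: (B, C, H, W) tensor; phi: (B, 2, 3) affine parameters.
    """
    grid = F.affine_grid(phi, moving.shape, align_corners=False)          # sampling grid Gamma_phi(G)
    return F.grid_sample(moving, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)       # resampled (registered) image
```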

3.5. Loss Function

A well-defined loss function is essential for effectively guiding the registration network. We employ L1 loss [44] to quantify the discrepancy between the affine transformation parameters predicted by the network and the actual parameters, thereby facilitating our network optimization. The formulation of this loss function is as follows:
L_1 = (1/N) Σ_N ‖ϕ_pre − ϕ_gt‖_1,
where N represents the number of training samples, and ϕ_pre and ϕ_gt denote the predicted affine transformation parameters and the ground-truth values, respectively.
At the same time, Ref. [31] developed a symmetric loss function based on inverse consistency, which employs the concept of bidirectional constraints and has been demonstrated to be effective in the field of registration. We therefore add this symmetric loss to our rigid registration network to perform bidirectional registration, thereby reducing registration deviations in both the forward and reverse directions. From the final predicted affine parameters, our network derives a forward affine transformation matrix H_1 and an inverse affine transformation matrix H_2, where H_1 is the extension of the predicted affine transformation matrix ϕ_pre. The closer the product of H_1 and H_2 is to the identity matrix, the more effectively the bidirectional transformation parameters cancel each other out, indicating registration consistency. Therefore, we introduce a bidirectional symmetric loss to quantify this consistency, and its expression is as follows:
L_sym^aff = Σ_{i,j}^{3,3} ‖(H_1 ⊗ H_2)(i,j) − E(i,j)‖_2,
where E represents the identity matrix and ⊗ denotes matrix multiplication.
The first two loss functions constrain the registration network by optimizing the affine regression parameters. However, in the actual rigid registration of multimodal remote sensing images, solely learning the affine transformation parameters may overlook the pixel differences between the registered image and the reference image. The Normalized MI (NMI) loss [45] can directly assess the degree of alignment between the registered image produced by the network and the ground truth. The loss function is defined as follows:
L_nmi = [ Σ_a P_A(a) log P_A(a) + Σ_b P_B(b) log P_B(b) ] / [ Σ_{a,b} P_AB(a,b) log P_AB(a,b) ],
where A and B represent the registered image and the ground truth, respectively. P_AB(a,b) represents the joint probability distribution of the two images, and P_A(a) and P_B(b) are their marginal probability distributions. The better the two images are aligned, the larger this similarity measure becomes.
Combining the three aforementioned loss functions, the expression for the total loss function in affine transformation rigid registration is as follows:
L_aff = α L_1 + β L_sym^aff + γ L_nmi,
where α, β, and γ are weight hyperparameters.
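The following sketch illustrates how the three terms could be combined in PyTorch. The histogram-based MI term uses hard binning and is therefore only a stand-in for a differentiable NMI estimator; the weight values follow the defaults given later in Section 4.2, and all function names are illustrative.

```python
import torch

def affine_l1_loss(phi_pre, phi_gt):
    """L1 between predicted and ground-truth 2x3 affine parameters."""
    return (phi_pre - phi_gt).abs().mean()

def symmetric_loss(h1, h2):
    """Bidirectional consistency: the product of the forward and inverse 3x3 matrices
    should be close to the identity."""
    eye = torch.eye(3, device=h1.device).expand_as(h1)
    return ((torch.bmm(h1, h2) - eye) ** 2).sum(dim=(1, 2)).mean()

def nmi_surrogate(a, b, bins=32):
    """Hard-binned normalized mutual information between registered image a and reference b,
    returned with a negative sign so that minimizing it encourages alignment. A soft-binned
    estimator would be needed for gradients to flow through this term."""
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    b = (b - b.min()) / (b.max() - b.min() + 1e-8)
    ia = (a.flatten() * (bins - 1)).long()
    ib = (b.flatten() * (bins - 1)).long()
    joint = torch.zeros(bins, bins, device=a.device)
    joint.index_put_((ia, ib), torch.ones_like(ia, dtype=joint.dtype), accumulate=True)
    p = joint / joint.sum()
    pa, pb = p.sum(dim=1), p.sum(dim=0)
    ha = -(pa[pa > 0] * pa[pa > 0].log()).sum()     # H(A)
    hb = -(pb[pb > 0] * pb[pb > 0].log()).sum()     # H(B)
    hab = -(p[p > 0] * p[p > 0].log()).sum()        # H(A, B)
    return -(ha + hb) / (hab + 1e-8)                # negative NMI

def total_affine_loss(phi_pre, phi_gt, h1, h2, registered, reference,
                      alpha=10.0, beta=1.0, gamma=1.0):
    """L_aff = alpha * L1 + beta * L_sym + gamma * L_nmi (weights as reported in Section 4.2)."""
    return (alpha * affine_l1_loss(phi_pre, phi_gt)
            + beta * symmetric_loss(h1, h2)
            + gamma * nmi_surrogate(registered, reference))
```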

4. Experiments

In this section, we first introduce the multimodal remote sensing image datasets utilized in the experiments. Next, we elaborate on the experimental details. Following that, we present the evaluation metrics employed to quantitatively assess our experiments. Finally, we perform comparative experiments and analyses against six advanced registration methods on remote sensing datasets of three different modality combinations. Additionally, we conduct ablation experiments to verify the impact of the proposed modules on registration performance.

4.1. Dataset

4.1.1. PAN-MS

Panchromatic (PAN) images offer high spatial resolution, capturing fine details and texture information. In contrast, multispectral (MS) images provide rich spectral information, aiding in the differentiation of land cover types. Registering PAN and MS images ensures spatial consistency, thereby enhancing the quality of fused images in subsequent pansharpening processes. The PAN and MS datasets consist of cropped images obtained from three remote sensing satellites: Gaofen-1, WorldView-2, and WorldView-4 [46]. The ground sampling distance (GSD) of the PAN images is 0.5 m/pixel, while the GSD of the MS images is 2 m/pixel. The typical view zenith angle of the images is about 30°. The MS images in this dataset have a resolution of 256 × 256 pixels, while the panchromatic images possess a higher resolution of 1024 × 1024 pixels. During the experiments, we treat the MS images as moving images and the PAN images as fixed images for registration (and vice versa). For this dataset, we first downsample the PAN images to a size of 256 × 256 pixels for subsequent registration. We randomly select 1104 pairs of images for training and 148 pairs for testing.

4.1.2. IR-OPT

Infrared (IR) images are sensitive to temperature differences and can show the thermal radiation characteristics of objects, while optical (OPT) images can provide rich texture, color, and other detailed information. By registering them, the location of ground targets can be determined more accurately. Therefore, IR and OPT datasets are selected for multimodal remote sensing image registration during the experiment [21]. The dataset covers the Chengdu Plain and the surrounding hills and mountains, with an image resolution of 30 m. All image pairs have undergone geographic correction to ensure complete affine alignment. During the training process, we apply a random affine transformation to one of the modalities. We use a total of 1325 pairs of IR and OPT images for training and 200 pairs for testing.

4.1.3. SAR-OPT

SAR images are immune to the constraints of weather and lighting conditions, enabling the acquisition of ground information around the clock. When registered with OPT images, they provide strong multi-source information complementarity, furnishing more comprehensive and precise information support, along with more effective solutions, for challenges such as disaster monitoring. We utilize SAR and OPT image pairs from the SEN1-2 dataset for both training and testing. The GSD of the SAR images is about 5 m/pixel and that of the optical images is 10 m/pixel. The typical view zenith angle is about 25° to 30°. The dataset encompasses a variety of scenarios, including urban areas, agricultural land, rivers, and forests. For training we employ a total of 1300 images, while 300 images are designated for testing. The test dataset represents a diverse range of environments, covering both rural and urban areas.

4.2. Implementation Details

During the training process, we apply a large-scale random affine transformation to one of the modalities, which includes rotation within the range of [−45°, 45°], translation within [−20, 20] pixels, and scaling within the range of [0.8, 1.2]. We simultaneously save the transformed parameters as ground truth to constrain the subsequent loss function. During the testing phase, images subjected to various rigid transformations are directly utilized to assess the registration performance of our network.
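A sketch of this training-time simulation is given below, assuming the affine_grid normalized-coordinate convention (so pixel translations are rescaled by half the image size) and one random transform per batch; the exact parameterization used by the authors is not specified.

```python
import math
import random
import torch
import torch.nn.functional as F

def random_affine_pair(image):
    """Simulate a training pair: draw rotation in [-45 deg, 45 deg], translation in [-20, 20] px
    and scale in [0.8, 1.2], warp the image, and keep the 2x3 matrix as ground truth."""
    B, C, H, W = image.shape
    ang = math.radians(random.uniform(-45, 45))
    scale = random.uniform(0.8, 1.2)
    tx = random.uniform(-20, 20) / (W / 2)          # convert pixel shift to normalized units
    ty = random.uniform(-20, 20) / (H / 2)
    theta = torch.tensor([[scale * math.cos(ang), -scale * math.sin(ang), tx],
                          [scale * math.sin(ang),  scale * math.cos(ang), ty]],
                         dtype=image.dtype, device=image.device)
    theta = theta.unsqueeze(0).repeat(B, 1, 1)      # same transform for the whole batch (sketch)
    grid = F.affine_grid(theta, image.shape, align_corners=False)
    warped = F.grid_sample(image, grid, align_corners=False)
    return warped, theta                            # warped moving image and ground-truth phi_gt
```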
When conducting comparative and ablation experiments to assess the effectiveness of our network registration, all experimental methods are executed on the same device. In addition, all methods utilize the same training and testing datasets for the experiments.
Our framework is developed using PyTorch 2.0.0, and the experiments described in this paper are conducted on a computer equipped with an Intel (manufactured by Intel Corporation, Santa Clara, CA, USA) Core i5-12400F CPU and an NVIDIA (manufactured by NVIDIA Corporation, Santa Clara, CA, USA) GeForce RTX 4090 GPU. Based on previous work experience, we assign the weights α, β, and γ of the loss function as 10, 1, and 1, respectively. The learning rate is set to 1 × 10⁻⁴, and the batch size is set to 8. The Adam optimizer is employed for optimization. Our model is trained for a total of 300 iterations.

4.3. Evaluation Metrics

The RE [37] directly compares the registered result with the ground truth by measuring the distance between the reprojected landmark after registration and its corresponding ground truth. The expression is as follows:
RE = (1/N) Σ_{x=1}^{N} |P_x^pre − P_x^gt|,
where P_x^pre and P_x^gt denote the reprojected position of the x-th point in the registered result and its corresponding ground-truth position, respectively. We use a mask so that the RE is calculated only over the common area of the two images.
Mutual information (MI) is used to measure the statistical correlation between two images [47]. The larger its value, the better the registration effect. The expression of MI is as follows:
MI = Σ_{i,j} P_{I1,I2}(i,j) log [ P_{I1,I2}(i,j) / (P_{I1}(i) P_{I2}(j)) ],
where P_{I1}(i) and P_{I2}(j) are the marginal probability distributions of the grayscale values i and j of image 1 and image 2, respectively, and P_{I1,I2}(i,j) is their joint probability distribution.
Normalized Cross-Correlation (NCC) measures the similarity between two images by calculating the correlation between them at different positions. In the quantitative evaluation of registration, NCC is positively correlated with the similarity between the registered image and the true image [48]. The larger the NCC, the better the registration effect.
RMSE is the difference between the predicted affine transformation parameters and the actual affine transformation parameters. We calculate the displacement between the predicted coordinates of the four corner points and the ground-truth coordinates of these four corner points [49]. The formula for RMSE is as follows:
RMSE_4cor = (1/n) Σ_{i=1}^{n} sqrt( (1/4) Σ_{j=1}^{4} [ (x_j^pre − x_j^gt)² + (y_j^pre − y_j^gt)² ] ),
where n represents the number of image pairs, (x_j^pre, y_j^pre) are the coordinates of the j-th corner point under the affine parameters predicted by the regression network, and (x_j^gt, y_j^gt) are the coordinates of that corner under the ground-truth affine parameters used to distort the image.
We use the above four indicators to quantitatively evaluate the effectiveness of the proposed registration method.
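As an illustration, the corner-based RMSE and NCC can be computed as follows in PyTorch, assuming the 2 × 3 affine matrices act on affine_grid normalized coordinates; the remaining metrics (RE and MI) follow their definitions above in the same way.

```python
import torch

def corner_rmse(phi_pre, phi_gt, H, W):
    """RMSE_4cor sketch: displacement of the four image corners under the predicted vs.
    ground-truth 2x3 affine matrices (assumes the normalized-grid convention)."""
    corners = torch.tensor([[-1.0, -1.0, 1.0], [-1.0, 1.0, 1.0],
                            [1.0, -1.0, 1.0], [1.0, 1.0, 1.0]],
                           device=phi_pre.device)                     # homogeneous corners (4, 3)
    scale = torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0], device=phi_pre.device)  # to pixel units
    p_pre = torch.einsum("bij,kj->bki", phi_pre, corners) * scale     # (B, 4, 2)
    p_gt = torch.einsum("bij,kj->bki", phi_gt, corners) * scale
    per_pair = ((p_pre - p_gt) ** 2).sum(-1).mean(-1).sqrt()          # one value per image pair
    return per_pair.mean()

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between a registered image and the reference."""
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return (a * b).mean()
```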

4.4. Experimental Results and Analysis

In this section, we analyze the registration results from both qualitative and quantitative perspectives and compare them with the most advanced registration methods currently available. We compare our method with three state-of-the-art traditional registration methods (SIFT [10], RIFT [29], TWMM [50]) and three recent deep learning-based registration methods (TransMorph [51], SuperFusion [37], ADRNet [31]). We validate our proposed method using pairs of PAN-MS, SAR-OPT, and IR-OPT images that are subjected to large-scale rigid transformations to further evaluate its registration performance.

4.4.1. Qualitative Comparisons

Figure 6 presents a comparison of our method with other advanced techniques. The qualitative comparison chart is presented in a chessboard format, which clearly illustrates the staggered distribution of two distinct modal regions. This qualitative analysis effectively allows for the assessment of whether the registration results are accurately aligned with the corresponding modal images in both global and local contexts.
As illustrated in Figure 6, RIFT exhibits a global mismatch phenomenon during the registration process. This is because RIFT is a registration method based on feature matching. It is difficult to extract sufficient point features or perform incorrect feature matching when facing heavy affine transformations in multimodal images. This results in unsatisfactory registration results in all three datasets. TWMM is a reliable registration method for OPT images captured by unmanned aerial vehicles equipped with thermal IR and dual cameras. Figure 6 indicates that there is a notable registration effect on IR-OPT images. However, there are numerous areas of misalignment in the registration of PAN-MS and SAR-OPT images.
TransMorph is a method in the medical field that integrates convolutional neural networks and Transformers. However, its registration performance is suboptimal when applied to remote sensing images that exhibit large-scale rigid distortions. This clearly indicates that directly applying techniques from medical image registration to remote sensing image registration is not feasible. SuperFusion is a method that integrates tasks such as registration, fusion, and segmentation within a single framework for IR and OPT datasets. Although SuperFusion achieves a certain level of registration on the distorted moving image, it does not attain perfect alignment between the two images overall. This highlights the limitations of registration methods like SuperFusion, which predict the flow field and subsequently apply it to the moving image through the STN resampling process, particularly in nonrigid registration tasks. During our experiments, we observe that both the TransMorph and SuperFusion methods exhibit some degree of registration effectiveness for smaller rigid transformations. However, when confronted with the prevalent large-scale rigid transformations encountered in real-world scenarios, such as those simulated in this experiment, these methods demonstrate inadequate registration performance. ADRNet is currently the most advanced multimodal remote sensing image registration method. It demonstrates effective registration, with both global and local regions well-aligned. However, it employs a two-stage approach that integrates affine transformation with flow field prediction. The second stage flow field prediction network fine-tunes the first stage affine registration, thereby enhancing overall registration performance. Nonetheless, this approach inevitably increases the number of parameters in the network, which can impose a significant burden on computing resources. Please refer to Section 4.4.3 for specific details.
In contrast, our method achieves precise registration through a single-stage affine registration network. Both global registration and local alignment are successfully performed on the three datasets. The experimental results fully demonstrate that our method still maintains good registration results in the face of large-scale rigid transformations in images.

4.4.2. Quantitative Comparisons

Table 1, Table 2 and Table 3 present the quantitative indicators for various methods applied to the PAN-MS, IR-OPT, and SAR-OPT datasets, which involve large-scale rigid transformations. During the experiment, we discover that traditional SIFT, RIFT, and TWMM methods frequently yield abnormal indicators. This may be attributed to the inherent noise present in multimodal remote sensing images, which can lead to inaccuracies in the feature points they extract. In regions with similar textures, these methods generate a substantial number of analogous feature points, resulting in diminished differentiation among them. Consequently, when matching these feature points, mismatches are likely to occur, adversely affecting the calculation of subsequent affine transformation parameters and leading to global mismatches. The results presented in the table represent the average quantitative indicators calculated after we eliminate the clearly erroneous abnormal indicators. All quantitative calculations involve only the common areas of the two modality images.
It is evident from Table 1, Table 2 and Table 3 that traditional methods have significant limitations when it comes to multimodal remote sensing image registration. The RE values for TransMorph and SuperFusion demonstrate that they are ineffective in producing the final registered image, resulting in a significant discrepancy from the ground truth. Since these methods cannot predict the parameters of affine transformations, we do not include the RMSE index for these two approaches. Their flow field prediction has a certain registration effect on nonrigid transformations but performs poorly when facing large-scale rigid transformation tasks. In contrast, the indicators from ADRNet demonstrate its superiority in the realm of multimodal remote sensing image registration. The ADRNet_aff entries in Table 1, Table 2 and Table 3 represent ADRNet using only its first-stage network. The final performance metrics of ADRNet are slightly better than those of our method when registering the PAN-MS dataset. However, this advantage is contingent upon its use of the flow field predicted in the second stage to fine-tune the affine transformation registration. When considering only the affine registration metrics, our method significantly outperforms ADRNet, and the RMSE indicates that our method is much more accurate than ADRNet in predicting affine parameters. The SAR-OPT and IR-OPT datasets are the most difficult for multimodal remote sensing image registration due to their significant modal differences. Nevertheless, our method outperforms the ADRNet two-stage registration network when registering the IR-OPT and SAR-OPT datasets, and its quantitative metrics are the best among all the methods. For the IR-OPT dataset, our method achieves the best performance in RE, MI, and NCC, demonstrating the highest similarity between the registered images and the ground truth. Additionally, it achieves the lowest RMSE value, further confirming the accuracy of the predicted affine transformation matrix. In the SAR-OPT dataset experiment, our approach reduces RE by approximately 2.95% and RMSE by 22.53% compared to the suboptimal ADRNet, fully highlighting the significant quantitative advantages of our method. This clearly shows that our method achieves excellent registration results while saving computational resources.

4.4.3. Further Result Analysis

Figure 7, Figure 8 and Figure 9 illustrate the registration results of our method across three datasets. These results demonstrate the effectiveness of our approach in registering different rigid transformations within various remote sensing image regions. We evaluated diverse areas, including hills, rivers, and buildings, to assess the robustness of our method in handling a range of remote sensing images. As shown in Figure 7, Figure 8 and Figure 9, our registration performance remains robust across various terrains and image modality types.
We apply varying degrees of rigid transformations to multimodal remote sensing images, including translation, scaling, and rotation. This is performed to further validate the robustness of our method under different rigid transformations. The results indicate that our method achieves commendable registration performance across various translation and rotation transformations. Even when confronted with complex distortions, our network consistently delivers effective registration results. Notably, for pairs of SAR-OPT images, our network achieves near-perfect registration. This strongly demonstrates the robustness of our model. Figure 10 shows the comparison results between the proposed method and advanced registration methods in handling slight rigid deformation tasks. It uses a denser checkerboard pattern, effectively highlighting whether the local areas of the observed image achieve precise alignment. From the comparison results, it can be seen that our method not only performs well in the face of large-scale rigid transformations, but is also suitable for simple deformation tasks, and still achieves competitive results in processing small deformation multimodal remote sensing images.
At the same time, multimodal remote sensing image registration methods have real-time requirements. We evaluate the complexity of the proposed model in terms of the number of network parameters and the time required for registration. All methods are executed on an RTX3060 GPU. The results regarding efficiency and parameters are presented in Table 4. The findings indicate that deep learning-based methods are generally faster than traditional approaches. Although our model has a relatively large number of parameters, it demonstrates significantly higher registration accuracy than the other methods. Compared to the advanced ADRNet, we achieve superior registration performance while utilizing far fewer network parameters. Overall, our method successfully balances real-time performance with registration accuracy.

4.5. Ablation Study

To verify the effectiveness of each module proposed in the network, we conduct ablation experiments on the SAR-OPT dataset. The results of these ablation experiments are presented in Table 5. RE (SAR/OPT) refers to treating SAR and OPT as moving images separately during the registration process, thereby demonstrating the effectiveness of the proposed method in both registration directions. As shown in Table 5, the dual-branch feature extraction network significantly enhances registration accuracy. Notably, we observe during the experiments that removing the Swin Transformer or PCB module increases the training speed of the network. However, this comes at the cost of a substantial reduction in registration performance. This finding underscores the importance of the Swin Transformer and PCB module in our architecture. The Swin Transformer facilitates efficient feature interaction within local windows while enabling information transfer between different windows through sliding window operations, thus achieving a fusion of local and global features. This capability is crucial for extracting diverse features from remote sensing images. The EMO module compensates for the insufficient extraction of global and local features in the lower branches of the network. Local features are instrumental in matching the details of ground objects within the image, while global features help correct overall geometric deformations caused by factors such as terrain undulations, ultimately leading to a significant improvement in the final registration results.
The multi-scale feature extraction (MSFE) module effectively captures both large-scale features, such as macro-level terrain and landforms, and micro-level details, including the edges of buildings and the texture of roads. This capability is particularly crucial for multimodal remote sensing image registration. As the quantitative ablation results show, registration performance is significantly diminished when this module is removed.
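A minimal sketch of the idea behind the MSFE module is given below: several convolution branches with different receptive fields run in parallel, and their outputs are concatenated and fused. The specific kernel sizes (3/5/7), the channel widths, and the class name `MultiScaleConv` are assumptions for illustration and do not reproduce the exact configuration of our module.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 branches whose outputs are concatenated and fused."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)  # channel fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate coarse and fine receptive fields, then fuse channels.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```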
Table 5 demonstrates that the SBM module also significantly enhances the registration performance of the network by effectively capturing the spatial features of remote sensing images from various directions. This is particularly important for the SAR-OPT dataset, where different ground objects exhibit distinct characteristics depending on the viewing angle. For linear ground objects in remote sensing images, such as rivers and roads, horizontal and vertical strip convolutions are more effective at extracting shape and directional features, while diagonal convolutions are particularly useful for delineating the boundaries of obliquely oriented ground objects and for capturing obliquely distributed texture features. Our integrated SBM module yields richer feature representations than single-direction convolution, facilitating a better distinction of feature differences among the same ground objects across different modalities and providing more matching cues for subsequent registration. Combining BN and MLP further enhances the feature representation capability of the registration network.
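The following sketch illustrates the core of a strip convolution block: a horizontal 1 × k and a vertical k × 1 convolution capture long-range context along their respective directions, and the fused result is refined by BN and a small channel MLP with a residual connection. Diagonal strips are omitted here for brevity; the strip length k = 9, the MLP expansion ratio, and the class name `StripConvBlock` are illustrative assumptions rather than our exact SBM configuration.

```python
import torch
import torch.nn as nn

class StripConvBlock(nn.Module):
    """Horizontal (1xk) and vertical (kx1) strip convolutions fused by a 1x1 conv;
    BN and a small channel MLP refine the directional features."""
    def __init__(self, ch: int, k: int = 9):
        super().__init__()
        self.horizontal = nn.Conv2d(ch, ch, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(ch, ch, (k, 1), padding=(k // 2, 0))
        self.fuse = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.BatchNorm2d(ch))
        self.mlp = nn.Sequential(nn.Conv2d(ch, 4 * ch, 1), nn.GELU(),
                                 nn.Conv2d(4 * ch, ch, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gather long-range context along rows and columns, then fuse directions.
        y = self.fuse(torch.cat([self.horizontal(x), self.vertical(x)], dim=1))
        return x + self.mlp(y)  # residual connection preserves the input features
```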
In addition, the PCB significantly enhances feature extraction. When predicting the affine transformation parameters, various geometric deformation factors of the image must be considered. Using convolution kernels of different sizes in parallel allows a comprehensive representation of the spatial structure and deformation of the image: features generated by 1 × 1 convolutions exhibit strong channel selectivity, features produced by 3 × 3 convolutions emphasize local spatial structures, and features derived from 5 × 5 convolutions capture spatial relationships over a broader range. These complementary features support the subsequent prediction of the affine transformation parameters, ultimately yielding more accurate and robust registration outcomes.
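A hedged sketch of the parallel-kernel idea in the affine regression head is shown below: 1 × 1, 3 × 3, and 5 × 5 convolutions are applied in parallel, their concatenated features are pooled, and a linear layer predicts the six affine parameters. The intermediate channel width, the identity-transform initialization, and the class name `ParallelConvRegression` are illustrative choices, not the exact regression network used in the paper.

```python
import torch
import torch.nn as nn

class ParallelConvRegression(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions whose complementary features are
    pooled and mapped to the six affine transformation parameters."""
    def __init__(self, in_ch: int, mid_ch: int = 64):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, mid_ch, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(3 * mid_ch, 6))
        # Start from the identity transform for stable early training.
        nn.init.zeros_(self.head[-1].weight)
        self.head[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([c(feat) for c in self.convs], dim=1)
        return self.head(fused).view(-1, 2, 3)   # (B, 2, 3) affine matrix
```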

5. Discussion

The proposed dual-branch network, which integrates strip convolution and the Swin Transformer, effectively addresses the challenge of rigid registration of multimodal remote sensing images. The network capitalizes on the multi-scale characteristics of remote sensing images, employs the Swin Transformer to capture long-range dependencies within the images, and uses the SBM to enhance the representation of directional features. It not only overcomes large-scale rigid deformations between multimodal remote sensing images but also handles the registration of slight rigid deformations.
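Once the affine parameters are regressed, the spatial transformation step itself is standard. The sketch below shows how a predicted (B, 2, 3) affine matrix can be applied to the moving image with PyTorch's differentiable affine_grid/grid_sample operators, as in a conventional spatial transformer network; the interpolation mode and padding are illustrative defaults rather than our exact settings.

```python
import torch
import torch.nn.functional as F

def warp_affine(moving: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Resample the moving image with a predicted (B, 2, 3) affine matrix,
    as in a standard spatial transformer network."""
    grid = F.affine_grid(theta, moving.shape, align_corners=False)
    return F.grid_sample(moving, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)
```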
However, the network still has some limitations. Although the Swin Transformer significantly reduces computational complexity and resource requirements compared with a standard Transformer, it remains limited when processing large-scale input images. Because the Swin Transformer divides the image into multiple local windows for self-attention computation, it is difficult to effectively capture long-distance, cross-window dependencies in large-scale image scenes. In theory, increasing the number of windows or enlarging the window size can alleviate this problem to a certain extent, but hardware constraints prevented us from adopting these countermeasures. Looking forward, we plan to design a lightweight Transformer framework specifically for remote sensing image registration. In addition, the proposed method requires fully aligned multimodal remote sensing image pairs for training, which are difficult to obtain in practical applications. Moreover, our network is a general framework designed for rigid registration of multimodal remote sensing images, whereas real images are often affected by nonrigid distortions such as local deformations. In future work, we will study more effective and efficient network frameworks for nonrigid registration tasks and strive to design an unsupervised network that does not require large amounts of labeled data.

6. Conclusions

This article presents a dual-branch network for multimodal remote sensing image registration. The network comprises a dual-branch feature extraction module, an affine parameter regression network, and a spatial transformation network. In the upper branch of the feature extraction network, we integrate multi-scale features with the Swin Transformer to comprehensively extract diverse land features from remote sensing images, emphasizing the importance of capturing both local details and global contextual information in multimodal images. In the lower branch, we introduce a strip convolution block and combine it with BN and an MLP to form the SBM module, which extracts more comprehensive features from various directions, adapts to complex terrain distributions, and better captures the intricate semantic relationships between different terrains. In addition, we incorporate a parallel connection of convolutional kernels of varying sizes in the affine parameter regression network, further improving the accuracy with which the network learns the affine transformation parameters. Extensive experiments on three multimodal datasets with large-scale rigid transformations validate the performance and efficiency of our method for multimodal remote sensing image registration.

Author Contributions

K.M.: conceptualization, methodology, formal analysis, writing—original draft. W.W.: methodology, software, validation, writing—review and editing, funding acquisition. H.L.: supervision, funding acquisition. L.L.: validation, funding acquisition. S.Z.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [grant numbers 62376214, 92270117, U2034209]; the Natural Science Basic Research Program of Shaanxi [grant number 2023-JC-YB-533]; and Qin Chuangyuan ‘Scientists + Engineers’ Team Building [grant number 2024QCY-KXJ-160].

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to thank the authors of ADRNet, SuperFusion, and other works for making their code openly available, and the authors of MUNet for the publicly available dataset.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Jiang, X.; Ma, J.; Xiao, G.; Shao, Z.; Guo, X. A review of multimodal image matching: Methods and applications. Inf. Fusion 2021, 73, 22–71. [Google Scholar] [CrossRef]
  2. Hu, C.; Zhu, R.; Sun, X.; Li, X.; Xiang, D. Optical and SAR Image Registration Based on Pseudo-SAR Image Generation Strategy. Remote Sens. 2023, 15, 3528. [Google Scholar] [CrossRef]
  3. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58. [Google Scholar] [CrossRef]
  4. Song, S.; Jin, K.; Zuo, B.; Yang, J. A novel change detection method combined with registration for SAR images. Remote Sens. Lett. 2019, 10, 669–678. [Google Scholar] [CrossRef]
  5. Liu, Y.; Liu, Y.; Wang, J. RFM-GAN: Robust feature matching with GAN-based neighborhood representation for agricultural remote sensing image registration. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  6. Zhu, B.; Zhou, L.; Pu, S.; Fan, J.; Ye, Y. Advances and challenges in multimodal remote sensing image registration. IEEE J. Miniaturization Air Space Syst. 2023, 4, 165–174. [Google Scholar] [CrossRef]
  7. Feng, R.; Shen, H.; Bai, J.; Li, X. Advances and opportunities in remote sensing image geometric registration: A systematic review of state-of-the-art approaches and future research directions. IEEE Geosci. Remote Sens. Mag. 2021, 9, 120–142. [Google Scholar] [CrossRef]
  8. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by maximization of mutual information. IEEE Trans. Med Imaging 1997, 16, 187–198. [Google Scholar] [CrossRef]
  9. Sarvaiya, J.N.; Patnaik, S.; Bombaywala, S. Image registration by template matching using normalized cross-correlation. In Proceedings of the 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies, Bangalore, India, 28–29 December 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 819–822. [Google Scholar]
  10. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  11. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2564–2571. [Google Scholar]
  12. Bay, H. Surf: Speeded up robust features. In Computer Vision—ECCV; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  13. Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic segmentation of remote sensing images by interactive representation refinement and geometric prior-guided inference. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5400318. [Google Scholar] [CrossRef]
  14. Li, J.; Bi, G.; Wang, X.; Nie, T.; Huang, L. Radiation-Variation Insensitive Coarse-to-Fine Image Registration for Infrared and Visible Remote Sensing Based on Zero-Shot Learning. Remote Sens. 2024, 16, 214. [Google Scholar] [CrossRef]
  15. Shi, L.; Zhao, R.; Pan, B.; Zou, Z.; Shi, Z. Unsupervised multimodal remote sensing image registration via domain adaptation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5626211. [Google Scholar] [CrossRef]
  16. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A synergistical attention model for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400916. [Google Scholar] [CrossRef]
  17. Wang, W.; Mu, K.; Liu, H. A Multi-Hierarchy Flow Field Prediction Network for multimodal remote sensing image registration. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5232–5243. [Google Scholar] [CrossRef]
  18. Hughes, L.H.; Marcos, D.; Lobry, S.; Tuia, D.; Schmitt, M. A deep learning framework for matching of SAR and optical imagery. ISPRS J. Photogramm. Remote Sens. 2020, 169, 166–179. [Google Scholar] [CrossRef]
  19. Zhou, L.; Ye, Y.; Tang, T.; Nan, K.; Qin, Y. Robust matching for SAR and optical images using multiscale convolutional gradient features. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4017605. [Google Scholar] [CrossRef]
  20. Li, L.; Han, L.; Ding, M.; Liu, Z.; Cao, H. Remote sensing image registration based on deep learning regression model. IEEE Geosci. Remote Sens. Lett. 2020, 19, 8002905. [Google Scholar] [CrossRef]
  21. Ye, Y.; Tang, T.; Zhu, B.; Yang, C.; Li, B.; Hao, S. A multiscale framework with unsupervised learning for remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622215. [Google Scholar] [CrossRef]
  22. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar]
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  24. Shadaydeh, M.; Sziranyi, T. An improved mutual information similarity measure for registration of multi-modal remote sensing images. In Proceedings of the Image and Signal Processing for Remote Sensing XXI, Toulouse, France, 21–23 September 2015; SPIE: Bellingham, WA, USA, 2015; Volume 9643, pp. 146–152. [Google Scholar]
  25. Sedaghat, A.; Mokhtarzade, M.; Ebadi, H. Uniform robust scale-invariant feature matching for optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4516–4527. [Google Scholar] [CrossRef]
  26. Dellinger, F.; Delon, J.; Gousseau, Y.; Michel, J.; Tupin, F. SAR-SIFT: A SIFT-like algorithm for SAR images. IEEE Trans. Geosci. Remote Sens. 2014, 53, 453–466. [Google Scholar] [CrossRef]
  27. Ye, Y.; Shen, L. Hopc: A novel similarity metric based on geometric structural properties for multi-modal remote sensing image matching. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 9–16. [Google Scholar]
  28. Gao, C.; Li, W.; Tao, R.; Du, Q. MS-HLMO: Multiscale histogram of local main orientation for remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5626714. [Google Scholar] [CrossRef]
  29. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal image matching based on radiation-variation insensitive feature transform. IEEE Trans. Image Process. 2019, 29, 3296–3310. [Google Scholar] [CrossRef] [PubMed]
  30. Wang, S.; Quan, D.; Liang, X.; Ning, M.; Guo, Y.; Jiao, L. A deep learning framework for remote sensing image registration. ISPRS J. Photogramm. Remote Sens. 2018, 145, 148–164. [Google Scholar] [CrossRef]
  31. Xiao, Y.; Zhang, C.; Chen, Y.; Jiang, B.; Tang, J. ADRNet: Affine and Deformable Registration Networks for Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024. [Google Scholar] [CrossRef]
  32. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  33. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  34. Du, W.L.; Zhou, Y.; Zhu, H.; Zhao, J.; Shao, Z.; Tian, X. A Semi-Supervised Image-to-Image Translation Framework for SAR–Optical Image Matching. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4516305. [Google Scholar] [CrossRef]
  35. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 3508–3515. [Google Scholar]
  36. Zheng, K.; Gao, L.; Hong, D.; Zhang, B.; Chanussot, J. NonRegSRNet: A nonrigid registration hyperspectral super-resolution network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5520216. [Google Scholar] [CrossRef]
  37. Tang, L.; Deng, Y.; Ma, Y.; Huang, J.; Ma, J. SuperFusion: A versatile image registration and fusion network with semantic awareness. IEEE/CAA J. Autom. Sin. 2022, 9, 2121–2137. [Google Scholar] [CrossRef]
  38. Zhou, R.; Quan, D.; Wang, S.; Lv, C.; Cao, X.; Chanussot, J.; Li, Y.; Jiao, L. A unified deep learning network for remote sensing image registration and change detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5101216. [Google Scholar] [CrossRef]
  39. Wang, W.; He, J.; Liu, H. EMOST: A dual-branch hybrid network for medical image fusion via efficient model module and sparse transformer. Comput. Biol. Med. 2024, 179, 108771. [Google Scholar] [CrossRef] [PubMed]
  40. Wang, X.; Wang, X.; Song, R.; Zhao, X.; Zhao, K. MCT-Net: Multi-hierarchical cross transformer for hyperspectral and multispectral image fusion. Knowl.-Based Syst. 2023, 264, 110362. [Google Scholar] [CrossRef]
  41. Ni, Z.; Chen, X.; Zhai, Y.; Tang, Y.; Wang, Y. Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation. arXiv 2024, arXiv:2405.06228. [Google Scholar]
  42. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; IEEE Computer Society: Washington, DC, USA, 2023; pp. 1389–1400. [Google Scholar]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  44. Cao, S.Y.; Hu, J.; Sheng, Z.; Shen, H.L. Iterative deep homography estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1879–1888. [Google Scholar]
  45. Li, L.; Han, L.; Ding, M.; Cao, H. Multimodal image fusion framework for end-to-end remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5607214. [Google Scholar] [CrossRef]
  46. Meng, X.; Xiong, Y.; Shao, F.; Shen, H.; Sun, W.; Yang, G.; Yuan, Q.; Fu, R.; Zhang, H. A large-scale benchmark data set for evaluating pansharpening performance: Overview and implementation. IEEE Geosci. Remote Sens. Mag. 2020, 9, 18–52. [Google Scholar] [CrossRef]
  47. Mahapatra, D.; Ge, Z.; Sedai, S.; Chakravorty, R. Joint registration and segmentation of xray images using generative adversarial networks. In Proceedings of the Machine Learning in Medical Imaging: 9th International Workshop, MLMI 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Proceedings 9. Springer: Berlin/Heidelberg, Germany, 2018; pp. 73–80. [Google Scholar]
  48. Cao, X.; Yang, J.; Wang, L.; Xue, Z.; Wang, Q.; Shen, D. Deep learning based inter-modality image registration supervised by intra-modality similarity. In Proceedings of the Machine Learning in Medical Imaging: 9th International Workshop, MLMI 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Proceedings 9. Springer: Berlin/Heidelberg, Germany, 2018; pp. 55–63. [Google Scholar]
  49. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Deep image homography estimation. arXiv 2016, arXiv:1606.03798. [Google Scholar]
  50. Meng, L.; Zhou, J.; Liu, S.; Wang, Z.; Zhang, X.; Ding, L.; Shen, L.; Wang, S. A robust registration method for UAV thermal infrared and visible images taken by dual-cameras. ISPRS J. Photogramm. Remote Sens. 2022, 192, 189–214. [Google Scholar] [CrossRef]
  51. Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. Transmorph: Transformer for unsupervised medical image registration. Med Image Anal. 2022, 82, 102615. [Google Scholar] [CrossRef]
Figure 1. The overall process of the registration. F and M represent the fixed image and the moving image, respectively.
Figure 2. The overall architecture of the network.
Figure 3. The internal structure of the Swin Transformer network.
Figure 4. The overall architecture of the EMO module.
Figure 5. Registration process diagram using the STN network.
Figure 6. Comparison of qualitative results under large-scale rigid transformations.
Figure 7. The registration results of our method under different rigid deformation conditions on the panchromatic and multispectral dataset.
Figure 8. The registration results of our method under different rigid deformation conditions on the IR and OPT dataset.
Figure 9. The registration results of our method under different rigid deformation conditions on the SAR and OPT dataset.
Figure 10. The comparison results between the proposed method and other registration methods for slight rigid deformation registration.
Table 1. Comparison of quantitative indicators of PAN-MS registration results. The best quantitative indicators are in bold.
Method | RE↓ | MI↑ | NCC↑ | RMSE_4cor
SIFT | 11.5427 | 0.5943 | 0.8476 | 15.6192
RIFT | 49.9124 | 0.5056 | 0.4663 | 37.9398
TWMM | 45.0727 | 0.5265 | 0.5081 | 36.8204
TransMorph | 45.3599 | 0.5465 | 0.5058 | -
SuperFusion | 43.1178 | 0.4377 | 0.5243 | -
ADRNet_aff | 6.7682 | 0.6726 | 0.9601 | 0.5172
ADRNet | 5.3717 | 0.6800 | 0.9613 | -
Ours | 6.4662 | 0.6819 | 0.9695 | 0.4300
Table 2. Comparison of quantitative indicators of IR-OPT registration results. The best quantitative indicators are in bold.
Method | RE↓ | MI↑ | NCC↑ | RMSE_4cor
SIFT | 28.9698 | 0.3355 | 0.4048 | 98.2099
RIFT | 27.4187 | 0.5182 | 0.5124 | 87.4960
TWMM | 24.8630 | 0.5363 | 0.5378 | 82.5501
TransMorph | 24.3016 | 0.4924 | 0.5005 | -
SuperFusion | 23.3576 | 0.5478 | 0.6381 | -
ADRNet_aff | 4.6481 | 0.6742 | 0.9406 | 0.5581
ADRNet | 3.6888 | 0.7260 | 0.9598 | -
Ours | 3.5301 | 0.7305 | 0.9623 | 0.2586
Table 3. Comparison of quantitative indicators of SAR-OPT registration results. The best quantitative indicators are in bold.
Method | RE↓ | MI↑ | NCC↑ | RMSE_4cor
SIFT | 36.1812 | 0.2885 | 0.6193 | 356.0067
RIFT | 35.5640 | 0.4421 | 0.6591 | 206.2428
TWMM | 34.8395 | 0.4640 | 0.7273 | 90.1855
TransMorph | 35.6048 | 0.4038 | 0.6891 | -
SuperFusion | 39.6567 | 0.3756 | 0.6758 | -
ADRNet_aff | 9.5703 | 0.4503 | 0.8406 | 1.2129
ADRNet | 8.9019 | 0.4814 | 0.9541 | -
Ours | 8.6391 | 0.4842 | 0.9550 | 0.9397
Table 4. Comparison of model complexity and runtime efficiency with advanced methods.
Method | SIFT | RIFT | TWMM | TransMorph | SuperFusion | ADRNet_aff | ADRNet | Ours
Params (M) | - | - | - | 31.10 | 1.96 | 22.77 | 75.51 | 22.15
FLOPs (G) | - | - | - | 11.12 | 7.32 | 26.90 | 245.33 | 33.23
Test time (s) | 0.0421 | 0.175 | 15.1 | 0.023 | 0.160 | 0.203 | 0.313 | 0.227
Table 5. Results of ablation experiment. RE (SAR/OPT) represents the RE of moving images that are either SAR or OPT. The best quantitative indicators are in bold.
Method | RE (SAR/OPT)↓ | RMSE_4cor | FLOPs (G)
w/o Swin | 10.8353/14.6769 | 1.5701 | 30.0412
w/o EMO | 10.6707/11.7660 | 1.5947 | 32.4949
w/o SBM | 9.9812/8.7278 | 1.4563 | 32.7049
w/o MSFE | 9.5335/8.6891 | 1.2668 | 32.9430
w/o PCB | 9.2471/9.0489 | 1.1936 | 30.1435
Ours | 8.6391/8.4939 | 0.9397 | 33.2373
