Highlights
What are the main findings?
- We propose OSR-Net, an end-to-end optical–SAR registration framework that integrates multi-modal feature extraction, multi-scale channel attention, multi-scale affine transformation prediction, and an improved spatial transformer, achieving robust and accurate cross-modal alignment.
- The multi-constraint joint optimization strategy with dynamic weighting enhances the consistency between global geometric estimation and local structure, achieving state-of-the-art performance on the MultiResSAR dataset (RMSE < 1.25 px for different feature types).
What are the implications of the main findings?
- This method provides a reliable geometric basis for multi-source remote sensing tasks and realizes more accurate optical–SAR fusion, change detection, and environmental monitoring.
- The modular design and robustness of OSR-Net provide generalizable insights for developing advanced registration models in a wider range of multi-modal and multi-sensor remote sensing applications.
Abstract
Accurate registration of optical and synthetic aperture radar (SAR) images is a fundamental prerequisite for multi-source remote sensing data fusion and analysis. However, due to the substantial differences in imaging mechanisms, optical–SAR image pairs often exhibit significant radiometric discrepancies and spatially varying geometric inconsistencies, which severely limit the robustness of traditional feature or region-based registration methods in cross-modal scenarios. To address these challenges, this paper proposes an end-to-end Optical–SAR Registration Network (OSR-Net) based on multi-constraint joint optimization. The proposed framework explicitly decouples cross-modal feature alignment and geometric correction, enabling robust registration under large appearance variation. Specifically, a multi-modal feature extraction module constructs a shared high-level representation, while a multi-scale channel attention mechanism adaptively enhances cross-modal feature consistency. A multi-scale affine transformation prediction module provides a coarse-to-fine geometric initialization, which stabilizes parameter estimation under complex imaging conditions. Furthermore, an improved spatial transformer network is introduced to perform structure-preserving geometric refinement, mitigating spatial distortion induced by modality discrepancies. In addition, a multi-constraint loss formulation is designed to jointly enforce geometric accuracy, structural consistency, and physical plausibility. By employing a dynamic weighting strategy, the optimization process progressively shifts from global alignment to local structural refinement, effectively preventing degenerate solutions and improving robustness. Extensive experiments on public optical–SAR datasets demonstrate that the proposed method achieves accurate and stable registration across diverse scenes, providing a reliable geometric foundation for subsequent multi-source remote sensing data fusion.
1. Introduction
With the rapid advancement of spaceborne observation technologies, the availability of multi-source Earth observation data has grown substantially [1]. Among these, optical and synthetic aperture radar (SAR) imagery are two of the most complementary data sources, providing spectral and structural perspectives of the Earth’s surface, respectively. Optical images offer high spatial resolution and rich spectral details, while SAR systems operate independently of illumination and weather, enabling continuous all-weather observation [2]. These complementary characteristics make optical–SAR fusion the basis of applications such as change detection, land cover classification, and disaster monitoring [3,4]. Image registration, whose purpose is to geometrically align images obtained from different sensors or at different times [5], is the foundation of this multi-modal fusion.
Feature-based registration methods achieve geometric alignment by extracting structural or shape features (such as corners, edges, and line segments) from the images and establishing spatial correspondences between them [6,7]. In areas where ground objects have clear structure and distinct texture (such as urban buildings, roads, or coastlines), optical and SAR images often share a degree of structural correspondence, so feature-based methods can perform registration quite well. However, speckle noise, structural blurring, and local geometric distortion are widespread in SAR images, which makes the extraction and matching of feature points error-prone. Region-based registration methods instead compare the similarity of image gray levels or statistical features; common measures include mutual information (MI), normalized cross-correlation (NCC), and phase correlation [8,9]. In regions with relatively similar radiometric distributions or stable local contrast (such as areas where the terrain contours in SAR images are consistent with those in optical images), regional similarity measurement can be effective. This type of method does not rely on explicit feature extraction and thus retains some applicability in scenarios with sparse features. However, the correspondence between optical and SAR images in gray level and texture distribution is strongly nonlinear: the brightness distribution in optical images often has no direct correlation with SAR scattering intensity, making similarity measures based on gray level or intensity ineffective. Furthermore, SAR speckle noise further reduces the stability of the similarity calculation, causing region-based registration to easily fall into local optima.
In recent years, the rapid development of deep learning has provided new research directions for multi-modal image registration [10]. Neural networks can automatically learn hierarchical semantic and structural representations, which greatly enhance the recognition and matching of cross-modal features. In optical–SAR registration, learning-based methods have gradually replaced traditional feature extraction [11], feature matching [12,13,14], and geometric transformation estimation [15,16], or are designed as end-to-end trainable registration frameworks. Among them, affine transformation modeling has been widely adopted to represent the geometric relationship between image pairs [17]. This formulation compactly describes global transformations such as rotation, translation, scaling, and shearing, providing interpretability and reducing computational complexity [18]. In addition, affine constraints can serve as explicit geometric priors to guide network learning, which helps achieve more accurate and robust registration in complex cross-modal scenes. Therefore, deep learning methods based on affine modeling have become a promising direction in optical–SAR image registration research.
Although deep-learning-based registration methods have made progress, several challenges remain. Most existing methods use a single-task architecture [19,20], emphasizing global geometric estimation while ignoring local structural consistency, which often leads to sub-pixel misalignment. Many methods still rely on a single objective function or optimize feature extraction, transformation estimation, and similarity measurement in isolation [21]. Lacking a unified optimization framework, they find it difficult to balance geometric accuracy, structural similarity, and radiometric coherence. To address these problems, recent studies have explored multi-task [22,23] and multi-constraint [24] learning strategies to jointly optimize geometric, structural, and radiometric objectives in end-to-end systems [25,26,27]. These methods improve robustness across scales and modalities and form a more reliable basis for cross-modal registration.
Based on the above, this work proposes an Optical–SAR Registration Network (OSR-Net). This method constructs an end-to-end automatic registration framework, using a dataset containing affine variation information as the training basis. The model jointly optimizes the network by introducing affine parameter supervision, geometric consistency constraints, structural similarity constraints, and a multi-constraint loss function for reference preservation, in order to simultaneously consider geometric transformation accuracy, structural consistency, and radiation stability. Through this design, this paper aims to achieve automatic and precise registration of optical and SAR images in complex cross-modal scenarios and provide a unified geometric basis for the subsequent fusion of multi-source remote sensing data.
2. Methods
This study proposes an automatic optical–SAR registration method based on deep learning, which is called the Optical–SAR Registration Network (OSR-Net). The framework integrates the functions of feature extraction, feature fusion, the attention module, and spatial transformation into a unified and trainable architecture. OSR-Net takes an optical image and two SAR images (the original and an offset image) as inputs. It captures geometric and radiation features through a multi-scale channel attention mechanism, predicts the corresponding affine transformation parameters, and uses a spatial transformation module for accurate registration.
This section provides a detailed description of six main components:
- (1) The overall architecture of the Optical–SAR Registration Network;
- (2) The multi-modal feature extraction module (MFE);
- (3) The multi-scale channel attention module (MSCA);
- (4) The multi-scale affine transformation prediction module (MATP);
- (5) The improved spatial transformer network (ISTN);
- (6) The loss function design.
2.1. Overall Architecture of the Optical–SAR Registration Network (OSR-Net)
The proposed OSR-Net is an end-to-end optical–SAR registration framework composed of four major components: multi-modal feature extraction, multi-scale channel attention, multi-scale affine transformation prediction, and an improved spatial transformer network.
The overall architecture is illustrated in Figure 1. First, the input optical image and the two SAR images (original and offset versions) are processed by a dual-branch feature extraction module to generate modality-specific feature representations. Next, the extracted optical and SAR features are spatially aligned and concatenated along the channel dimension. The network then performs cross-modal feature compression and aggregation using a convolutional fusion layer, integrating complementary spatial and radiometric information.
Figure 1.
Optical–SAR Registration Network (OSR-Net) framework.
To further enhance feature discriminability and suppress redundant channel responses, the multi-scale channel attention (MSCA) mechanism is introduced. This module adaptively emphasizes informative features at multiple scales, thereby improving the robustness and registration sensitivity of the fused representation.
The fused features are then globally averaged and passed through a multi-layer fully connected network to regress the parameters of the affine transformation matrix. Finally, the predicted affine matrix is input to the spatial transformer (ST) module, which performs geometric correction and resampling on the offset SAR image to produce the registered SAR output.
In the training phase, the input image pairs are generated from the registration dataset through random affine transformations. This network optimization is supervised from two complementary perspectives: the first uses the random affine transformation parameters as regression labels; the second measures the similarity between the optical image and SAR image after registration to jointly improve registration accuracy and stability.
2.2. Multi-Modal Feature Extraction Module (MFE)
In the optical–SAR image registration, the input images come from different sensors. The optical sensors capture spectral and texture information, while the SAR system records the structure and scattering characteristics of the observation scene. Due to these inherent differences in imaging mechanisms, direct cross-modal registration in the pixel domain is highly sensitive to radiometric changes and noise, often leading to degraded registration accuracy.
To alleviate these problems, this study uses a multi-modal feature extraction (MFE) module based on the ResNet-18 backbone [28]. The module is designed to project the optical and SAR images into a unified high-level semantic feature space so that discriminative and robust feature representations suitable for cross-modal matching can be extracted.
The MFE module adopts a modality-specific input design to account for the distinct imaging mechanisms of optical and SAR data. For the optical branch, standard three-channel RGB images are used. It should be emphasized that the RGB channels are not exploited for color semantics, but rather serve as multiple complementary structural observations of the same scene. Differences in illumination response, material reflectance, and local contrast across RGB channels provide diverse structural cues, which are beneficial for learning geometry-related features that are more robust to cross-modal discrepancies.
For the SAR branch, which contains only a single intensity channel, the ResNet-18 backbone modifies the structure of the first convolution layer by changing its input channel number from three to one while keeping the kernel size, stride, and subsequent residual structures unchanged. Compared with the strategy of simply duplicating the SAR channel three times, this design avoids introducing redundant information and spurious inter-channel correlations, and it is more consistent with the physical characteristics of SAR backscattering.
At the same time, the two branches preserve structural consistency in the network architecture, ensuring feature comparability across modalities and providing a solid foundation for subsequent cross-modal alignment and fusion. From a methodological perspective, the MFE module is not inherently restricted to RGB inputs; multispectral optical imagery can be incorporated by treating each spectral band as an independent structural observation and adapting the input channel configuration of the first convolution layer accordingly. The detailed architecture of the MFE module is illustrated in Figure 2.
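As a concrete illustration of the single-channel adaptation, the numpy sketch below shows how a 3-channel first-conv weight tensor can be reduced to one input channel while keeping the kernel size and filter count unchanged. The averaging of RGB kernels is a common initialization heuristic and an assumption for illustration; the paper itself simply redefines and trains the modified layer.

```python
import numpy as np

def adapt_first_conv(weight_rgb: np.ndarray) -> np.ndarray:
    """Reduce a 3-channel first-conv weight (out, 3, kH, kW) to a single
    SAR intensity channel (out, 1, kH, kW) by averaging the RGB kernels.
    Averaging keeps the expected filter response magnitude comparable;
    kernel size, stride, and filter count are untouched."""
    assert weight_rgb.ndim == 4 and weight_rgb.shape[1] == 3
    return weight_rgb.mean(axis=1, keepdims=True)

# ResNet-18's first layer: 64 filters with 7x7 kernels.
w_rgb = np.random.default_rng(0).standard_normal((64, 3, 7, 7))
w_sar = adapt_first_conv(w_rgb)
print(w_sar.shape)  # (64, 1, 7, 7)
```

The same helper generalizes to the multispectral case mentioned above by replacing the mean over three bands with a mean (or learned projection) over however many bands the optical input provides.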
Figure 2.
Architecture of the multi-modal feature extraction (MFE) module. It summarizes the specific details of the module. The main part is based on the ResNet-18 architecture and consists of four layers (layer1–layer4).
ResNet-18 is chosen as the backbone of the multi-modal feature extraction (MFE) module, mainly because its residual learning architecture effectively alleviates the vanishing-gradient and network-degradation problems common in deep convolutional networks.
Each residual block is defined as:

y = F(x, {W_i}) + x

where x denotes the input feature, F(x, {W_i}) represents a nonlinear mapping composed of convolution, batch normalization, and activation functions with weights {W_i}, and y is the block output.
This structure enables the gradient to propagate directly to the shallow layer through identity mapping, so as to ensure that the model has good convergence while increasing depth.
The feature extraction module takes as input an image I ∈ R^(C×H×W), where C is the number of channels and H and W are the height and width of the image, respectively. First, the input passes through a convolution layer with a large receptive field (7 × 7, stride 2) and a max-pooling operation, achieving preliminary spatial compression and low-level feature extraction. The feature then passes through four stages of residual structure (layer1–layer4). Within each stage, the network performs nonlinear mapping through basic residual blocks, while spatial downsampling is realized by stride-2 convolutions between stages, so that the feature resolution is halved stage by stage and the number of channels gradually expands (from 64 to 512). This hierarchical design enables the network to encode structural information at multiple semantic levels. The input image is thus mapped into a high-dimensional feature tensor, where the growth of the channel dimension corresponds to a rich set of learned feature responses that collectively represent geometric structures, contextual relationships, and semantic patterns. This high-dimensional representation enhances the distinguishability and robustness of the features, which is crucial for reliable cross-modal matching and geometric alignment in optical–SAR image registration.
The output feature tensor is formulated as:

F = f_ResNet18(I) ∈ R^(C′×H′×W′)

where f_ResNet18 represents the nonlinear mapping function and C′, H′, and W′ represent the output channel number and spatial dimensions, respectively (for ResNet-18, C′ = 512, H′ = H/32, and W′ = W/32).
Therefore, the feature extraction module based on ResNet-18 not only effectively adapts to the modal differences between optical and SAR data but also maintains discriminability and robustness in multi-level semantic representation, providing a solid foundation for subsequent cross-modal feature fusion and geometric transformation prediction.
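To make the downsampling arithmetic explicit, the small helper below traces the feature shape through the backbone. The factor of 2^5 = 32 follows from the stride-2 first convolution, the stride-2 max-pool, and the stride-2 transitions entering layer2–layer4 (layer1 preserves resolution); this is standard ResNet-18 behavior, sketched here for illustration.

```python
def resnet18_feature_shape(h: int, w: int, c_out: int = 512):
    """Trace the spatial size of the ResNet-18 feature map: five
    stride-2 stages (7x7 conv, 3x3 max-pool, and the transitions into
    layer2, layer3, layer4) each halve the resolution, for a total
    downsampling factor of 2**5 = 32."""
    for _ in range(5):
        h, w = (h + 1) // 2, (w + 1) // 2  # halving with 'same'-style padding
    return (c_out, h, w)

print(resnet18_feature_shape(256, 256))  # (512, 8, 8)
```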
2.3. Multi-Scale Channel Attention Module (MSCA)
In the optical–SAR cross-modal registration task, the effective fusion of optical and SAR features is key to improving registration robustness. Because of the fundamental difference in imaging mechanisms between the two types of data, their feature distributions exhibit a modality gap. If feature fusion is realized only by simple weighting or concatenation, discriminative features are easily submerged and noise is easily amplified. In recent years, attention mechanisms have been widely used in computer vision tasks; typical representatives include the Squeeze-and-Excitation (SE) module [29] and the Convolutional Block Attention Module (CBAM) [30]. Their core idea is to enhance the effectiveness and robustness of feature representations by adaptively allocating weights, highlighting task-relevant discriminative features, and suppressing redundant or noisy information. However, when applied directly to optical–SAR registration, such mechanisms suffer from insufficient expressive capacity and an unstable learning process.
In optical–SAR cross-modal registration, the two modalities exhibit fundamentally different feature distributions due to their distinct imaging mechanisms. Optical images primarily encode surface reflectance and texture patterns governed by illumination conditions and material properties, resulting in relatively smooth intensity variations and strong spectral correlations. In contrast, SAR images represent microwave backscattering responses that are highly sensitive to surface roughness, geometric structures, and dielectric properties, often accompanied by speckle noise and nonlinear intensity distributions.
These inherent differences lead to a mismatch in feature statistics across modalities. The proposed multi-scale channel attention (MSCA) module is designed to mitigate this issue by adaptively reweighting channel-wise feature responses, suppressing modality-specific noise, and emphasizing structurally consistent features suitable for cross-modal geometric alignment.
Based on this problem, this study introduces a multi-scale channel attention (MSCA) module. By explicitly modeling the channel dependencies and multi-scale context information, the adaptive alignment and enhancement of cross-modal features are realized. The overall structure is shown in Figure 3.
Figure 3.
Architecture of the multi-scale channel attention (MSCA) module. This figure is a detailed description of the network. The 1 × 1 convolutional layer performs channel projection and is used for dimension reduction and cross-channel information integration.
Let F_opt, F_sar ∈ R^(C×H×W) denote the feature tensors extracted from the optical and SAR branches, where C, H, and W represent the channel number, height, and width, respectively. To eliminate the channel distribution difference between modalities, a 1 × 1 convolution is introduced as a linear projection operator that maps the two feature tensors into a unified channel space:

F̃_opt = Conv_1×1(F_opt), F̃_sar = Conv_1×1(F_sar)

where C′ denotes the projected channel dimension, i.e., F̃_opt, F̃_sar ∈ R^(C′×H×W). The projected features are then concatenated along the channel dimension to obtain the fused feature:

F_cat = Concat(F̃_opt, F̃_sar) ∈ R^(2C′×H×W)
To capture the local and global dependencies in the cross-modal features, the MSCA module applies a multi-scale convolution structure that encodes different receptive fields over the concatenated features:

F_k = Conv_k×k(F_cat)

where Conv_k×k represents a convolution operation with a kernel size of k × k, followed by batch normalization and ReLU activation, and the responses obtained at the different kernel sizes are aggregated into a multi-scale representation F_ms.
This design preserves local texture details and global structural patterns at the same time, which is better matched to the diverse feature distributions of optical and SAR data. From the multi-scale representation F_ms, the attention weight is generated by channel compression and activation:

A = σ(W_2 δ(W_1 GAP(F_ms)))

where GAP denotes global average pooling, W_1 and W_2 are fully connected layers, δ is the ReLU activation, and σ is the sigmoid function.
The attention weight A acts on the preliminarily aligned feature representations, weighting the optical and SAR channels, respectively:

F′_opt = A_opt ⊙ F̃_opt, F′_sar = A_sar ⊙ F̃_sar

where ⊙ denotes the element-wise product and A_opt and A_sar correspond to the partitioned attention weights for the respective modalities. The reweighted features are then concatenated and passed through a 3 × 3 convolutional fusion block comprising convolution, batch normalization, and ReLU activation:

F_fuse = Φ_3×3(Concat(F′_opt, F′_sar))

where Φ_3×3 represents the composite convolution, normalization, and activation mapping.
Overall, compared with typical attention mechanisms, this module captures both global and local information through receptive fields of different convolution kernel sizes and naturally embeds spatial context information in the channel dimension. Without explicitly introducing spatial branches, the network can dynamically select the more discriminative channels according to the modal differences between optical and SAR features, which helps improve the robustness of subsequent feature fusion and geometric transformation prediction.
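The channel-reweighting core of the module can be sketched in numpy. This is a simplified single-scale, SE-style illustration with placeholder weight shapes; the actual module additionally uses the multi-scale convolutions and learned 1 × 1 projections described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """SE-style channel attention on a (C, H, W) feature map:
    GAP -> FC (C -> C/r) -> ReLU -> FC (C/r -> C) -> sigmoid,
    then channel-wise reweighting of the input features."""
    z = feat.mean(axis=(1, 2))                 # global average pooling: (C,)
    a = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # attention weights in (0, 1)
    return feat * a[:, None, None], a

rng = np.random.default_rng(0)
C, r = 8, 2                                   # channels and reduction ratio
feat = rng.standard_normal((C, 16, 16))
w1 = 0.1 * rng.standard_normal((C // r, C))   # first FC layer
w2 = 0.1 * rng.standard_normal((C, C // r))   # second FC layer
reweighted, weights = channel_attention(feat, w1, w2)
```

Because the sigmoid keeps each weight in (0, 1), noisy channels can only be attenuated, never amplified, which matches the module's goal of suppressing modality-specific noise.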
2.4. Multi-Scale Affine Transformation Prediction Module (MATP)
After the MFE and MSCA modules, mapping high-dimensional feature representations to accurate geometric transformation parameters becomes a key challenge in optical–SAR image registration. Traditional methods often rely on single-scale feature matching and neglect the multi-scale structural characteristics of images, which may lead to inaccurate estimation under scale variation or spatially varying geometric inconsistencies. Therefore, this study designs a multi-scale affine transformation prediction (MATP) module that progressively constructs multi-resolution feature representations in a coarse-to-fine optimization manner, improving the robustness and accuracy of geometric parameter estimation.
Specifically, the MATP builds a hierarchical prediction framework based on feature representations extracted by ResNet-18 and the multi-scale channel attention mechanism (Figure 4). The affine parameters are first regressed from lower-resolution feature maps to obtain an initial alignment estimation. At this stage, high-level semantic and structural cues dominate the representation, enabling stable parameter estimation under large geometric discrepancies. To this end, the MATP module intentionally aggregates spatial information through pooling operations. Although this strategy further reduces spatial resolution, it effectively suppresses noise and modality-specific artifacts while preserving dominant geometric trends, which is more critical for reliable affine parameter regression. Subsequently, the affine parameter estimation is progressively refined using higher-resolution feature representations. This refinement process focuses on reducing residual misalignment by leveraging richer structural details encoded in the feature space, thereby improving overall registration accuracy without explicitly modeling nonrigid deformation.
Figure 4.
Architecture of the multi-scale affine transformation prediction (MATP) module and improved spatial transformer network (ISTN) module.
The affine transformation is parameterized as a six-degree-of-freedom (6-DoF) model:

[x′; y′] = [a11, a12; a21, a22][x; y] + [t_x; t_y]

where a11, a12, a21, and a22 represent rotation, scaling, and shearing parameters, while t_x and t_y denote translation. A fully connected (FC) layer aggregates the feature tensor globally and regresses these six affine parameters. This design emphasizes stable and interpretable geometric estimation under cross-modal conditions while maintaining robustness to SAR speckle noise and optical illumination variation.
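Applying the six regressed parameters to image coordinates amounts to a 2 × 2 linear map plus a translation. A minimal numpy sketch (the parameter ordering here is an assumption chosen for illustration):

```python
import numpy as np

def apply_affine(theta, pts):
    """Apply a 6-DoF affine transform theta = (a11, a12, a21, a22, tx, ty)
    to an (N, 2) array of (x, y) coordinates: p' = A p + t."""
    a11, a12, a21, a22, tx, ty = theta
    A = np.array([[a11, a12], [a21, a22]])  # rotation / scale / shear
    t = np.array([tx, ty])                  # translation
    return pts @ A.T + t

pts = np.array([[0.0, 0.0], [10.0, 5.0]])
identity = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)   # no rotation, scale, shear, or shift
print(apply_affine(identity, pts))
```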
In addition, the MATP module introduces a progressive supervision strategy by imposing loss constraints on affine predictions at different feature resolutions. This strategy guides the network toward a stable geometric configuration at early training stages and gradually improves estimation accuracy during subsequent optimization, reducing the risk of poor convergence. Through this progressive learning process, the network incrementally approximates the underlying geometric relationship in the feature space, facilitating robust cross-modal registration.
In summary, the proposed MATP module adopts a progressive refinement strategy for affine parameter estimation, leveraging multi-scale feature representations and hierarchical supervision. This design enhances the stability and accuracy of geometric estimation under cross-modal conditions, providing a reliable initialization for subsequent spatial transformation and high-precision registration.
2.5. Improved Spatial Transformer Network (ISTN)
In optical–SAR cross-modal registration tasks, relying solely on feature-level similarity measurement and affine parameter prediction is often insufficient to achieve accurate final alignment. This limitation arises from inherent differences in imaging geometry and sensing mechanisms between optical and SAR modalities, which introduce spatially varying and continuous geometric inconsistencies related to viewing angle, terrain effects, and scattering characteristics. To further improve geometric alignment, an improved spatial transformer network (ISTN) is constructed on top of the multi-scale affine prediction to enable end-to-end learnable spatial refinement [31]. It should be emphasized that the affine transformation regressed by the MATP module serves as a stable and physically interpretable initialization rather than a complete deformation model. The ISTN module operates on feature maps after affine initialization and progressively reduces residual misalignment through structure-preserving sampling and multi-scale feature interaction, without explicitly modeling nonrigid deformation parameters.
The traditional STN module is mainly composed of three parts: localization network, grid generator, and sampler. The localization network is responsible for regressing the geometric transformation parameters, the grid generator generates the sampling grid in the target coordinate system according to the parameters, and the sampler uses bilinear interpolation to map the input image to the target space. However, the standard structure has two deficiencies in the cross-modal task: First, the parameter estimation at a single scale cannot effectively deal with the geometric differences at different scales. Second, the sampling process is not enough to preserve the structural continuity, which easily leads to texture blur or edge information loss.
To solve these limitations, an improved spatial transformer network (ISTN) is designed with two key enhancements: multi-scale progressive transformation and structure-preserving sampling. The ISTN adopts a hierarchical refinement strategy guided by feature representations at different spatial resolutions, enabling progressive reduction in residual misalignment rather than explicit region-wise deformation modeling.
As illustrated in Figure 4, a conditional branch labeled “Judge H” is introduced to adaptively select the sampling strategy based on the spatial height H of the input feature map. When H exceeds a predefined threshold (512), a downsampled sampling path is applied to improve computational efficiency and numerical stability; otherwise, standard sampling is performed.
The ISTN follows a progressive refinement process across feature resolutions. Lower-resolution feature maps are first transformed to reduce dominant misalignment, while higher-resolution features are subsequently used to further refine residual discrepancies. This progressive strategy enhances optimization stability, convergence behavior, and alignment accuracy without introducing explicit nonrigid deformation models.
The formula is as follows:

I_out(x_i, y_i) = Σ_(n,m) w(x_i^s − n, y_i^s − m) I_in(n, m), with (x_i^s, y_i^s) = T_θ(x_i, y_i)

where I_in and I_out denote the input and output images, respectively, T_θ is the transformation predicted by the localization network, and w(·) denotes a spatial weighting kernel that assigns higher weights to sampling points with closer spatial proximity. This kernel encourages locally smooth and structure-preserving interpolation, reducing discontinuities compared with standard bilinear sampling.
Compared with standard bilinear interpolation, the proposed structure-preserving sampling mechanism maintains geometric coherence and structural continuity during spatial transformation, alleviating degradation caused by modality discrepancies. Through these improvements, the ISTN module enhances robustness to speckle noise, illumination variation, and viewing-angle-induced inconsistencies, providing effective spatial refinement within a progressive cross-modal registration framework.
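The sampling step can be illustrated with plain inverse-mapped bilinear interpolation. This is a simplified stand-in: the paper's structure-preserving kernel replaces the plain bilinear weights below, and its exact form is not reproduced here.

```python
import numpy as np

def warp_affine_bilinear(img: np.ndarray, theta) -> np.ndarray:
    """Warp a 2D image by the affine transform theta = (a11, a12, a21,
    a22, tx, ty) using inverse mapping: each output pixel samples the
    input at the inverse-transformed location with bilinear weights
    (out-of-range source coordinates are clamped to the border)."""
    a11, a12, a21, a22, tx, ty = theta
    A_inv = np.linalg.inv(np.array([[a11, a12], [a21, a22]]))
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src = np.stack([xs - tx, ys - ty], axis=-1) @ A_inv.T  # inverse map
    sx = np.clip(src[..., 0], 0, W - 1)
    sy = np.clip(src[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy

img = np.arange(16.0).reshape(4, 4)
same = warp_affine_bilinear(img, (1.0, 0.0, 0.0, 1.0, 0.0, 0.0))  # identity
```

Because the whole warp is built from differentiable arithmetic, the equivalent tensor implementation allows gradients to flow back into the predicted affine parameters, which is what makes the spatial transformer end-to-end trainable.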
2.6. Multi-Constraint Joint Optimization Loss Function System
In optical–SAR image registration tasks, a single loss function cannot fully describe the many aspects of registration quality. For example, optimizing only the affine matrix parameters can ensure geometric accuracy, but it is difficult to correct the local deformation caused by differences in the imaging mechanisms. Conversely, relying only on image-level similarity constraints makes the optimization prone to local optima, resulting in inaccurate transformation results. Therefore, building a multi-constraint joint optimization loss system that balances geometric accuracy, structural consistency, and physical plausibility is of great help in improving the performance and robustness of cross-modal registration.
The proposed multi-constraint loss system consists of four core components: parameter supervision, geometric consistency, cross-modal structural similarity, and reference preservation. Different from the traditional methods, this study introduces a dynamic weighting strategy based on the training process, which effectively alleviates the common loss weight sensitivity problem in multi-task learning.
2.6.1. Theoretical Background of Weight Sensitivity
In the multi-task learning (MTL) framework, different sub-tasks usually have different scales, gradient magnitudes, and convergence rates [32]. If fixed loss weights are used, some tasks may dominate the optimization process in early training, resulting in learning imbalance or even overfitting. Traditional static weighting methods struggle to adapt to training dynamics; although weighting methods based on uncertainty or gradient normalization can partially alleviate this problem, they remain insufficient in cross-modal scenarios, which demand high geometric accuracy and structural consistency.
Therefore, this paper adopts an adaptive dynamic weighting strategy: in the initial stage of training, the network focuses on the global geometric constraints to ensure convergence stability and achieve coarse-scale alignment. With the deepening of training, the weight gradually shifts to local structure refinement and cross-modal similarity optimization, forming an optimization mechanism from coarse to fine, so as to achieve an effective balance between accuracy and robustness.
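One simple way to realize such a coarse-to-fine schedule is a linear ramp over training progress. The functional form and constants below are illustrative assumptions; the paper specifies only that the weights shift from global geometry to local structure.

```python
def dynamic_weights(epoch: int, total_epochs: int):
    """Coarse-to-fine weight schedule: the geometric-supervision weight
    decays while the structural/similarity weight ramps up, shifting
    the optimization from global alignment to local refinement.
    Returns (w_geometric, w_structural)."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)  # training progress in [0, 1]
    w_geo = 1.0 - 0.5 * t     # stays positive so geometry is never ignored
    w_struct = 0.1 + 0.9 * t  # starts small, dominates late training
    return w_geo, w_struct

print(dynamic_weights(0, 100))   # early: geometry dominates
print(dynamic_weights(99, 100))  # late: structural refinement dominates
```

Keeping the geometric weight bounded away from zero is a deliberate choice in this sketch: it prevents the structural terms from dragging the solution away from the globally consistent alignment found early in training.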
2.6.2. Composition and Definition of the Loss Function
To ensure accurate global geometric transformation prediction, the affine supervision loss is defined as:
\( \mathcal{L}_{\mathrm{affine}} = \lVert \hat{\theta} - \theta \rVert_2^2 \)
where \( \hat{\theta} \) and \( \theta \) denote the predicted and ground-truth affine transformation parameters, respectively. This term provides direct supervision in the geometric parameter space to ensure the global alignment accuracy and interpretability of the prediction results.
Optical and SAR images share structural correspondences such as corners and road intersections. To enforce local geometric consistency, we define a feature-point-based loss:
\( \mathcal{L}_{\mathrm{point}} = \frac{1}{N} \sum_{i=1}^{N} \lVert T_{\hat{\theta}}(p_i^{\mathrm{SAR}}) - p_i^{\mathrm{opt}} \rVert_2 \)
where \( N \) is the number of corresponding feature points, \( p_i^{\mathrm{SAR}} \) and \( p_i^{\mathrm{opt}} \) denote the i-th feature points in the SAR and optical images, respectively, and \( T_{\hat{\theta}} \) represents the affine transformation parameterized by \( \hat{\theta} \). This loss compensates for the global affine loss's insensitivity to local deformations, ensuring fine-grained geometric alignment.
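Under the definitions above, the feature-point loss can be sketched in a few lines of NumPy; the 2 × 3 affine parameterization is an assumption consistent with the rest of the paper:

```python
import numpy as np

def warp_points(theta, pts):
    """Apply a 2x3 affine matrix `theta` to an (N, 2) point array."""
    h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    return h @ theta.T

def point_consistency_loss(theta, sar_pts, opt_pts):
    """Mean Euclidean distance between warped SAR keypoints and their
    corresponding optical keypoints."""
    d = warp_points(theta, np.asarray(sar_pts, float)) - np.asarray(opt_pts, float)
    return float(np.mean(np.linalg.norm(d, axis=1)))
```

For the identity transform and identical point sets the loss is zero; any residual misalignment contributes its average Euclidean displacement in pixels.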
To address radiometric differences between modalities, we adopt the Structural Similarity Index (SSIM) to measure cross-modal alignment:
\( \mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(I_{\mathrm{SAR}}^{w}, I_{\mathrm{opt}}) \)
where \( I_{\mathrm{SAR}}^{w} \) denotes the SAR image warped according to the predicted transformation and \( I_{\mathrm{opt}} \) is the grayscale optical image. The SSIM evaluates luminance, contrast, and structural similarity simultaneously, promoting robust alignment under varying illumination and noise conditions.
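As an illustration, a single-window (global) variant of SSIM can be written directly from its definition; practical implementations compute SSIM over sliding local windows, so this sketch is a simplification:

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM over the whole image (a simplification of the
    usual locally windowed SSIM)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()  # covariance of the two images
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_loss(warped_sar, optical):
    """Cross-modal structural loss: L_SSIM = 1 - SSIM."""
    return 1.0 - ssim_global(warped_sar, optical)
```

Identical images give zero loss; structural disagreement lowers the covariance term and increases the loss.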
It has been widely recognized in the image registration literature that geometric accuracy alone is insufficient to guarantee structurally consistent and physically plausible alignment, and that additional regularization and structural constraints are necessary to avoid degenerate solutions [33]. A reference consistency constraint is formulated as:
\( \mathcal{L}_{\mathrm{ref}} = \lambda_1 \, \mathrm{MSE}(I_{\mathrm{SAR}}^{w}, I_{\mathrm{SAR}}) + \lambda_2 \left( 1 - \mathrm{SSIM}(I_{\mathrm{SAR}}^{w}, I_{\mathrm{SAR}}) \right) \)
where \( I_{\mathrm{SAR}} \) is the original SAR image and \( \lambda_1 \) and \( \lambda_2 \) are weighting coefficients balancing the mean squared error and SSIM terms. This component acts as a regularizer, preventing overfitting and maintaining physical plausibility in the registration results.
This reference consistency term does not enforce pixel-wise radiometric similarity across modalities. Instead, it regularizes the spatial transformation during warping by penalizing excessive displacement and abrupt deformation. During optimization, transformations that introduce large spatial stretching, compression, or irregular sampling patterns result in increased loss values, thereby discouraging geometrically unrealistic solutions. This mechanism biases the learning process toward smooth, continuous, and bounded transformations that are consistent with typical remote sensing imaging geometry.
2.6.3. Dynamic Weighting Strategy
The total loss is formulated as:
\( \mathcal{L}_{\mathrm{total}} = \alpha(t)\, \mathcal{L}_{\mathrm{affine}} + \beta(t)\, \mathcal{L}_{\mathrm{SSIM}} + \gamma(t)\, \mathcal{L}_{\mathrm{ref}} + \delta(t)\, \mathcal{L}_{\mathrm{point}} \)
where the dynamic weights \( \alpha(t), \beta(t), \gamma(t), \delta(t) \) evolve over training iterations \( t \).
The strategy emphasizes the geometric and physical constraints in the early stage of training, assigning higher weights to the geometric alignment loss to ensure stable convergence. As training proceeds, the weights of the structural consistency and local feature constraints are gradually increased according to a predefined schedule, refining local alignment and yielding stable, high-precision joint optimization.
2.6.4. Complementarity and Effectiveness
The proposed multi-constraint loss establishes a four-level complementary supervision structure encompassing global, local, structural, and physical aspects: \( \mathcal{L}_{\mathrm{affine}} \): global geometric alignment; \( \mathcal{L}_{\mathrm{point}} \): local structure consistency; \( \mathcal{L}_{\mathrm{SSIM}} \): cross-modal structural matching; and \( \mathcal{L}_{\mathrm{ref}} \): physical realism preservation.
The joint optimization of multiple constraints significantly enhances the stability of the model under noise and deformation and effectively avoids degenerate solutions. Experimental results show that, compared with a traditional single loss or a static weighting strategy, the proposed system achieves better registration accuracy and robustness.
3. Experiments
3.1. Experimental Setup
To evaluate the effectiveness and robustness of the proposed method for optical–SAR image registration, the MultiResSAR dataset [34] was used as the experimental data source. This dataset is specifically designed for optical–SAR registration and related multi-modal remote sensing tasks. It contains more than 10,000 coregistered optical–SAR image pairs collected from multiple satellite platforms and covers diverse spatial resolutions, imaging conditions, and scene types, including urban areas, rural regions, plains, mountainous terrain, and water bodies.
The MultiResSAR dataset incorporates SAR data acquired by several satellite systems, such as Sentinel-1, GF-3, HT1-A, and Umbra, with spatial resolutions ranging from 0.16 m to 10 m. To ensure the reliability of the reference alignment, the dataset construction combines automated registration procedures with manual visual inspection, providing high-quality benchmark data for cross-modal image registration.
In this study, all experiments were conducted using a low-resolution subset of the currently publicly available MultiResSAR dataset. This subset mainly includes coregistered optical–SAR image pairs with spatial resolutions on the order of one to ten meters. Each pair consists of an optical image and its corresponding SAR image. Optical data provide rich spectral and texture information under near-nadir observation geometry, while SAR data capture structural and scattering features dominated by the radar backscattering mechanism and side-looking imaging geometry. Due to slant-range projection and variation in incidence angle, SAR images may exhibit geometric distortion and scale inconsistency. These inherent differences in imaging mechanisms pose significant challenges to precise optical–SAR registration and motivated the design of the proposed OSR-Net.
3.1.1. Data Preprocessing and Augmentation
Since the experiments were limited to the publicly available low-resolution subset of MultiResSAR, the optical and SAR images within each pair shared comparable spatial resolutions. Prior to network input, both modalities were resampled and resized to a fixed input size of 512 × 512 pixels to ensure consistency during training and evaluation. We designed a geometric augmentation strategy for optical–SAR registration and applied controllable affine perturbations to the SAR images, including small-angle rotation (±5°), pixel-level translation (e.g., ±10 px along the x-axis), and isotropic scaling. This strategy simulates the geometric distortions commonly encountered in real-world multi-modal imaging.
Unlike conventional augmentation methods, this approach introduces a bidirectional normalized coordinate transformation matrix to ensure affine consistency in the normalized coordinate space, allowing supervision matrices for regression-based training to be generated directly. Harris corner detection combined with non-maximum suppression (NMS) was used to extract structurally stable keypoints from the optical images. These keypoints were then geometrically mapped to the augmented SAR image by the inverse affine transformation, establishing explicit feature-level geometric correspondences. This process preserved strict geometric consistency among augmented samples and provided accurate supervision for the feature point consistency loss during network training.
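A minimal NumPy sketch of the perturbation-and-mapping step, using the augmentation ranges quoted above; the normalized-coordinate bookkeeping of the actual pipeline is omitted here and pixel coordinates are used instead:

```python
import numpy as np

def random_affine(max_rot_deg=5.0, max_shift=10.0, max_scale=0.05, rng=None):
    """Sample a 2x3 affine with small rotation (deg), translation (px),
    and isotropic scaling, matching the perturbation ranges in the text."""
    rng = rng if rng is not None else np.random.default_rng()
    ang = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    c, n = s * np.cos(ang), s * np.sin(ang)
    return np.array([[c, -n, tx], [n, c, ty]])

def invert_affine(theta):
    """Invert a 2x3 affine, used to map optical keypoints into the
    augmented SAR frame."""
    A, t = theta[:, :2], theta[:, 2]
    A_inv = np.linalg.inv(A)
    return np.hstack([A_inv, (-A_inv @ t)[:, None]])

def map_points(theta, pts):
    """Apply a 2x3 affine to an (N, 2) keypoint array."""
    return np.asarray(pts, float) @ theta[:, :2].T + theta[:, 2]
```

Mapping a keypoint through the perturbation and its inverse returns it to its original position, which is the consistency property the supervision relies on.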
3.1.2. Implementation Details
The proposed OSR-Net framework was implemented in PyTorch 2.5.1 and trained on a high-performance computing platform equipped with an Intel(R) Core(TM) Ultra 7 265K CPU, 96 GB RAM, and an NVIDIA GeForce RTX 5090 GPU. The AdamW optimizer [35] was employed with an initial learning rate of 1 × 10−4 and a weight decay of 1 × 10−5, together with a cosine annealing learning rate scheduler. The batch size was set to 4, and the total number of training epochs was 150.
All input images were resized to 512 × 512, and the network was initialized using ImageNet-pretrained ResNet-18 weights to accelerate convergence. Gradient clipping was used to enhance the stability of training. The system stored all training models, registration results, and training logs to ensure the repeatability of the experiment and facilitate subsequent quantitative comparison.
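For reference, the cosine annealing schedule used with AdamW follows the standard closed form; the minimum rate of 0 is an assumption here (it is a configurable parameter, `eta_min`, in PyTorch's `CosineAnnealingLR`):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=0.0):
    """Standard cosine-annealed learning rate, starting from the paper's
    initial rate of 1e-4 and decaying toward lr_min over training."""
    return lr_min + 0.5 * (lr_max - lr_min) * \
        (1.0 + math.cos(math.pi * epoch / total_epochs))
```

With the paper's settings (150 epochs, initial rate 1 × 10⁻⁴), the rate starts at 1 × 10⁻⁴, halves around epoch 75, and decays smoothly toward zero.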
3.2. Registration Performance Analysis of OSR-Net
3.2.1. Qualitative Evaluation
To qualitatively evaluate the registration performance of the proposed OSR-Net under different land cover types, visual comparisons were made in four representative areas: urban, farmland, water, and mixed zones. The registration results were visualized through checkerboard overlay and image stitching (Figure 5), directly displaying the spatial differences before and after alignment.
Figure 5.
Examples of OSR-Net registration results. (a) Optical image; (b) SAR image; (c) registered result by OSR-Net; (d) checkerboard overlay between the optical image and the unregistered SAR image; (e) checkerboard overlay between the optical image and the registered SAR image.
As shown in Figure 5, optical and SAR images exhibit noticeable misalignment in edge contours, texture patterns, and structural layouts before registration, particularly in urban regions with clear building boundaries and farmland areas characterized by regular grid textures. After applying OSR-Net, the spatial consistency between the two modalities is significantly improved: urban structures become more regular, farmland boundaries and road networks are more coherent, and transitions at land–water interfaces appear smoother. The red boxes highlight representative regions where OSR-Net notably enhances the continuity and connectivity of road structures after registration. Furthermore, checkerboard visualizations demonstrate a clear reduction in geometric discrepancies between modalities, indicating that OSR-Net effectively mitigates cross-modal displacement and achieves reliable alignment of spatial and structural features across diverse land cover scenarios.
3.2.2. Quantitative Evaluation
In the quantitative evaluation, root mean square error (RMSE) and average corner error (ACE) were used as the main performance indicators to assess registration accuracy and stability across different land cover types.
RMSE measures the overall prediction deviation and is defined as:
\( \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \)
where \( \hat{y}_i \) and \( y_i \) denote the predicted and ground-truth values. ACE quantifies the localization accuracy of corner points and is given by:
\( \mathrm{ACE} = \frac{1}{M} \sum_{j=1}^{M} \lVert \hat{c}_j - c_j \rVert_2 \)
where \( \hat{c}_j \) and \( c_j \) represent the predicted and ground-truth corner coordinates, respectively, and \( M \) is the total number of corner points.
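Both metrics are straightforward to compute; a NumPy sketch following the definitions above:

```python
import numpy as np

def rmse(pred, gt):
    """Root mean square error between predicted and ground-truth values."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def ace(pred_corners, gt_corners):
    """Average corner error: mean Euclidean distance over M corner pairs,
    each array of shape (M, 2)."""
    d = np.asarray(pred_corners, float) - np.asarray(gt_corners, float)
    return float(np.mean(np.linalg.norm(d, axis=1)))
```

Note that RMSE averages squared component-wise deviations before the root, while ACE averages per-corner Euclidean distances, so the two metrics respond differently to a few large outlier corners.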
Figure 6 and Table 1 show that the model performs particularly well in regions with regular structural patterns, such as farmland, with an average RMSE of 0.85 px. In urban scenes with sharp boundaries and complex geometric structures, OSR-Net maintains high registration accuracy, with an average RMSE of 0.98 px. Even over challenging water bodies, where cross-modal correspondence is weak, the model achieves competitive performance with an average RMSE of 1.25 px, demonstrating its robustness under different imaging conditions. In mixed land cover areas, the average RMSE is 1.10 px, further reflecting the adaptability of the proposed framework to complex and heterogeneous scenes.
Figure 6.
Registration accuracy of the network across various land cover types.
Table 1.
Accuracy evaluation of the network across different land cover types.
In addition to the overall accuracy metrics, the experiments also verified the stability of the model. The standard deviations of the errors (RMSE_std < 0.10, ACE_std < 0.44) indicate that the predictions are highly consistent and reliable and are not significantly affected by scene differences. In conclusion, the quantitative assessment jointly verifies the effectiveness and practical value of OSR-Net in handling the complexity of optical–SAR cross-modal registration from two dimensions: accuracy and statistical stability.
3.3. Comparative Experimental Analysis
To comprehensively evaluate the registration performance of the proposed OSR-Net, four representative baseline methods were selected, spanning traditional feature-matching approaches and deep learning-based methods.
For the traditional methods, OS-SIFT [19] and RIFT2 [36] were adopted. OS-SIFT is a handcrafted feature-based method specifically designed for optical–SAR image registration, which exploits gradient information to achieve robust matching. RIFT2, based on phase congruency features, adapts better to multi-modal remote sensing images with nonlinear radiation distortion (NRD).
For the deep learning comparison, the Deep Homography Network (DHN) [17] was selected as the main benchmark. DHN is one of the first end-to-end frameworks capable of directly regressing the geometric transformation parameters between image pairs. Although it was not originally designed for remote sensing image registration, its core idea of direct transformation regression makes it a suitable reference model for this paper. In addition, XFeat [37], a recent deep learning-based registration method, was also evaluated. XFeat incorporates multi-modal feature extraction and robust matching strategies, achieving slightly better registration accuracy than DHN in most scenarios, particularly in urban and farmland areas with complex structures.
Figure 7, Figure 8, Figure 9 and Figure 10 present the registration results of OS-SIFT, RIFT2, DHN, XFeat, and OSR-Net. The red rectangles mark the predicted registered positions, while the yellow rectangles indicate the ground-truth alignment.
Figure 7.
Registration results in water-body regions using different methods. (a) shows the target image to be registered, (b) shows the registration result using OS-SIFT, (c) shows the registration result using RIFT2, (d) shows the registration result using DHN, (e) shows the registration result using XFeat, and (f) shows the registration result using the proposed OSR-Net. The red boxes denote predicted positions, and the yellow boxes represent ground-truth locations.
Figure 8.
Registration results in urban areas using different methods.
Figure 9.
Registration results in farmland area cover regions using different methods.
Figure 10.
Registration results in mixed land cover regions using different methods.
Table 2 summarizes the registration accuracy of the different methods for four representative land cover types (water, urban, farmland, and mixed areas). Visual and quantitative comparisons show that the traditional methods (OS-SIFT and RIFT2) usually produce higher errors than the deep learning methods, reflecting their limitations in handling the strong radiometric differences between optical and SAR images. OS-SIFT achieves relatively consistent but coarse alignment, showing that gradient-based features are generally robust but limited in accuracy. RIFT2 performs slightly better in some areas, especially over water surfaces and mixed regions, but is unstable in farmland scenes, where repeated textures and insufficient keypoints hamper phase-congruency matching.
Table 2.
Quantitative comparison of registration accuracy across different methods.
Compared with traditional methods, DHN achieves lower overall error, indicating the potential of end-to-end learning in multi-modal registration. XFeat slightly outperforms DHN by leveraging enhanced multi-modal feature extraction and robust matching, resulting in more precise alignment in most scenarios. However, the experimental results show that both DHN and XFeat still fall short of OSR-Net, which benefits from specialized multi-modal feature extraction, channel attention, and multi-scale prediction.
Across all land cover categories, OSR-Net consistently achieves the highest registration accuracy, with an average RMSE of about 1.10 pixels or lower and an ACE of about 5 pixels. The model maintains stable performance in diverse scenarios, including complex urban structures, regular farmland, weakly textured water surfaces, and highly heterogeneous mixed regions, indicating strong generalization ability and robustness under different imaging conditions.
In summary, traditional methods are limited by the sensitivity of handcrafted features to modality differences. While DHN demonstrates end-to-end learning capability, it lacks modality-specific design. XFeat shows moderate improvement over DHN through enhanced feature extraction, yet OSR-Net achieves the most reliable and accurate registration results via the synergy of multi-modal feature extraction, channel attention, and multi-scale prediction modules.
3.4. Ablation Study and Loss Function Analysis
3.4.1. Ablation Study
To evaluate the effectiveness of each module and their cooperation, we conducted an ablation study. The experiment started with a base network containing only the multi-modal feature extraction (MFE) module and then successively integrated the multi-scale channel attention (MSCA) module, the standard spatial transformer network (STN), and the improved spatial transformer network (ISTN). The proportion of samples with registration error no greater than 3 pixels (≤3 px) and 1 pixel (≤1 px) was used as the main evaluation index. The results (Table 3) show the impact of each component on the whole system.
Table 3.
Results of the ablation experiments.
Specifically, the base model with only the MFE module achieves registration success rates of 66.4% (≤3 px) and 50.2% (≤1 px). This shows that MFE effectively maps heterogeneous images into a unified high-level semantic feature space through the ResNet-18 backbone, reducing pixel-level radiometric differences and aligning structured targets at the feature level.
After the MSCA module was added, the two indicators increased to 70.2% and 52.9%, respectively, indicating that the attention mechanism contributes positively to cross-modal feature discrimination. MSCA adaptively reweights the channel features relevant to the registration task, improving accuracy in areas with repetitive textures or weak features, such as farmland.
After further introducing the standard STN, the model performance increased to 77.4% (≤3 px) and 58.2% (≤1 px), verifying the effectiveness of the combination of geometric parameter estimation and end-to-end spatial transformation learning.
Finally, the proposed ISTN was used to replace the STN to achieve the best performance, reaching 86.4% (≤3 px) and 60.8% (≤1 px), respectively. Compared with the standard STN, the ISTN combines structural preservation constraints during sampling to ensure smooth transitions along boundaries, thereby improving sub-pixel-level registration accuracy.
In conclusion, the results of the ablation study indicate that the proposed modules operate in coordination, jointly forming an efficient, collaborative OSR-Net.
3.4.2. Multi-Constraint Joint Loss Function Analysis
Beyond the network architecture, designing appropriate optimization objectives is essential for stable and effective training, especially in challenging cross-modal registration scenarios. The multi-constraint joint loss function proposed in this paper integrates the affine matrix loss, structural similarity loss, reference consistency loss, and keypoint consistency loss, providing complementary supervision at global and local scales [8].
- 1. Convergence Behavior of Sub-Losses
The total loss stabilizes after about 70 epochs, and all sub-loss terms show coordinated convergence (Figure 11). The affine matrix loss decreases fastest in the early training stage, indicating that the network quickly captures the global geometric relationship and that coarse-scale geometric supervision accelerates global convergence [3]. Meanwhile, the SSIM and keypoint consistency losses decrease gradually in later stages, reflecting the continuous refinement of local structure alignment. The overall trend reflects a coarse-to-fine learning strategy that enhances the robustness and reliability of multi-modal registration.
Figure 11.
Convergence curves of the multi-constraint joint optimization losses.
- 2. Weight Sensitivity and Trade-offs
To analyze the influence of the loss-weight settings on model performance, comparative experiments with different parameter combinations (α, β, γ, δ) were carried out. An overly high matrix loss weight accelerates geometric convergence but degrades local structure preservation, while an insufficient keypoint consistency weight reduces fine-grained alignment accuracy. The combination α = 1.0, β = 0.1, γ = 0.5, and δ = 1.0 was finally selected to ensure both global geometric accuracy and local structural fidelity. These findings are consistent with previous studies, emphasizing that multi-task or multi-constraint learning benefits from careful weighting to avoid over-fitting to a single objective [10].
- 3. Geometric Registration Accuracy
To further quantify the geometric registration performance, three metrics were employed: translation error (E_trans), rotation error (E_rot), and scale error (E_scale).
Figure 12 illustrates the evolution of the three error metrics during training. The model converges after around 40 epochs, with the rotation error showing the most significant decline, the translation error remaining consistently low, and the scale error approaching zero. These results indicate that the proposed OSR-Net achieves stable and accurate modeling across all geometric transformation components.
Figure 12.
Evaluation of the multi-constraint joint loss function.
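The three error metrics can be obtained by decomposing the predicted and ground-truth 2 × 3 affine matrices; assuming negligible shear (reasonable for the similarity-like perturbations used here), a sketch is:

```python
import numpy as np

def affine_components(theta):
    """Decompose a 2x3 affine into translation, rotation (degrees), and
    mean isotropic scale; shear is assumed negligible."""
    A, t = theta[:, :2], theta[:, 2]
    rot = np.degrees(np.arctan2(A[1, 0], A[0, 0]))
    scale = float(np.sqrt(np.linalg.norm(A[:, 0]) * np.linalg.norm(A[:, 1])))
    return t, rot, scale

def geometric_errors(theta_pred, theta_gt):
    """E_trans (px), E_rot (deg), and E_scale between two affines."""
    tp, rp, sp = affine_components(theta_pred)
    tg, rg, sg = affine_components(theta_gt)
    return (float(np.linalg.norm(tp - tg)), abs(rp - rg), abs(sp - sg))
```

For a prediction that rotates by 10°, scales by 1.1, and translates by (3, 4) px relative to the ground truth, this yields E_trans = 5 px, E_rot = 10°, and E_scale = 0.1.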
- 4. Robustness to Cross-Modal Challenges
OSR-Net organically combines global geometric constraints, local keypoint consistency constraints, and radiometric consistency constraints. During training, it simultaneously captures structural information at different scales and radiometric characteristics across modalities, effectively addressing complex problems such as nonlinear radiometric distortion, local geometric deformation, and structural ambiguity between optical and SAR images.
In this framework, the global affine constraint ensures consistency of the overall spatial position, the local keypoint constraint enhances alignment of detailed structures, and the radiometric consistency constraint suppresses mismatches in brightness, contrast, and texture caused by differences in sensor imaging mechanisms. This multi-constraint collaborative optimization enables OSR-Net to achieve stable, high-precision cross-modal registration in various complex scenarios, including urban areas, cultivated land, and mixed-terrain environments. Compared with traditional methods, such as those relying solely on global affine supervision or handcrafted feature matching [38], the multi-constraint strategy significantly enhances the robustness and generalization ability of the network in the face of cross-modal radiometric differences and local deformations, ensuring consistent and reliable registration outcomes across diverse geographical environments and imaging conditions.
- 5. Overall Implications and Significance
This loss system simultaneously guides the network toward global alignment, local geometric consistency, structural similarity, and radiometric reliability, ensuring the robustness, accuracy, and physical interpretability of optical–SAR registration. The strong quantitative and qualitative results show that the loss system supports practical applications requiring accurate optical–SAR registration.
4. Conclusions
This study presents a unified deep learning-based automatic registration framework, termed OSR-Net, to address the substantial imaging differences and nonrigid geometric distortions between optical and SAR imagery. The multi-modal feature extraction (MFE), multi-scale channel attention (MSCA), multi-scale affine transformation prediction (MATP), and improved spatial transformer network (ISTN) modules together enable end-to-end learning from feature representation to geometric transformation prediction. Comprehensive experiments on the MultiResSAR dataset demonstrate the adaptability and registration accuracy of the method under different land cover scenarios.
The main contribution of this work is the development of a multi-constraint joint optimization loss system, which effectively balances geometric accuracy, structural consistency, and physical fidelity. This system integrates four complementary constraints: affine matrix supervision, geometric consistency of feature points, cross-modal structural similarity, and reference preservation. A key innovation is the inclusion of an adaptive weighting strategy for training plans, which dynamically balances these objectives and effectively alleviates the common weight-sensitivity problem in multi-task learning. This mechanism enables the model to converge rapidly to a reasonable global transformation during early training and then gradually refine the local structure alignment in the later stages.
The evaluation results confirmed the effectiveness of OSR-Net, which achieved high registration accuracy across land cover types: the average RMSE was 0.85 pixels for farmland, 0.98 pixels for urban areas, 1.10 pixels for mixed areas, and 1.25 pixels for water bodies, significantly better than the traditional methods. The ablation study further verified the contribution of each component, showing that the complete model (MFE + MSCA + ISTN) reached a registration success rate of 86.4% at an error of ≤3 pixels and 60.8% at the sub-pixel level (≤1 pixel), a significant improvement over the baseline configuration.
Future work will focus on three directions:
- Investigating the theoretical interpretability and optimization behavior of the dynamic weighting mechanism within the proposed loss system;
- Validating its generalization capability on larger and more diverse multi-sensor datasets;
- Extending the framework to handle nonrigid transformations and high-dimensional multi-modal tasks to enhance adaptability in complex or extreme environments.
Author Contributions
Conceptualization, Y.Z.; methodology, Y.Z. and X.X.; software, Y.S. and J.Y.; validation, J.Y., J.Z. and M.W.; investigation, A.Z. and Q.L.; writing—original draft preparation, Y.Z.; writing—review and editing, S.C. and X.X.; supervision, S.C.; project administration, S.C.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data presented in this study are available in Computer Vision and Pattern Recognition at https://doi.org/10.48550/arXiv.2502.01002 (accessed on 6 May 2025), reference number arXiv:2502.01002. These data were derived from the following resources available in the public domain: https://github.com/betterlll/Multi-Resolution-SAR-dataset-.
Acknowledgments
In the Experimental Section, the MultiResSAR dataset is used. The author also thanks the anonymous reviewers and editors for their insightful opinions and useful suggestions on our article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Velesaca, H.O.; Bastidas, G.; Rouhani, M.; Sappa, A.D. Multimodal image registration techniques: A comprehensive survey. Multimed. Tools Appl. 2024, 83, 63919–63947. [Google Scholar] [CrossRef]
- Zhu, B.; Zhou, L.; Pu, S.; Fan, J.; Ye, Y. Advances and challenges in multimodal remote sensing image registration. IEEE J. Miniaturization Air Space Syst. 2023, 4, 165–174. [Google Scholar] [CrossRef]
- Feng, R.; Shen, H.; Bai, J.; Li, X. Advances and opportunities in remote sensing image geometric registration: A systematic review of state-of-the-art approaches and future research directions. IEEE Geosci. Remote Sens. Mag. 2021, 9, 120–142. [Google Scholar] [CrossRef]
- Schmitt, M.; Hughes, L.H.; Zhu, X.X. The SEN1-2 dataset for deep learning in SAR-optical data fusion. arXiv 2018, arXiv:1807.01569. [Google Scholar] [CrossRef]
- Sommervold, O.; Gazzea, M.; Arghandeh, R. A survey on SAR and optical satellite image registration. Remote Sens. 2023, 15, 850. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Xiang, Y.; Wang, F.; You, H. OS-SIFT: A robust SIFT-like algorithm for high-resolution optical-to-SAR image registration in suburban areas. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3078–3090. [Google Scholar] [CrossRef]
- Fan, J.; Ye, Y.; Li, J.; Liu, G.; Li, Y. A novel multiscale adaptive binning phase congruency feature for SAR and optical image registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5235216. [Google Scholar] [CrossRef]
- Ye, Y.; Shen, L. Hopc: A novel similarity metric based on geometric structural properties for multi-modal remote sensing image matching. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 9–16. [Google Scholar]
- Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
- Kuppala, K.; Banda, S.; Barige, T.R. An overview of deep learning methods for image registration with focus on feature-based approaches. Int. J. Image Data Fusion. 2020, 11, 113–135. [Google Scholar] [CrossRef]
- Bürgmann, T.; Koppe, W.; Schmitt, M. Matching of TerraSAR-X derived ground control points to optical image patches using deep learning. ISPRS J. Photogramm. Remote Sens. 2019, 158, 241–248. [Google Scholar] [CrossRef]
- Li, L.; Han, L.; Ye, Y. Self-supervised keypoint detection and cross-fusion matching networks for multimodal remote sensing image registration. Remote Sens. 2022, 14, 3599. [Google Scholar] [CrossRef]
- Chen, J.; Xie, H.; Zhang, L.; Hu, J.; Jiang, H.; Wang, G. SAR and optical image registration based on deep learning with co-attention matching module. Remote Sens. 2023, 15, 3879. [Google Scholar] [CrossRef]
- Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
- Du, W.-L.; Zhou, Y.; Zhao, J.; Tian, X. K-means clustering guided generative adversarial networks for SAR-optical image matching. IEEE Access 2020, 8, 217554–217572. [Google Scholar] [CrossRef]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. Deep image homography estimation. arXiv 2016, arXiv:1606.03798. [Google Scholar] [CrossRef]
- De Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Sokooti, H.; Staring, M.; Išgum, I. A deep learning framework for unsupervised affine and deformable image registration. Med. Image Anal. 2019, 52, 128–143. [Google Scholar] [CrossRef]
- Xiang, D.; Xie, Y.; Cheng, J.; Xu, Y.; Zhang, H.; Zheng, Y. Optical and SAR image registration based on feature decoupling network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5235913. [Google Scholar] [CrossRef]
- Ye, Y.; Tang, T.; Zhu, B.; Yang, C.; Li, B.; Hao, S. A multiscale framework with unsupervised learning for remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622215. [Google Scholar] [CrossRef]
- Hughes, L.H.; Marcos, D.; Lobry, S.; Tuia, D.; Schmitt, M. A deep learning framework for matching of SAR and optical imagery. ISPRS J. Photogramm. Remote Sens. 2020, 169, 166–179. [Google Scholar] [CrossRef]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3614–3633. [Google Scholar] [CrossRef]
- Wang, Y.; Tang, X.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Cross-modal remote sensing image–text retrieval via context and uncertainty-aware prompt. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 11384–11398. [Google Scholar] [CrossRef]
- Chen, Z.; Badrinarayanan, V.; Drozdov, G.; Rabinovich, A. Estimating depth from rgb and sparse sensing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 167–182. [Google Scholar]
- Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1871–1880. [Google Scholar]
- Zhang, H.; Lei, L.; Ni, W.; Tang, T.; Wu, J.; Xiang, D.; Kuang, G. Explore better network framework for high-resolution optical and SAR image matching. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4704418. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
- Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7482–7491. [Google Scholar]
- Sotiras, A.; Davatzikos, C.; Paragios, N. Deformable medical image registration: A survey. IEEE Trans. Med. Imaging 2013, 32, 1153–1190. [Google Scholar] [CrossRef] [PubMed]
- Zhang, W.; Zhao, R.; Yao, Y.; Wan, Y.; Wu, P.; Li, J.; Li, Y.; Zhang, Y. Multi-resolution SAR and optical remote sensing image registration methods: A review, datasets, and future perspectives. arXiv 2025, arXiv:2502.01002. [Google Scholar] [CrossRef]
- Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the variance of the adaptive learning rate and beyond. arXiv 2019, arXiv:1908.03265. [Google Scholar]
- Li, J.; Shi, P.; Hu, Q.; Zhang, Y. RIFT2: Speeding-up RIFT with a new rotation-invariance technique. arXiv 2023, arXiv:2303.00319. [Google Scholar]
- Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E.R. XFeat: Accelerated features for lightweight image matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691. [Google Scholar]
- Suri, S.; Reinartz, P. Mutual-information-based registration of TerraSAR-X and Ikonos imagery in urban areas. IEEE Trans. Geosci. Remote Sens. 2009, 48, 939–949. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.