Article

Global Optical and SAR Image Registration Method Based on Local Distortion Division

1 High-Tech Institute of Xi’an, Xi’an 710025, China
2 College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(9), 1642; https://doi.org/10.3390/rs17091642
Submission received: 7 April 2025 / Revised: 30 April 2025 / Accepted: 30 April 2025 / Published: 6 May 2025
(This article belongs to the Special Issue Temporal and Spatial Analysis of Multi-Source Remote Sensing Images)

Abstract:
Variations in terrain elevation cause images acquired under different imaging modalities to deviate from a linear mapping relationship. This effect is particularly pronounced between optical and SAR images, where the range-based imaging mechanism of SAR sensors leads to significant local geometric distortions, such as perspective shrinkage and occlusion. As a result, it becomes difficult to represent the spatial correspondence between optical and SAR images using a single geometric model. To address this challenge, we propose a global optical-SAR image registration method that leverages local distortion characteristics. Specifically, we introduce a Superpixel-based Local Distortion Division (SLDD) method, which defines superpixel region features and segments the image into local distortion and normal regions by computing the Mahalanobis distance between superpixel features. We further design a Multi-Feature Fusion Capsule Network (MFFCN) that integrates shallow salient features with deep structural details, reconstructing the dimensions of digital capsules to generate feature descriptors encompassing texture, phase, structure, and amplitude information. This design effectively mitigates the information loss and feature degradation problems caused by pooling operations in conventional convolutional neural networks (CNNs). Additionally, a hard negative mining loss is incorporated to further enhance feature discriminability. Feature descriptors are extracted separately from regions with different distortion levels, and corresponding transformation models are built for local registration. Finally, the local registration results are fused to generate a globally aligned image. Experimental results on public datasets demonstrate that the proposed method achieves superior performance over state-of-the-art (SOTA) approaches in terms of Root Mean Squared Error (RMSE), Correct Match Number (CMN), Distribution of Matched Points (Scat), Edge Fidelity (EF), and overall visual quality.

1. Introduction

Optical remote sensing and Synthetic Aperture Radar (SAR) technology are two principal remote sensing methods for obtaining information about the Earth’s surface [1,2,3]. Optical remote sensing generates images by recording the reflection or radiation of visible light, while SAR probes the surface with microwave signals to generate images [4]. Because optical and SAR sensors operate in different electromagnetic bands and rely on different physical imaging mechanisms, the images they acquire exhibit distinct characteristics and advantages. Optical images provide high-resolution spatial information and rich spectral information, enabling precise identification of ground features; however, they cannot be acquired under cloudy or nighttime conditions [5]. In contrast, SAR can penetrate clouds and observe at night, offering all-weather observation capability [6]. However, terrain and surface cover (such as buildings, trees, and water bodies) affect the propagation and reflection of microwave signals, leading to multipath reflection and scattering, which cause distortion and interference in the images [7]. These factors make the registration of optical and SAR images an ongoing and challenging research problem [8].
Image registration methods are generally divided into two categories: region-based methods and feature-based methods [9]. Region-based methods evaluate the similarity of image blocks using template matching techniques; classic similarity measures include mutual information, normalized cross-correlation, and cumulative residual entropy [10]. Ye et al. [11] replaced gradient information with phase congruency information to construct the histogram of orientated phase congruency (HOPC), improving the registration accuracy of multimodal images. Xiang et al. [12] mapped 3D image features and employed phase correlation for image registration. Xiang et al. [13] further improved the phase congruency model. In summary, incorporating structural features significantly enhances the registration accuracy of template matching methods. However, these methods often overlook the geometric relationships and spatial structures between local regions, limiting their ability to handle geometric changes caused by local distortions and ultimately degrading matching performance. Additionally, in certain algorithms, local features may be overshadowed by dominant global features, diminishing their contribution to the registration process. These limitations pose significant challenges for region-based methods in addressing complex distortions [14].
Compared with region-based global matching methods, feature-based methods incur lower computational costs [15]. They rely on key points or other salient features for registration, which keeps the computational cost relatively low; this is particularly advantageous for large-scale images, as it significantly improves registration efficiency. Feature-based methods mainly involve four steps: control point detection, feature description, feature matching, and image transformation. The Scale-Invariant Feature Transform (SIFT) and its variants are the most common handcrafted descriptors [16]. Dellinger et al. [17] proposed a gradient computation method specifically designed for SAR images that is robust to speckle noise. Yu et al. [18] designed a rotation-invariant amplitudes of log-Gabor orientation histograms (RI-ALGH) descriptor and proposed an improved SAR and optical image registration algorithm based on a nonlinear SIFT framework. The OS-SIFT algorithm [19] uses the Sobel and ROEWA operators to compute the gradient orientations and magnitudes of optical and SAR images, respectively, enabling image registration within the classic SIFT framework. Jia et al. [20] proposed a registration algorithm based on multi-scale directional phase consistency mapping (MSPCO), which better retains edge and texture information. Wang et al. [21] proposed the scale-invariant SOI-SIFT algorithm, which enhances the edge information in optical and SAR images and then uses the ROEWA operator to construct SIFT-like feature vector descriptors. Although these strategies can generally achieve a certain degree of point matching at the global scale, significant positional shifts caused by distortions in local regions remain difficult to eliminate. Such a shift is not merely a pixel-level error but a mismatch arising from systemic differences in image geometric characteristics. For instance, in local regions such as ridgelines or edges with steep terrain undulations, slight variations in control point positions may lead to mismatched point pairs that are difficult to rectify through global optimization strategies. Furthermore, such local shifts can accumulate in subsequent geometric transformation and registration steps, ultimately reducing the overall registration accuracy. Therefore, to effectively address point mismatches caused by local distortions, more robust point matching methods are needed to enhance the reliability and accuracy of optical and SAR image registration.
In response to these issues, deep learning-based optical and SAR image registration methods have attracted significant attention in the research community [22,23]. Merkle et al. [24] introduced a Siamese feature extraction network employing dilated convolutions to estimate feature similarity by correlating 2D feature vectors, representing the first effort in deep learning-based optical and SAR template image matching. Ye et al. [25] used a pre-trained VGG16 model to extract CNN features and combined them with SIFT features to construct a joint feature for optical and SAR image registration. Zhang et al. [26] addressed the significant impact of the sample selection strategy during matching network training, enhancing training performance by maximizing the feature distance between positive samples and easily confusable negative samples. Liu et al. [27] proposed a dual-branch Transformer-CNN matching network for extracting multi-scale image features. Quan et al. [28] proposed self-distillation learning and ordered similarity metrics, leveraging hybrid Siamese and pseudo-Siamese networks to improve the learning of isomorphic features from optical and SAR heterogeneous images. Ye et al. [29] proposed an unsupervised image registration method that demonstrates stronger robustness when handling images with geometric and radiometric distortions.
However, local distortions alter the feature representation of the same object, impairing feature extraction and matching. Moreover, as CNNs perform convolution and pooling layer by layer during feature extraction, the features captured at different layers often influence one another, leading to feature aliasing, information loss, and semantic ambiguity [30]. Specifically, in traditional CNNs, shallow and deep features may blend within the same feature descriptor, making it difficult for the model to distinguish information at different levels and resulting in feature aliasing. Certain details may also be discarded in pooling layers, compromising feature integrity and causing information loss. Finally, the mixed features leave the final representation insufficiently distinct for optical and SAR registration tasks, degrading model performance and producing semantic ambiguity.
Most importantly, the majority of the above registration methods follow the global registration pipeline, including feature detection, feature description, descriptor matching, and transformation estimation [31,32,33], with only the feature extraction or matching computation methods replaced. Local distortions may cause the same ground object in the image to exhibit inconsistent geometric features at different locations or scales. For example, the presence of terrain undulations or occlusions can cause changes in the shape, size, or position of the same ground object in satellite images, making the correspondence between control points or feature areas in the registration process ambiguous or uncertain, thus reducing registration accuracy [34]. Moreover, local distortions increase the complexity and uncertainty of transformation parameters, making the parameter estimation in the registration process more difficult. During the registration process, a transformation model is typically assumed to describe the geometric transformation relationship between images. For flat terrain images where surface elevation changes can be ignored, affine or projective geometric transformations [35] and other linear models can be used to describe spatial mapping relationships. However, when there is significant terrain elevation variation, images captured under different imaging geometries will no longer satisfy the linear mapping relationship. Local distortions may cause these assumed transformation models to be inaccurate, failing to capture the true transformation relationship between images, thus affecting registration accuracy. In optical and SAR image registration especially, the range-based imaging mechanism of SAR sensors can generate significant local geometric distortions under terrain undulation conditions, such as perspective shrinkage and shadowing, making it difficult to describe the spatial mapping relationship between optical and SAR images through explicit geometric relationships. Moreover, with the increase in image spatial resolution, even extremely small surface elevation changes may cause significant local geometric distortions. As shown in Figure 1, after performing spatial registration using global geometric transformations, the lower part of the image (blue rectangular annotation) is accurately registered, while the upper part (red rectangular annotation) exhibits noticeable registration errors.
To address the aforementioned issues, we propose a global optical and SAR image registration method based on local distortion division. First, a local distortion region division algorithm based on superpixel features (SLDD) is introduced. By evaluating the distortion characteristics of different regions in optical and SAR images, the images are segmented based on distortion levels. This approach prevents inaccuracies in the global transformation model caused by local distortions and improves registration accuracy. We apply a fast multi-scale superpixel segmentation method based on a minimum spanning tree [36] to segment the images, then calculate the mean and variance of the gray levels in each superpixel region as features. The similarity between superpixels in optical and SAR images is measured using the Mahalanobis distance, and the images are segmented based on their distortion levels. Next, inspired by capsule networks, we propose a Multi-Feature Fusion Capsule Network (MFFCN) to extract both shallow significant features and deep detailed features from optical and SAR images. Unlike traditional capsule networks, we set the reconstruction dimension of the capsules to 4, which accelerates dynamic routing computation and preserves key features for control point matching in optical and SAR registration tasks, such as texture, phase, structure, and amplitude. Additionally, a distance metric is defined for digital capsule feature descriptors to enhance robustness against local distortions. Furthermore, we design a loss function based on hard negative sample mining to further improve the accuracy of the model in handling locally distorted images. Finally, the regions divided by SLDD are input into MFFCN for feature extraction and matching. In contrast to the conventional pipeline involving feature extraction, descriptor construction, and matching, we apply region-specific transformation models according to varying levels of distortion. Furthermore, unlike traditional registration methods (e.g., SIFT, Harris) that depend on handcrafted keypoint extraction and similarity measures, our method introduces an innovative framework that leverages superpixel geometric centroids as feature points. By leveraging the inherent spatial coherence of superpixels, our approach ensures a more uniform and robust distribution of feature points across the image, thus enhancing the overall registration performance. Such spatially balanced distribution not only captures comprehensive structural cues but also reduces localization errors commonly induced by clustering or sparsity in traditional keypoints. Specifically, when distorted regions are categorized into N classes, the image undergoes N region-specific resampling operations, resulting in N distinct outputs. The regional registration outputs are subsequently fused to obtain a globally aligned result.
The main contributions of this paper are summarized as follows:
(1) We propose a local distortion region division algorithm, SLDD, based on superpixel features. This method defines a superpixel region feature and quickly segments optical and SAR images based on the degree of distortion by calculating the Mahalanobis distance between these features.
(2) We design an MFFCN that constructs primary capsules by fusing shallow significant features and deep detail features. These capsules are then reconstructed into capsule-based feature descriptors containing the information required for registration. Additionally, we define a Mahalanobis distance-based similarity metric for the digital capsule feature descriptors and construct a loss function that incorporates hard negative sample mining, enabling effective model training.
(3) We propose a global optical and SAR image registration method based on local distortion division, combining SLDD and MFFCN. This method estimates and segments the distortion in image regions, applies different transformation models to different distortion regions, and ultimately performs global image registration by fusion and reconstruction, thereby overcoming the issue of single registration models being vulnerable to local distortions.
The structure of this paper is as follows: Section 2 introduces the details of the proposed method; Section 3 presents the registration results of our method, along with comparisons to other state-of-the-art methods. Finally, discussions and conclusions are given in Section 4 and Section 5.

2. Methodology

In this section, we will provide a detailed introduction to the proposed optical and SAR image registration method. Figure 2 illustrates the flowchart of the proposed method. The method consists of SLDD and MFFCN and primarily includes three key steps: local distortion region assessment and partition, feature descriptor extraction and matching, and geometric transformation and fusion reconstruction.

2.1. Superpixel Segmentation

Superpixels are a method of aggregating pixels based on similar properties in an image. This process groups neighboring pixels with similar grayscale values, colors, brightness, textures, and other features into contiguous regions. The generation of superpixels is mainly achieved by clustering similar pixels in the image. Unlike pixel-by-pixel processing, superpixel generation algorithms cluster based on local region features, which effectively organizes the image information. Therefore, superpixels, as collections of pixels with similar attributes, share similar feature information within the superpixel. This makes them a holistic replacement for pixel-by-pixel processing in subsequent information processing tasks. Although superpixel algorithms, such as Simple Linear Iterative Clustering (SLIC) [37], have achieved good results in natural images, they are challenging to directly apply to SAR image superpixel segmentation [38]. Therefore, we employ the MST-based superpixel segmentation method to segment the reference and sensed images. Compared to conventional superpixel segmentation methods, this approach is more suitable for handling speckle noise and complex texture structures in SAR images.
First, backscatter dissimilarity, an edge penalty, and homogeneity dissimilarity are combined to define the dissimilarity measure d(a, b), adapting the measure to the multiplicative speckle noise characteristic of SAR images. Two 5 × 5 local areas, G_a and G_b, are defined around the central pixels a and b, with mean amplitudes μ_a and μ_b, respectively. Based on the backscatter model, the backscatter dissimilarity is calculated as:
$$\Delta S(a,b) = \sqrt{\frac{N}{\pi}} \cdot \frac{1 - \min\left(\dfrac{\mu_a}{\mu_b}, \dfrac{\mu_b}{\mu_a}\right)}{\sqrt{\dfrac{4-\pi}{|G_a|}\min\left(\dfrac{\mu_a}{\mu_b}, 1\right)^{2} + \dfrac{4-\pi}{|G_b|}\min\left(\dfrac{\mu_b}{\mu_a}, 1\right)^{2}}}$$
where |G_a| and |G_b| denote the numbers of pixels in the respective regions, and N denotes the number of observations. This dissimilarity measure, computed from the mean amplitude ratio, exhibits strong robustness against noise.
Edge information is determined by the edge similarity measure (ESM), which is calculated through the transformation of the mean ratio between two windows:
$$\mathrm{ESM}(a) = 1 - \mathrm{Ratio}(a)$$
where Ratio ( a ) represents the mean amplitude ratio of pixel a.
The edge penalty function increases the dissimilarity between edge pixels, preventing them from being assigned to the same superpixel. It is defined as follows:
$$\Delta E(a,b) = \max\left\{\mathrm{ESM}(a), \mathrm{ESM}(b)\right\}.$$
Additionally, homogeneity information also plays an important role in the dissimilarity measurement. The homogeneity dissimilarity is measured by calculating the coefficient of variation (CoV) of the region, defined as:
$$\Delta H(a,b) = \left|\mathrm{CoV}\left(G_a\right) - \mathrm{CoV}\left(G_b\right)\right|.$$
Next, combining the three dissimilarity components mentioned above, the final dissimilarity measurement is obtained:
$$d(a,b) = \Delta S(a,b) + \Delta E(a,b) + \Delta H(a,b).$$
Finally, the Boruvka algorithm [39] is employed to construct the minimum spanning tree. After obtaining the minimum spanning tree structure, edges with larger weights are cut by setting a dissimilarity threshold, and independent superpixel regions are divided, achieving superpixel segmentation of the image.
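To make the construction of the MST edge weights concrete, the following sketch assembles d(a, b) from the three dissimilarity terms for two 5 × 5 amplitude windows. The exact normalization of ΔS and the sign convention of ESM follow our reading of the equations above, and all function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def backscatter_dissimilarity(win_a, win_b, n_looks=1):
    """Delta_S: amplitude-ratio test between two local windows.

    The normalisation constants follow our reading of the backscatter dissimilarity
    above; the essential quantity is the symmetric ratio min(mu_a/mu_b, mu_b/mu_a).
    """
    mu_a, mu_b = win_a.mean(), win_b.mean()
    r_sym = min(mu_a / mu_b, mu_b / mu_a)          # symmetric ratio in (0, 1]
    num = np.sqrt(n_looks / np.pi) * (1.0 - r_sym)
    den = np.sqrt((4.0 - np.pi) * (min(mu_a / mu_b, 1.0) ** 2 / win_a.size
                                   + min(mu_b / mu_a, 1.0) ** 2 / win_b.size))
    return num / (den + 1e-12)

def esm(ratio):
    """Edge similarity measure: 1 minus the mean amplitude ratio of a pixel."""
    return 1.0 - ratio

def edge_penalty(ratio_a, ratio_b):
    """Delta_E: penalise linking two pixels when either lies on an edge."""
    return max(esm(ratio_a), esm(ratio_b))

def homogeneity_dissimilarity(win_a, win_b):
    """Delta_H: difference of the coefficients of variation of the two windows."""
    cov_a = win_a.std() / (win_a.mean() + 1e-12)
    cov_b = win_b.std() / (win_b.mean() + 1e-12)
    return abs(cov_a - cov_b)

def dissimilarity(win_a, win_b, ratio_a, ratio_b):
    """d(a, b): edge weight used when building the minimum spanning tree."""
    return (backscatter_dissimilarity(win_a, win_b)
            + edge_penalty(ratio_a, ratio_b)
            + homogeneity_dissimilarity(win_a, win_b))
```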

2.2. Superpixel-Based Local Distortion Division (SLDD)

Due to differences in terrain effects, imaging mechanisms, and other factors, the details of optical and SAR images differ in local regions, leading to the generation of local distortion. Local distortion makes it challenging for traditional control point selection methods to find robust control points, as these points may exhibit different geometric features in optical and SAR images, leading to a degradation in registration performance, especially in complex terrain or heterogeneous scenes. This paper proposes an image partitioning method based on local distortion, which calculates the superpixel feature similarity between the reference image and the sensed image and divides each image into local distortion regions and normal regions based on the degree of distortion. Specifically, the mean and variance of the grayscale values are used as the features of superpixels, and the Mahalanobis distance is then employed to measure the similarity between superpixels in optical and SAR images. For a superpixel region S containing N pixels $I_1, I_2, \ldots, I_N$, the mean and variance of the grayscale values are calculated as follows:
$$\mu_S = \frac{1}{N}\sum_{i=1}^{N} I_i, \qquad \sigma_S^2 = \frac{1}{N}\sum_{i=1}^{N} \left(I_i - \mu_S\right)^2.$$
Mahalanobis distance is used to measure the distance between a point and a distribution, taking into account the covariance structure of the data and the correlations between the data points, which is why it is widely used in multivariate data analysis and pattern recognition. Given two superpixel regions S 1 and S 2 , the feature vector of each superpixel region is represented as:
$$\xi_1 = \begin{bmatrix} \mu_{S_1} \\ \sigma_{S_1}^2 \end{bmatrix}, \qquad \xi_2 = \begin{bmatrix} \mu_{S_2} \\ \sigma_{S_2}^2 \end{bmatrix}.$$
To evaluate the similarity between the two superpixel features, the Mahalanobis distance D M is employed, and it is given by the following formula:
$$D_M\left(\xi_1, \xi_2\right) = \sqrt{\left(\xi_1 - \xi_2\right)^{T} \Sigma^{-1} \left(\xi_1 - \xi_2\right)}$$
where Σ is the covariance matrix of the feature vectors and Σ^{-1} is its inverse. The covariance matrix Σ is given by:
$$\Sigma = \mathrm{Cov}\left(\xi_1, \xi_2\right) = \begin{bmatrix} \sigma_{\mu}^{2} & \sigma_{\mu\sigma} \\ \sigma_{\mu\sigma} & \sigma_{\sigma}^{2} \end{bmatrix}$$
where σ_μ^2 is the variance of the superpixel grayscale means, σ_σ^2 is the variance of the superpixel grayscale variances, and σ_μσ is the covariance between the mean and variance features.
In summary, given the feature vectors ξ 1 and ξ 2 of two superpixel regions S 1 and S 2 , the calculation formula for superpixel feature values is:
$$V_s = D_M\left(\xi_1, \xi_2\right) = \sqrt{\begin{bmatrix} \mu_{S_1} - \mu_{S_2} \\ \sigma_{S_1}^2 - \sigma_{S_2}^2 \end{bmatrix}^{T} \Sigma^{-1} \begin{bmatrix} \mu_{S_1} - \mu_{S_2} \\ \sigma_{S_1}^2 - \sigma_{S_2}^2 \end{bmatrix}}.$$
We apply the MST algorithm to perform superpixel segmentation on both the reference image and the sensed image. For each superpixel pair, the Mahalanobis distance between the superpixel features is calculated, identifying the position where the feature similarity value V s is minimized. This minimum value indicates the highest similarity between the superpixel in the reference image and its counterpart in the sensed image. Based on this similarity measure, we define the criterion for segmenting regions by local distortion.
$$\begin{cases} V_s > \lambda, & \text{local distortion area} \\ V_s < \lambda, & \text{normal area} \end{cases}$$
$$\lambda = \mu_{V_s} + 3\,\sigma_{V_s}$$
where μ_{V_s} and σ_{V_s} are the mean and standard deviation of the minimum feature distances V_s recorded over all superpixels, and λ is the division threshold.
The proposed SLDD method proceeds as follows. First, the reference image and the sensed image are segmented into superpixel regions using the MST algorithm. To account for local misalignments between the reference and sensed images, a spatial neighborhood with a radius of 30 pixels around each reference superpixel center is defined. Within this neighborhood in the sensed image, the Mahalanobis distance between the feature vectors of superpixels is computed to find the most similar superpixel, and the corresponding minimum distance V_s is recorded for each reference superpixel. Finally, an adaptive threshold λ derived from the distribution of all recorded V_s values is used to classify regions into local distortion areas and normal areas.
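A compact sketch of this division step is given below, assuming superpixel label maps for both images are already available from the MST segmentation. The shared covariance estimate, the centroid-based neighborhood search, and the function names are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def superpixel_features(image, labels):
    """Per-superpixel grey-level mean and variance, plus region centroids."""
    feats, centres = {}, {}
    for lab in np.unique(labels):
        mask = labels == lab
        pix = image[mask].astype(np.float64)
        feats[lab] = np.array([pix.mean(), pix.var()])
        ys, xs = np.nonzero(mask)
        centres[lab] = np.array([ys.mean(), xs.mean()])
    return feats, centres

def sldd_division(ref_img, ref_labels, sen_img, sen_labels, radius=30.0):
    """Classify each reference superpixel as 'distortion' or 'normal'."""
    ref_f, ref_c = superpixel_features(ref_img, ref_labels)
    sen_f, sen_c = superpixel_features(sen_img, sen_labels)

    # shared 2x2 covariance of the (mean, variance) features for the
    # Mahalanobis distance; regularised so the inverse stays well defined
    all_feats = np.array(list(ref_f.values()) + list(sen_f.values()))
    sigma_inv = np.linalg.inv(np.cov(all_feats.T) + 1e-6 * np.eye(2))

    sen_labs = np.array(list(sen_c.keys()))
    sen_centres = np.array([sen_c[l] for l in sen_labs])

    v_s = {}
    for lab, centre in ref_c.items():
        # candidate superpixels within a 30-pixel neighbourhood of the sensed image
        near = sen_labs[np.linalg.norm(sen_centres - centre, axis=1) <= radius]
        if near.size == 0:
            continue
        diffs = [ref_f[lab] - sen_f[l] for l in near]
        v_s[lab] = min(np.sqrt(d @ sigma_inv @ d) for d in diffs)   # best match

    vals = np.array(list(v_s.values()))
    lam = vals.mean() + 3.0 * vals.std()        # adaptive threshold lambda
    return {lab: ('distortion' if v > lam else 'normal') for lab, v in v_s.items()}
```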
As shown in Figure 3(a1–f1) and Figure 4(a1–f1), the visualization results of the superpixel features from Equation (7) are provided, while Figure 3(a2–f2) and Figure 4(a2–f2) present the heatmaps of the superpixel feature regions of the images to be registered. It can be observed that the proposed superpixel features effectively reflect the image distortion in different superpixel regions, covering typical distortion areas such as field-water boundaries, terrain undulations, building elevation areas, and regions with side-view angle variations. Notably, while most detected distortion regions align with true geometric distortions (e.g., building elevation shifts in urban areas), a small subset of false positives may occur in homogeneous farmland regions. These pseudo-distortion signals primarily originate from cross-modal radiometric differences: optical images exhibit smooth vegetation textures in farmland, whereas SAR images present random speckle patterns due to coherent imaging mechanisms. Thus, we can effectively segment local distortion regions and normal regions caused by factors such as imaging principles, signal characteristic differences, and terrain effects. As shown in Figure 3(a3–f3) and Figure 4(a3–f3), the red areas indicate local distortion regions, typically including urban areas, terrain, and areas with building elevation variations, while the remaining areas are considered normal regions.

2.3. Multi-Feature Fusion Capsule Network (MFFCN)

To achieve robust and precise matching between optical and SAR images, it is critical to effectively capture spatial features while being resilient to deformations and modality differences. Traditional convolutional neural networks (CNNs) often suffer from feature aliasing, semantic ambiguity, and information loss due to pooling operations. Capsule networks (CapsNets), with their ability to preserve pose information and spatial relationships, offer a promising alternative. In this work, we propose the Multi-Feature Fusion Capsule Network (MFFCN) to specifically address the challenges of multimodal registration. Unlike standard CapsNets, MFFCN is designed to extract structured descriptors aligned with key physical properties, including texture, phase, structure, and amplitude, ensuring better compatibility between optical and SAR features.
CapsNets introduce a dynamic routing mechanism that effectively mitigates information loss and prevents the degradation of feature representations typically caused by pooling operations in conventional CNNs. This mechanism enhances the efficiency of feature aggregation, enabling CapsNets to exhibit improved resistance to noise and local spatial distortions. As a result, the network demonstrates superior generalization capability, allowing for consistent recognition of similar objects across diverse instances. Integrating CapsNets into Siamese network architectures further enhances the expressiveness of feature representations, improves resistance to geometric deformations, and strengthens overall network robustness, while simultaneously reducing information loss. Moreover, the unique dynamic routing mechanism allows CapsNets to capture spatial relationships between features, enhancing overall network performance.
The dynamic routing mechanism, illustrated in Figure 5, is executed through the following iterative procedure: Each capsule is represented by a vector, where the magnitude indicates the probability of a feature’s existence, and the direction encodes its attributes. Initially, coupling coefficients between lower-level and higher-level capsules are randomly initialized to facilitate the transmission of feature information. Higher-level capsules generate prediction vectors based on outputs from lower-level capsules as follows:
$$\hat{u}_j = W_{ij}\, u_i$$
where u ^ j is the prediction vector of the higher-level capsule, W i j is the connection weight, and u i is the output of the lower-level capsule.
Dynamic routing updates the connection weights through iterations. During each iteration, outputs from lower capsules are aggregated to form the activation vector of higher capsules:
$$s_j = \sum_{i} c_{ij}\, \hat{u}_j.$$
The aggregated vector is then transformed via a nonlinear activation function, typically the Squash function:
$$v_j = \frac{\left\| s_j \right\|^{2}}{1 + \left\| s_j \right\|^{2}} \cdot \frac{s_j}{\left\| s_j \right\|}.$$
Subsequently, the routing logits are updated according to the agreement between the higher-level capsule outputs and the prediction vectors, and the coupling coefficients are renormalized:
$$b_{ij} \leftarrow b_{ij} + \hat{u}_j \cdot v_j, \qquad c_{ij} = \mathrm{softmax}\left(b_{ij}\right)$$
where b_{ij} denotes the routing logit between lower-level capsule i and higher-level capsule j.
This iterative routing continues until convergence, resulting in final capsule outputs that effectively encode complex feature representations.
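The routing loop can be summarized in a few lines of PyTorch-style code. The tensor layout, the three-iteration default, and the function names are assumptions made for illustration; the logit update follows the standard routing-by-agreement formulation described above.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Squash non-linearity: shrinks the vector length into [0, 1) and keeps its direction."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement over prediction vectors.

    u_hat : (batch, n_lower, n_higher, dim) prediction vectors from the lower capsules.
    Returns the higher-level capsule outputs of shape (batch, n_higher, dim).
    """
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits b_ij
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                               # coupling coefficients c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)              # weighted sum of predictions
        v = squash(s)                                         # squashed capsule outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)          # agreement updates the logits
    return v
```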
Although capsule networks offer significant advantages in feature representation and spatial transformation handling, certain challenges remain when applying them to optical and SAR image registration tasks. Typically, digital capsules have higher dimensionality than primary capsules to accommodate complex feature encoding and richer feature combinations. However, this increase in dimensionality complicates the feature representation process. The increased dimensionality adds difficulty to model design and hyperparameter tuning, and the abstract features represented may not align with their actual physical meanings. To address this issue, we set the dimensionality of the digital capsules in our proposed MFFCN to 4, corresponding to the key feature information of texture, phase, structure, and amplitude in optical and SAR image registration. Additionally, to maintain dimensional consistency with other feature descriptors, the feature dimension was set to 32 by replacing the original 1 × 128 vector with a 4 × 32 matrix representation. The detailed structure of the proposed MFFCN is shown in Figure 6, with specific network parameters listed in Table 1.
Due to the special form and significance of the feature descriptors extracted by capsule networks, they cannot be simply reshaped into one-dimensional vectors for calculation using classical Euclidean distance. We define a feature descriptor in the optical image as v i , and its corresponding feature descriptor in the SAR image as u i .
$$v_i = \left[ v_i^{1}, v_i^{2}, v_i^{3}, v_i^{4} \right]^{T}, \qquad u_i = \left[ u_i^{1}, u_i^{2}, u_i^{3}, u_i^{4} \right]^{T}$$
Unlike the Euclidean distance, which treats all feature dimensions as independent and equally scaled, the Mahalanobis distance accounts for the correlations between different feature components by incorporating their covariance structure. This is particularly important when comparing optical and SAR modalities, where the feature distributions often exhibit anisotropy and varying inter-dimensional dependencies. By considering the intrinsic statistical relationships among features, the Mahalanobis distance provides a more accurate and discriminative measurement of cross-modal similarity, thus enhancing the robustness of feature matching between optical and SAR domains. For the feature descriptor v i in the optical image and the feature descriptor u i in the SAR image, the Mahalanobis distance D M v i , u i is defined as:
$$D_M\left(v_i, u_i\right) = \sqrt{\left(v_i - u_i\right)^{T} \Sigma^{-1} \left(v_i - u_i\right)}$$
where Σ is the 4 × 4 covariance matrix of the feature descriptors, describing the correlation between features; Σ 1 is the inverse of the covariance matrix.
To further enhance model training, we apply Hard Negative Mining, which prioritizes difficult negative examples during optimization to strengthen discriminative learning. Specifically, negative samples close to positive pairs in Mahalanobis distance but belonging to different classes are selected for enhanced supervision. By focusing training on these “hard” samples, the model improves its discriminative ability. The corresponding Mahalanobis distance loss incorporating hard negative mining is formulated as:
$$L_{D_M} = \frac{1}{N} \sum_{i=1}^{N} \left[ d_M\left(v_i, u_i^{+}\right) - d_M\left(v_i, u_i^{-}\right) + \alpha \right]_{+}$$
where d_M(v_i, u_i^+) is the Mahalanobis distance between positive examples (correctly matched optical and SAR patches), and d_M(v_i, u_i^-) is the Mahalanobis distance between negative examples (incorrectly matched optical and SAR patches). α is a margin that controls the gap between positive and negative distances, ensuring that the distance for positive examples is smaller than that for negative examples. [x]_+ = max(x, 0) denotes the hinge operation, which returns x when x > 0 and 0 otherwise.
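A sketch of this loss is shown below, assuming the descriptors are batched 4 × 32 matrices and that the hardest negative for each anchor is mined within the batch. The flattened covariance (equivalently, a block form over the four descriptor rows), the batch-level mining, and the function names are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def mahalanobis(v, u, sigma_inv):
    """Mahalanobis distance between batched capsule descriptors.

    v, u      : (batch, 4, 32) descriptors (texture, phase, structure, amplitude rows).
    sigma_inv : (128, 128) inverse covariance of the flattened descriptors (assumed form).
    """
    d = (v - u).flatten(1)                                        # (batch, 128)
    return torch.sqrt(torch.einsum('bi,ij,bj->b', d, sigma_inv, d) + 1e-8)

def hard_negative_loss(anchor, positive, negatives, sigma_inv, margin=1.0):
    """Triplet-style loss with in-batch hard negative mining."""
    b, k = negatives.shape[:2]                                    # k candidate negatives per anchor
    d_pos = mahalanobis(anchor, positive, sigma_inv)              # (batch,)
    d_neg = mahalanobis(anchor.unsqueeze(1).expand(-1, k, -1, -1).reshape(b * k, 4, 32),
                        negatives.reshape(b * k, 4, 32),
                        sigma_inv).reshape(b, k)
    d_neg_hard, _ = d_neg.min(dim=1)                              # hardest (closest) negative
    return torch.clamp(d_pos - d_neg_hard + margin, min=0.0).mean()
```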
The MFFCN module is integrated into the overall registration framework as a dedicated feature extraction branch. It replaces the traditional feature extraction path, enabling the network to generate structured descriptors that better capture spatial and semantic attributes. By combining the spatially aware features from capsules with conventional texture-based features, the system achieves improved robustness to geometric deformation, modality gaps, and local distortions. Specifically, the input image patches are first processed through three convolutional modules, which progressively extract local detail features such as texture and edges while preserving the spatial structure. The resulting feature maps are then reshaped and passed into the Primary Capsule layer, where they are transformed into low-level capsule units that encode preliminary pose and structural information. Subsequently, the Digital Capsule layer aggregates these low-level capsules into structured high-dimensional feature descriptors using a dynamic routing mechanism. The final output is represented as a 4 × 32 matrix, where each row corresponds to a distinct physical attribute: texture, phase, structure, and amplitude, respectively. This process is independently conducted on both the reference image and the sensed image using two separate MFFCN branches without shared parameters. The feature descriptors output from the two branches are then compared using the Mahalanobis distance, enabling high-precision feature matching across modalities and under varying geometric distortions.

2.4. Fusion Reconstruction

In conventional image registration processes, a global transformation model is often used to describe the spatial relationship between images. However, due to the inherent differences between optical and SAR images in terms of land cover representation and imaging principles, a single model struggles to handle local distortions effectively. Local distortions induced by various surface types, such as differences in the appearance of vegetation and buildings, often introduce significant deviations in the registration process. Therefore, registering optical and SAR images requires both a global perspective and flexible local models to accommodate regional variations.
In this paper, we propose a global optical and SAR image registration method based on local distortion division. First, the distortion in different regions of the optical and SAR images is evaluated, with SLDD used to estimate image distortions and divide the image regions into normal and local distortion areas. Then, MFFCN is used to match feature descriptors in each region, and separate transformation models are constructed to obtain registration results for both types of distortion areas. The registration result of the local distortion area is denoted as conseqL, referred to as the first registration result, while the registration result of the normal area is denoted as conseqN, referred to as the second registration result. As shown in Figure 7, the first registration result provides better alignment in locally distorted regions, whereas the second registration result achieves superior performance in normal regions. Finally, based on the transformed images of these two types of regions and their corresponding superpixel segmentation results, the final transformed sensed image is obtained using a superpixel replacement strategy, as illustrated in Figure 8.
Specifically, superpixel segmentation is performed on both the first and second registration results to obtain the first and second superpixel segmentation results, respectively. The target superpixels within the local distortion area are identified based on the first superpixel segmentation result. These target superpixels are then replaced with the corresponding superpixels from the normal area, as determined from the second segmentation result, while the remaining superpixels in the general area are preserved. Through this process of fusion and reconstruction, we integrate the registration results of the two regions to generate the final registered image. It is worth noting that, since both registration results are resampled based on the reference image, there are no visible seams at the boundaries between different region categories.
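The superpixel replacement step can be sketched as follows, assuming both registration results have already been resampled onto the reference grid. The direction of the replacement follows the intent of Figure 7 (each regional model is kept where it performs best), and assigning a superpixel to the distortion class by a majority vote of its pixels is our assumption, since the text does not spell out the assignment rule.

```python
import numpy as np

def fuse_registrations(conseq_l, conseq_n, labels_l, distortion_mask):
    """Fuse the two regional registration results by superpixel replacement.

    conseq_l        : first registration result (local distortion model), on the reference grid
    conseq_n        : second registration result (normal-area model), on the reference grid
    labels_l        : superpixel label map computed on the first registration result
    distortion_mask : boolean map of the local distortion area (True = distorted)
    """
    fused = conseq_n.copy()
    for lab in np.unique(labels_l):
        mask = labels_l == lab
        # keep the locally registered content where the superpixel is mostly distorted
        if distortion_mask[mask].mean() > 0.5:                 # majority rule (assumption)
            fused[mask] = conseq_l[mask]
    return fused
```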

3. Experimental Results and Analysis

3.1. Data Description and Parameter Settings

The experiment used the following datasets: the WHU-SEN-City [40] dataset and the OS dataset [13]. The WHU-SEN-City dataset was used for network training, and the OS dataset was used for the evaluation of comparison experiments. It is worth noting that both datasets underwent a preliminary linear error correction before applying our method, which enables the proposed approach to focus more effectively on the fine correction of local distortion errors.
(1) WHU-SEN-City is a large-scale SAR-optical image dataset that covers 32 representative cities across China, with image sizes ranging from 1885 × 1733 pixels to 8925 × 4611 pixels. The dataset consists of SAR images acquired by Sentinel-1 and optical images captured by Sentinel-2. The scenes in this dataset exhibit diverse terrain characteristics, including flat regions, river boundaries, urban areas with elevation variations, and agricultural zones with rich texture features, providing structurally complex and representative samples.
(2) The OS dataset is a high-resolution SAR-optical image dataset consisting of 2376 pairs of 512 × 512 pixel image patches with a spatial resolution of 1 m. The optical images are sourced from Google Earth, while the SAR images are acquired by China’s Gaofen-3 (GF-3) satellite using spotlight imaging mode. The dataset covers structurally rich areas (such as roads and buildings) as well as structurally sparse areas (such as vegetation, bare soil, and water bodies), offering substantial terrain heterogeneity and texture diversity.
The experiments were conducted under unified training and testing settings to ensure fairness. Specifically, all deep learning-based methods, including SOHardNet, BSS-2chDCNN, and the proposed method, were trained on the WHU-SEN-City dataset using the Adam optimizer with a learning rate of 0.01. The traditional method, PSO-SIFT, was implemented strictly according to the parameter settings in its original paper. Both deep learning and traditional methods employed the RANSAC algorithm to eliminate mismatched points, with the threshold uniformly set to 0.9. All experiments were carried out on the same hardware platform, and the detailed experimental environment is listed in Table 2.
(1) PSO-SIFT [41] introduces a gradient definition aimed at reducing intensity differences between images, and proposes a robust matching algorithm that combines the position, scale, and orientation of keypoints to increase the number of correct matches.
(2) The SOHardNet [42] model is based on the original HardNet architecture and has been specifically optimized for SAR and optical image registration tasks. To improve network training effectiveness, the loss function employs a triplet margin loss strategy based on the Euclidean distance of difficult negative samples, thereby improving registration accuracy.
(3) BSS-2chDCNN [43] uses a deep learning network to optimize the accuracy of image patch matching. Through joint optimization of deep feature extraction and distance metrics, it establishes more precise correspondences between optical and SAR images.

3.2. Comparison of Registration Results

To evaluate the registration performance of the proposed method, we present the checkerboard registration results obtained using different methods, as shown in Figure 9 and Table 3. The comparison is based on the average RMSE, average registration time, average CMN and average Scat values [44].
RMSE is one of the most commonly used accuracy metrics in image registration. It measures the average Euclidean distance error between matched points before and after registration.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ \left(x_i - x_i'\right)^{2} + \left(y_i - y_i'\right)^{2} \right]}$$
where N denotes the total number of matched point pairs used for registration evaluation, and (x_i, y_i) and (x_i', y_i') represent the coordinates of the i-th matched point in the reference image and the sensed image, respectively.
Scat is used to evaluate the spatial uniformity of matched points across the image. This metric reflects whether the matched points are representative and globally distributed over the entire image, serving as an important complementary indicator to assess the spatial coverage of registration accuracy. Given a set of matched point coordinates in the reference image { x 1 , x 2 , , x N } , the Scat value is computed as:
$$\mathrm{Scat} = \frac{1}{N} \sum_{i=1}^{N} M\left( \left\{ \left\| x_i - x_j \right\| \right\}_{j=1,2,\ldots,N} \right)$$
where M ( · ) denotes the median operator applied to each distance set. All coordinates are normalized before computation. A higher S cat value indicates that the matched points are more evenly distributed across the image, while a lower value suggests spatial clustering.
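For reference, the two metrics can be computed as in the sketch below. Whether a point's zero distance to itself enters the median set of Scat is not specified, so the sketch excludes it, and normalizing coordinates by the image size is our assumption; the function names are illustrative.

```python
import numpy as np

def rmse(ref_pts, sen_pts):
    """Root mean squared error between matched point pairs, in pixels."""
    diff = np.asarray(ref_pts, float) - np.asarray(sen_pts, float)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

def scat(ref_pts, image_shape):
    """Spatial scattering of matched points: for each point, the median distance
    to the other points (on normalised coordinates), averaged over all points."""
    pts = np.asarray(ref_pts, float) / np.asarray(image_shape, float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)   # pairwise distances
    return float(np.mean([np.median(np.delete(dists[i], i)) for i in range(len(pts))]))
```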
To intuitively display the registration results, we randomly selected 12 pairs of test images, including 6 pairs with distinct building elevation changes and strong scattering regions. All images exhibit localized distortions caused by terrain undulations. It is important to note that the proposed method is specifically intended to address localized distortions in suburban block-structured terrains (e.g., levees and engineered structures) and urban areas with localized elevation variations (e.g., building complexes and bridges). These scenarios are typically characterized by pronounced structural regularity, such as right-angled corners of man-made buildings and planar structures that facilitate the formation of stable superpixel boundaries, and moderate texture complexity, balancing homogeneous areas with high-contrast edges. This structural environment enables the reliable extraction of superpixel geometric centroids while preserving semantic consistency. The scenes in Figure 10(a1–f1,a2–f2) exhibit significant terrain undulations at field boundaries, water areas, forests, rivers, and urban clusters. In Figure 11(a1–f1,a2–f2), due to differences in surface material scattering coefficients, ground object scenes not only exhibit geometrically distorted areas but also contain regions of strong scattering to varying degrees. Furthermore, owing to the side-looking imaging mechanism of SAR, buildings with elevation variations display significant geometric distortions, as shown in Figure 11(b1,d1,f1,b2,d2,f2).
As shown in Figure 12(a1–a4), PSO-SIFT achieved good registration results in flat areas but exhibited mismatches in the yellow-framed regions with terrain undulation. Due to similarities between image blocks, BSS-2chDCNN and SOHardNet also experienced mismatches during feature matching. As shown in the yellow-framed areas in Figure 12(b1–b4), PSO-SIFT, BSS-2chDCNN, and SOHardNet all produced mismatches in the locally distorted undulating regions, affecting the overall registration results. In Figure 12(c1–c4), local distortion regions occupy a larger portion of the global image, with only our proposed method achieving satisfactory registration. PSO-SIFT, BSS-2chDCNN, and SOHardNet all experienced mismatches in the yellow-framed areas. In Figure 12(d1–d4), PSO-SIFT shows considerable error, particularly at the boundaries between block-shaped field areas and roads, where local distortions were not effectively handled. Overall, BSS-2chDCNN performs slightly better than PSO-SIFT, but is still affected by locally similar regions, leading to reduced registration accuracy in field areas. SOHardNet achieves relatively good overall registration, but still has some limitations in finer registration details. In Figure 12(e1–e4), the presence of urban building clusters significantly affects PSO-SIFT’s feature matching accuracy. Although BSS-2chDCNN and SOHardNet show some improvement over PSO-SIFT overall, they are still limited by their single registration model, resulting in notable registration errors in local areas. In Figure 12(f1–f4), local distortions due to building and vegetation undulations caused mismatches in PSO-SIFT, BSS-2chDCNN, and SOHardNet, achieving acceptable accuracy in only some areas rather than the entire scene. In scenarios with local distortions, only our proposed method achieved satisfactory registration results in both distorted and non-distorted areas.
Figure 13 shows the registration results for six image pairs containing significant building elevation variations and strong scattering regions. PSO-SIFT exhibited mismatches within the yellow-framed regions, while BSS-2chDCNN and SOHardNet, constrained by a single registration model, achieved higher accuracy only in normal regions, as shown in Figure 13(a1–a4). In Figure 13(b1–b4), the building regions exhibit significant perspective shrinkage, causing PSO-SIFT, BSS-2chDCNN, and SOHardNet to fail in registration, thereby affecting the overall registration results of the global scene. As shown in Figure 13(c1–c4), the inevitable nonlinear differences between optical and SAR images impact the registration model calculation, even when local distortion areas occupy a small portion of the global image, leading to decreased registration accuracy. Figure 13(d1–d4) show a composite scene with numerous local distortion and normal regions, where PSO-SIFT fails to achieve good control point matching, resulting in suboptimal overall registration. BSS-2chDCNN and SOHardNet also exhibit decreased accuracy in local detail registration due to their lack of effective handling for local distortions. Similarly, in the composite scene of Figure 13(e1–e4), our method effectively handles local distortions within the global scene, achieving optimal registration results at both the water-land boundaries and building cluster areas. Figure 13(f1–f4) show that although PSO-SIFT, BSS-2chDCNN, and SOHardNet achieve relatively good overall registration, the lack of local distortion handling results in difficulties estimating an accurate transformation model, causing registration misalignment.
From the analysis of the registration results above, it is evident that while PSO-SIFT shows certain robustness to scale and rotation, it remains sensitive to radiometric differences and elevation variations, limiting accurate matching in areas with significant terrain variation. BSS-2chDCNN performs well in flat areas and low-texture scenes; however, its reliance on a single model based on global features limits its adaptability to terrain elevation changes, and it is also susceptible to texture repetition. These limitations further restrict its overall registration performance. SOHardNet achieves high registration accuracy in flat areas with complex textures; however, its feature extraction and matching process exhibit limited adaptability to local distortions. It does not fully account for the effects of image distortions caused by elevation changes, resulting in unstable performance. Our proposed method not only achieves superior registration in normal areas but also demonstrates high registration accuracy in local distortion regions by applying separate registration to areas with varying degrees of distortion.

4. Discussions

4.1. The Impact of Multi-Feature Fusion Strategies

In optical and SAR image registration, both shallow texture features and deep semantic features serve distinct yet complementary roles. To assess the effectiveness of the proposed feature descriptors, we conducted an ablation study. As illustrated in Figure 14, we compare the architecture of a conventional capsule network with our proposed design. While traditional capsule networks construct primary capsules solely from shallow features, our approach incorporates both shallow texture and deep semantic features, thereby enriching the primary capsules with more comprehensive and robust information.
As demonstrated in Table 4, the multi-feature fusion capsule network maintains comparable computational efficiency while achieving superior registration accuracy over the traditional counterpart. These findings underscore that the fundamental modality disparities between optical and SAR imaging systems lead to inherent misalignments in shallow feature correspondence. Such discrepancies, exacerbated by local geometric distortions and sensor noise, often result in registration mismatches or performance instability. The quantitative analysis results in this study provide compelling evidence that the introduction of a multimodal feature fusion mechanism enables capsule network architectures to exhibit significant intrinsic advantages in cross-modal registration tasks between optical and SAR images. The MFFCN innovatively integrates texture, phase, structural, and amplitude information to construct a more comprehensive feature representation space, effectively addressing feature aliasing, information loss, and semantic ambiguity issues present in traditional CNN feature descriptors. In conjunction with the results presented in Figure 9, this further confirms that our improvements to the capsule architecture not only enhance registration accuracy, but also improve geometric structural consistency and the preservation of scattering center distributions. In conclusion, capsule network architectures are well-suited for optical-to-SAR image registration, and when augmented with a multimodal feature fusion strategy, they demonstrate superior performance in complex scenes involving local geometric distortions.

4.2. Impact of Local Distortion Region Division

Traditional registration methods often assume a globally consistent transformation across the entire image, ignoring the spatial heterogeneity caused by local geometric distortions. This simplified assumption may lead to cumulative registration errors in complex scenarios such as suburban transition zones and mountainous urban areas. To address this issue, this paper proposes a superpixel-based local distortion region division strategy. The core idea lies in quantifying the degree of local geometric distortion through cross-modal feature measurement of superpixel blocks. Based on this assessment, the image to be registered is divided into locally distorted regions and general regions, for which separate transformation matrices are constructed. To verify the superiority and effectiveness of the proposed strategy, we adopt Edge Fidelity (EF) as a quantitative metric to evaluate the preservation of registration detail. In addition, ablation experiments are conducted to compare the performance of MFFCN with and without the SLDD strategy.
EF evaluates the preservation of details in the connection regions of images, reflecting the consistency of edge information and the presence of noticeable discontinuities or distortions. For optical and SAR images, the normalized edge fidelity is used to measure registration details, and the calculation formula is as follows:
$$\mathrm{EF} = \frac{\left| G_{1,\mathrm{norm}} - G_{2,\mathrm{norm}} \right|}{G_{1,\mathrm{norm}} + G_{2,\mathrm{norm}} + \epsilon}$$
where G_{1,norm} and G_{2,norm} represent the normalized gradient magnitudes of the two images, and ϵ is a small positive number used to prevent division by zero. A normalized edge fidelity value closer to 0 indicates better consistency and detail preservation at the junctions of local regions in the registered images.
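The sketch below reflects one plausible reading of this definition: gradient magnitudes from simple finite differences, min-max normalization, and pixel-wise averaging of the relative gradient difference over the image. These choices, and the function name, are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def edge_fidelity(img1, img2, eps=1e-8):
    """Normalised edge fidelity between two registered images (lower is better)."""
    def grad_mag(img):
        gy, gx = np.gradient(img.astype(np.float64))       # simple finite differences
        g = np.hypot(gx, gy)
        return (g - g.min()) / (g.max() - g.min() + eps)   # min-max normalisation

    g1, g2 = grad_mag(img1), grad_mag(img2)
    return float(np.mean(np.abs(g1 - g2) / (g1 + g2 + eps)))
```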
The EF comparison results shown in Table 5 indicate that our method achieves the lowest average EF value among all compared approaches. This clearly demonstrates the superiority of our method in terms of edge continuity and structural consistency in registration details. These results validate the effectiveness of the proposed SLDD strategy and also reflect the robustness of MFFCN in handling optical and SAR image registration.
Local structural misalignments in registration results are typically caused by terrain undulation, cross-resolution differences, and sensor-specific distortions. As further confirmed by the results in Table 3 and Table 6, although using a single global transformation model may improve computational efficiency, it often leads to significant misregistration in regions with abrupt elevation changes or large sensor viewpoint differences—due to its lack of local adaptability. In contrast, our method achieves a better balance between global efficiency and local precision by constructing separate transformation models for different region types.
To further assess the statistical significance of the improvement in edge fidelity, we conducted paired t-tests comparing the EF values of our method with those of three baseline methods across 12 pairs of test images. As shown in Table 7, all comparisons exhibit statistically decisive evidence, indicating that the observed improvements in registration detail are highly significant and not due to random variation.
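The paired t-test itself is standard; a minimal sketch using scipy.stats.ttest_rel on per-image EF values of two methods is given below (the inputs are supplied by the caller and are not the values reported in Table 7).

```python
import numpy as np
from scipy import stats

def compare_ef(ef_ours, ef_baseline):
    """Paired t-test on per-image EF values of two methods over the same test pairs."""
    result = stats.ttest_rel(np.asarray(ef_ours, float), np.asarray(ef_baseline, float))
    return result.statistic, result.pvalue
```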
Specifically, the differences in terrain complexity between optical and SAR images limit the effectiveness of conventional global transformation models in addressing local distortions. The proposed SLDD module adaptively identifies and segments locally distorted regions. Additionally, by leveraging the architectural strengths of capsule networks, the MFFCN is able to extract informative features from both optical and SAR images, enhancing local adaptability and achieving superior performance in terms of both registration detail preservation and overall alignment consistency. As illustrated in Figure 15 and Figure 16, the proposed local distortion region division strategy significantly improves registration results, particularly in challenging areas such as water-land boundaries, terrain variations, and building elevation changes—outperforming methods that apply a single transformation model without distortion-aware region division. Moreover, the registration accuracy in normal regions, such as road networks and farmland boundaries, is also improved. Figure 17 shows the violin plots of EF distributions for all compared methods. It can be seen that, compared with global transformation-based approaches, our method achieves higher registration accuracy by enabling differentiated treatment of locally distorted and regular regions, making it particularly effective for complex multimodal scenarios involving optical and SAR images.

5. Conclusions

This paper proposes a global optical and SAR image registration method based on local distortion division, aiming to address the decline in registration accuracy and insufficient detail precision caused by local geometric distortions resulting from elevation variations and imaging condition changes. We introduce the SLDD method based on superpixel features to divide the image into different distortion regions and employ MFFCN to perform feature extraction and transformation model calculation for each region. The proposed MFFCN integrates shallow and deep features, reconstructing dimensions to ensure that the digital capsule descriptors encapsulate more comprehensive and explicit registration feature information. This effectively addresses feature aliasing, information loss, and semantic ambiguity issues present in traditional CNN feature descriptors. In addition, the proposed loss function based on hard negative mining further enhances the model’s registration accuracy. Finally, by fusing and reconstructing the registration results of each region, the global registration image is generated. Experimental results demonstrate that the proposed method outperforms existing advanced methods on datasets with local distortion regions, validating its effectiveness and reliability.
Although the proposed method demonstrates promising registration accuracy and robustness across multiple datasets, several limitations remain. For instance, the current distortion region partitioning module relies on the continuity of superpixel generation and statistical thresholding, which may introduce instability in low-contrast homogeneous areas, such as farmland. In future work, we plan to explore a novel strategy based on graph convolutional networks (GCNs) to model complex spatial topologies and semantic relationships in non-Euclidean spaces. This approach is expected to better handle large-scale image scenes with complex structures and irregular spatial relationships, and to extend the applicability of the method to the registration of more diverse multimodal datasets. In conclusion, the proposed framework enables the segmentation of local distortion regions in optical and SAR images and applies different geometric transformation models to different regions. It effectively addresses the issue of local misalignment induced by single-model registration strategies and provides a new pathway for optical-to-SAR image registration.

Author Contributions

Conceptualization, B.L. and D.G.; Methodology, B.L., D.G. and Y.X.; Software, D.G., Y.X., X.Z., W.Z. and Z.C.; Resources, X.Z., L.P. and D.X.; Writing, B.L. and D.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Khan, K.; Khan, S.N.; Ali, A.; Khokhar, M.F.; Khan, J.A. Estimating Aboveground Biomass and Carbon Sequestration in Afforestation Areas Using Optical/SAR Data Fusion and Machine Learning. Remote Sens. 2025, 17, 934. [Google Scholar] [CrossRef]
  2. Zhang, W.; Mei, J.; Wang, Y. DMDiff: A Dual-Branch Multimodal Conditional Guided Diffusion Model for Cloud Removal Through SAR-Optical Data Fusion. Remote Sens. 2025, 17, 965. [Google Scholar] [CrossRef]
  3. Ozdemir, E.G.; Abdikan, S. Forest Aboveground Biomass Estimation in Küre Mountains National Park Using Multifrequency SAR and Multispectral Optical Data with Machine-Learning Regression Models. Remote Sens. 2025, 17, 1063. [Google Scholar] [CrossRef]
  4. Xu, W.; Yuan, X.; Hu, Q.; Li, J. SAR-optical feature matching: A large-scale patch dataset and a deep local descriptor. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103433. [Google Scholar] [CrossRef]
  5. Ye, Y.; Yang, C.; Zhu, B.; Zhou, L.; He, Y.; Jia, H. Improving co-registration for sentinel-1 SAR and sentinel-2 optical images. Remote Sens. 2021, 13, 928. [Google Scholar] [CrossRef]
  6. Chen, J.; Xie, H.; Zhang, L.; Hu, J.; Jiang, H.; Wang, G. SAR and optical image registration based on deep learning with co-attention matching module. Remote Sens. 2023, 15, 3879. [Google Scholar] [CrossRef]
  7. Zampieri, A.; Charpiat, G.; Girard, N.; Tarabalka, Y. Multimodal image alignment through a multiscale chain of neural networks with application to remote sensing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 657–673. [Google Scholar]
  8. Sommervold, O.; Gazzea, M.; Arghandeh, R. A survey on SAR and optical satellite image registration. Remote Sens. 2023, 15, 850. [Google Scholar] [CrossRef]
  9. Zhang, H.; Lei, L.; Ni, W.; Cheng, K.; Tang, T.; Wang, P.; Kuang, G. Registration of Large Optical and SAR Images with Non-Flat Terrain by Investigating Reliable Sparse Correspondences. Remote Sens. 2023, 15, 4458. [Google Scholar] [CrossRef]
  10. Xiang, D.; Xie, Y.; Cheng, J.; Xu, Y.; Zhang, H.; Zheng, Y. Optical and SAR image registration based on feature decoupling network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5235913. [Google Scholar] [CrossRef]
  11. Ye, Y.; Shan, J.; Bruzzone, L.; Shen, L. Robust registration of multimodal remote sensing images based on structural similarity. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2941–2958. [Google Scholar] [CrossRef]
  12. Xiang, Y.; Tao, R.; Wan, L.; Wang, F.; You, H. OS-PC: Combining feature representation and 3-D phase correlation for subpixel optical and SAR image registration. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6451–6466. [Google Scholar] [CrossRef]
  13. Xiang, Y.; Tao, R.; Wang, F.; You, H.; Han, B. Automatic registration of optical and SAR images via improved phase congruency model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5847–5861. [Google Scholar] [CrossRef]
  14. Ye, Y.; Bruzzone, L.; Shan, J.; Bovolo, F.; Zhu, Q. Fast and robust matching for multimodal remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9059–9070. [Google Scholar] [CrossRef]
  15. Paul, S.; Pati, U.C. SAR image registration using an improved SAR-SIFT algorithm and Delaunay-triangulation-based local matching. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2958–2966. [Google Scholar] [CrossRef]
  16. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  17. Dellinger, F.; Delon, J.; Gousseau, Y.; Michel, J.; Tupin, F. SAR-SIFT: A SIFT-like algorithm for SAR images. IEEE Trans. Geosci. Remote Sens. 2014, 53, 453–466. [Google Scholar] [CrossRef]
  18. Yu, Q.; Ni, D.; Jiang, Y.; Yan, Y.; An, J.; Sun, T. Universal SAR and optical image registration via a novel SIFT framework based on nonlinear diffusion and a polar spatial-frequency descriptor. ISPRS J. Photogramm. Remote Sens. 2021, 171, 1–17. [Google Scholar] [CrossRef]
  19. Xiang, Y.; Wang, F.; You, H. OS-SIFT: A robust SIFT-like algorithm for high-resolution optical-to-SAR image registration in suburban areas. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3078–3090. [Google Scholar] [CrossRef]
  20. Jia, L.; Dong, J.; Huang, S.; Liu, L.; Zhang, J. Optical and SAR image registration based on multi-scale orientated map of phase congruency. Electronics 2023, 12, 1635. [Google Scholar] [CrossRef]
  21. Wang, Y.; Yu, X.; Zhang, Y.; Pei, J.; Huo, W.; Huang, Y.; Yang, J. An adaptive SAR and optical images registration approach based on SOI-SIFT. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 2582–2585. [Google Scholar]
  22. Sun, X.; Yun, Z.; Hu, C.; Chen, H.; Yang, C. SPLM-Net: Large Scene SAR Image Registration Based on Point and Line Matching Network. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 7371–7374. [Google Scholar]
  23. Liao, Y.; Di, Y.; Zhou, H.; Li, A.; Liu, J.; Lu, M.; Duan, Q. Feature matching and position matching between optical and SAR with local deep feature descriptor. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 15, 448–462. [Google Scholar] [CrossRef]
  24. Merkle, N.; Luo, W.; Auer, S.; Müller, R.; Urtasun, R. Exploiting deep matching and SAR data for the geo-localization accuracy improvement of optical satellite images. Remote Sens. 2017, 9, 586. [Google Scholar] [CrossRef]
  25. Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote sensing image registration using convolutional neural network features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236. [Google Scholar] [CrossRef]
  26. Zhang, H.; Ni, W.; Yan, W.; Xiang, D.; Wu, J.; Yang, X.; Bian, H. Registration of multimodal remote sensing image based on deep fully convolutional neural network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3028–3042. [Google Scholar] [CrossRef]
  27. Liu, Y.; Lin, M.; Mo, Y.; Wang, Q. SAR-Optical Image Matching Using Self-Supervised Detection and A Transformer-CNN-Based Network. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4002505. [Google Scholar] [CrossRef]
  28. Quan, D.; Wei, H.; Wang, S.; Lei, R.; Duan, B.; Li, Y.; Hou, B.; Jiao, L. Self-distillation feature learning network for optical and SAR image registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4706718. [Google Scholar] [CrossRef]
  29. Ye, Y.; Tang, T.; Zhu, B.; Yang, C.; Li, B.; Hao, S. A multiscale framework with unsupervised learning for remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622215. [Google Scholar] [CrossRef]
  30. Liu, L.; Wang, Y.; Peng, J.; Zhang, L. GLR-CNN: CNN-based framework with global latent relationship embedding for high-resolution remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5633913. [Google Scholar] [CrossRef]
  31. Xiang, Y.; Jiang, L.; Wang, F.; You, H.; Qiu, X.; Fu, K. Detector-free feature matching for optical and SAR images based on a two-step strategy. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5214216. [Google Scholar] [CrossRef]
  32. Ye, Y.; Zhu, B.; Tang, T.; Yang, C.; Xu, Q.; Zhang, G. A robust multimodal remote sensing image registration method and system using steerable filters with first-and second-order gradients. ISPRS J. Photogramm. Remote Sens. 2022, 188, 331–350. [Google Scholar] [CrossRef]
  33. Xiang, D.; Ding, H.; Sun, X.; Cheng, J.; Hu, C.; Su, Y. PolSAR image registration combining Siamese multiscale attention network and joint filter. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5208414. [Google Scholar] [CrossRef]
  34. Li, B.; Guan, D.; Zheng, X.; Chen, Z.; Pan, L. SD-CapsNet: A Siamese Dense Capsule Network for SAR Image Registration with Complex Scenes. Remote Sens. 2023, 15, 1871. [Google Scholar] [CrossRef]
  35. Bentoutou, Y.; Taleb, N.; Kpalma, K.; Ronsin, J. An automatic image registration for applications in remote sensing. IEEE Trans. Geosci. Remote Sens. 2005, 43, 2127–2137. [Google Scholar] [CrossRef]
  36. Zhang, W.; Xiang, D.; Su, Y. Fast multiscale superpixel segmentation for SAR imagery. IEEE Geosci. Remote Sens. Lett. 2020, 19, 4001805. [Google Scholar] [CrossRef]
  37. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef]
  38. Sun, Y.; Lei, L.; Li, X.; Tan, X.; Kuang, G. Structure consistency-based graph for unsupervised change detection with homogeneous and heterogeneous remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4700221. [Google Scholar] [CrossRef]
  39. Nešetřil, J.; Nešetřilová, H. The origins of minimal spanning tree algorithms–Borůvka and Jarník. Doc. Math. 2012, 127–141. [Google Scholar] [CrossRef]
  40. Wang, L.; Xu, X.; Yu, Y.; Yang, R.; Gui, R.; Xu, Z.; Pu, F. SAR-to-optical image translation using supervised cycle-consistent adversarial networks. IEEE Access 2019, 7, 129136–129149. [Google Scholar] [CrossRef]
  41. Ma, W.; Wen, Z.; Wu, Y.; Jiao, L.; Gong, M.; Zheng, Y.; Liu, L. Remote sensing image registration with modified SIFT and enhanced feature matching. IEEE Geosci. Remote Sens. Lett. 2016, 14, 3–7. [Google Scholar] [CrossRef]
  42. Bürgmann, T.; Koppe, W.; Schmitt, M. Matching of TerraSAR-X derived ground control points to optical image patches using deep learning. ISPRS J. Photogramm. Remote Sens. 2019, 158, 241–248. [Google Scholar] [CrossRef]
  43. Fan, D.; Yang, D.; Zhang, Y. Satellite image matching method based on deep convolutional neural network. J. Geod. Geoinf. Sci. 2019, 2, 90. [Google Scholar]
  44. Gonçalves, H.; Gonçalves, J.A.; Corte-Real, L. Measures for an objective evaluation of the geometric correction process quality. IEEE Geosci. Remote Sens. Lett. 2009, 6, 292–296. [Google Scholar] [CrossRef]
Figure 1. Diagram of Local Region Registration Error.
Figure 2. Schematic of the proposed method.
Figure 3. Superpixel feature diagram and Distortion Region Division Results. (a1–f1) Visualization of superpixel feature vectors. (a2–f2) Heatmap of superpixel feature regions in the Sensed images. (a3–f3) Division Result.
Figure 4. Superpixel feature diagram and Distortion Region Division Results. (a1–f1) Visualization of superpixel feature vectors. (a2–f2) Heatmap of superpixel feature regions in the Sensed images. (a3–f3) Division Result.
Figure 5. Dynamic Routing.
Figure 6. Schematic Diagram of the Proposed MFFCN.
Figure 7. Examples of registration results for two types of regions using independent transformation models. The green box represents a local distortion area, while the red box represents a normal area. (a) Checkerboard pattern of ConseqL in the local distortion area. (b) Checkerboard pattern of ConseqN in the normal area.
Figure 8. Illustrative Diagram of Fusion Reconstruction.
Figure 9. Time-Accuracy Plot for Four Methods.
Figure 10. Test image pairs with local distortions caused by terrain undulations. (a1–f1) are optical images. (a2–f2) are SAR images.
Figure 11. Test image pairs in areas with significant building elevation variations and strong scattering regions. (a1–f1) are optical images. (a2–f2) are SAR images.
Figure 12. Checkerboard Overlays of Different Methods. (a1–f1) Registration results of PSO-SIFT. (a2–f2) Registration results of BSS-2chDCNN. (a3–f3) Registration results of SOHardNet. (a4–f4) Registration results of the proposed method.
Figure 13. Checkerboard overlays for different methods in areas with significant building elevation variations and strong scattering regions. (a1–f1) Registration results of PSO-SIFT. (a2–f2) Registration results of BSS-2chDCNN. (a3–f3) Registration results of SOHardNet. (a4–f4) Registration results of the proposed method.
Figure 14. A Comparative Illustration of Capsule Network Structures.
Figure 15. Connection diagrams of local details of different methods. (a–f) The first row of each column shows PSO-SIFT, the second row shows BSS-2chDCNN, the third row shows SOHardNet, and the last row shows ours.
Figure 16. Connection diagrams of local details of different methods in regions with significant building height variations and strong scattering. (a–f) The first row of each column shows PSO-SIFT, the second row shows BSS-2chDCNN, the third row shows SOHardNet, and the last row shows ours.
Figure 17. The violin plot of EF for registration results of different methods.
Table 1. The specific parameters of the MFFCN structure.

Network Module | Network Layer | Input Size | Output Size
Input layer | - | 32 × 32 × 1 | -
Conv1 | Conv(3×3), stride(1) | 32 × 32 × 1 | 32 × 32 × 32
Conv2 | Conv(3×3), stride(1) | 32 × 32 × 32 | 32 × 32 × 64
Conv3 | Conv(3×3), stride(1) | 32 × 32 × 64 | 32 × 32 × 128
Primary Capsule | Conv(9×9), stride(2) | 32 × 32 × 256 | 12 × 12 × 256
Primary Capsule | Reshape | 12 × 12 × 256 | 8 × 4032
Digital Capsule | Dynamic Routing | 8 × 4032 | 4 × 32
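For readers unfamiliar with the Dynamic Routing layer listed in Table 1, the following sketch shows the standard routing-by-agreement procedure commonly used in capsule networks. It is a generic PyTorch illustration: the prediction-tensor shape and the number of routing iterations are assumptions, and any MFFCN-specific modifications of the routing step are not reflected here.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Squashing non-linearity: preserves direction, maps the norm into (0, 1)."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Route predictions u_hat of shape (batch, num_in, num_out, dim_out)
    from primary capsules to digital capsules by agreement."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # coupling logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2).unsqueeze(-1)                # coupling coefficients
        v = squash((c * u_hat).sum(dim=1))                   # (batch, num_out, dim_out)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update
    return v
```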
Table 2. The basic experimental environment settings.

Platform | Windows 10
Torch | V 1.10.1
Matlab | 2019a
CPU | Intel(R) Core(TM) i5-13400
GPU | Nvidia GeForce RTX 3060
Video memory | 12 GB
Table 3. Comparison of average root mean square error (RMSE), registration time, normalized Scat, and CMN for different methods.

Method | Average RMSE | Average Time (s) | Average Scat | Average CMN
PSO-SIFT | 1.51 | 22.08 | 0.186 | 7
BSS-2chDCNN | 1.30 | 6.67 | 0.252 | 13
SOHardNet | 1.08 | 15.44 | 0.283 | 19
Ours | 0.85 | 13.58 | 0.418 | 28
Table 4. Ablation study results of multi-feature fusion strategy.

Method | Average RMSE | Average Time (s)
Traditional Capsule Network | 1.39 | 3.36
Multi-Feature Fusion Capsule Network | 1.03 | 3.50
Table 5. Normalized edge fidelity (EF) results of different methods.

Method | Image 1 | Image 2 | Image 3 | Image 4 | Image 5 | Image 6 | Image 7 | Image 8 | Image 9 | Image 10 | Image 11 | Image 12 | Average
PSO-SIFT | 0.4838 | 0.5086 | 0.4754 | 0.4014 | 0.5373 | 0.4454 | 0.5301 | 0.4542 | 0.4817 | 0.5050 | 0.4639 | 0.4747 | 0.4801
BSS-2chDCNN | 0.4104 | 0.4449 | 0.4806 | 0.4472 | 0.5476 | 0.4469 | 0.4529 | 0.4378 | 0.4613 | 0.4706 | 0.4570 | 0.4812 | 0.4615
SOHardNet | 0.4252 | 0.4166 | 0.4350 | 0.4245 | 0.5005 | 0.4601 | 0.4725 | 0.4076 | 0.4337 | 0.4588 | 0.4387 | 0.4915 | 0.4470
Ours | 0.3948 | 0.4018 | 0.3986 | 0.3852 | 0.4831 | 0.4230 | 0.4388 | 0.3847 | 0.4099 | 0.4285 | 0.4083 | 0.4524 | 0.4174
Table 6. Ablation study results of distortion region division.

Method | Average RMSE | Average Time (s)
MFFCN Without SLDD | 1.03 | 3.50
MFFCN With SLDD | 0.85 | 13.58
Table 7. Paired t-test results of normalized edge fidelity (EF) compared with the proposed method.

Method | Mean EF | Std Dev | t-Value vs. Ours | p-Value | Significance
PSO-SIFT | 0.4801 | 0.0361 | 7.3854 | <0.0001 | ***
BSS-2chDCNN | 0.4615 | 0.0319 | 7.4015 | <0.0001 | ***
SOHardNet | 0.4470 | 0.0284 | 12.4169 | <0.0001 | ***
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
