1. Introduction
Optical remote sensing and Synthetic Aperture Radar (SAR) technology are the two main remote sensing methods for obtaining information about the Earth’s surface [1,2,3]. Optical remote sensing generates images by recording the reflection or radiation of visible light, while SAR probes the surface with microwave signals to generate images [4]. Because optical and SAR sensors operate in different electromagnetic bands and rely on different physical imaging mechanisms, the images they produce exhibit distinct features and advantages. Optical images provide high-resolution spatial information and rich spectral information, enabling precise identification of ground features; however, they cannot be acquired under cloudy or nighttime conditions [5]. In contrast, SAR can penetrate clouds and observe at night, offering all-weather observation capability [6]. However, terrain and surface cover (such as buildings, trees, and water bodies) affect the propagation and reflection of microwave signals, leading to multipath reflection and scattering that cause distortion and interference in the images [7]. These factors make the registration of optical and SAR images an ongoing and challenging research problem [8].
Image registration methods are generally divided into two categories: region-based methods and feature-based methods [9]. Region-based methods evaluate the similarity of image blocks using template matching techniques. Classic similarity measures include mutual information, normalized cross-correlation, and cumulative residual entropy [10]. Ye et al. [11] replaced gradient information with phase congruency information to construct the histogram of orientated phase congruency (HOPC) descriptor, improving the registration accuracy of multimodal images. Xiang et al. [12] mapped 3D image features and employed phase correlation methods for image registration. Xiang et al. [13] further improved the phase congruency model. In summary, incorporating structural features significantly enhances the registration accuracy of template matching methods. However, these methods often overlook the geometric relationships and spatial structures between local regions, hindering their ability to handle geometric changes caused by local distortions, which ultimately degrades matching performance. Additionally, in certain algorithms, local features may be overshadowed by dominant global features, diminishing their contribution during registration. These limitations pose significant challenges for region-based methods in addressing complex distortions [14].
Compared to region-based global matching methods, feature-based methods incur lower computational costs [15]. They rely on key points or salient features for registration, which is particularly advantageous for large-scale images and significantly improves registration efficiency. Feature-based methods mainly involve four steps: control point detection, feature description, feature matching, and image transformation. The Scale-Invariant Feature Transform (SIFT) and its variants are the most common handcrafted descriptors [16]. Dellinger et al. [17] proposed a gradient computation method specifically designed for SAR images, which is robust to speckle noise. Yu et al. [18] designed a rotation-invariant amplitudes of log-Gabor orientation histograms (RI-ALGH) descriptor and proposed an improved SAR and optical image registration algorithm based on a nonlinear SIFT framework. The OS-SIFT algorithm proposed in [19] uses the Sobel and ROEWA operators to compute the gradient directions and magnitudes of optical and SAR images, respectively, thereby enabling image registration within the classic SIFT framework. Jia et al. [20] proposed a registration algorithm based on multi-scale directional phase congruency mapping (MSPCO), which better preserves edge and texture information. Wang et al. [21] proposed the scale-invariant SOI-SIFT algorithm, which enhances the edge information in optical and SAR images and then uses the ROEWA operator to construct SIFT-like feature descriptors. Although these strategies can generally achieve a certain degree of point matching on a global scale, significant positional shifts caused by distortions in local regions remain difficult to eliminate. Such a shift is not merely a pixel-level error but a mismatch caused by systematic differences in image geometry. For instance, in local regions such as ridgelines or edges with steep terrain undulations, slight variations in control point positions may produce mismatched point pairs that are difficult to rectify through global optimization strategies. Furthermore, such local shifts can accumulate in subsequent geometric transformation and registration steps, ultimately reducing the overall registration accuracy. Therefore, to effectively address point mismatches caused by local distortions, more robust point matching methods are needed to improve the reliability and accuracy of optical and SAR image registration.
In response to the above issues, deep learning-based optical and SAR image registration methods have garnered significant attention in the research community [22,23]. Merkle et al. [24] introduced a Siamese feature extraction network employing dilated convolutions to estimate feature similarity by correlating 2D feature vectors, representing the first effort in deep learning-based optical and SAR template image matching. Ye et al. [25] used a pre-trained VGG16 model to extract CNN features and combined them with SIFT features to construct a joint feature for optical and SAR image registration. Zhang et al. [26] proposed a training strategy that addresses the significant impact of sample selection during matching network training, enhancing training performance by maximizing the feature distance between positive samples and easily confusable negative samples. Liu et al. [27] proposed a dual-branch Transformer-CNN matching network for extracting multi-scale image features. Quan et al. [28] proposed self-distillation learning and ordered similarity metrics, leveraging hybrid Siamese and pseudo-Siamese networks to improve the learning of isomorphic features from optical and SAR heterogeneous images. Ye et al. [29] proposed an unsupervised image registration method that demonstrates stronger robustness when handling images with geometric and radiometric distortions.
However, local distortions alter the feature representation of the same object, impacting feature extraction and matching capabilities. Moreover, because CNNs perform convolution and pooling layer by layer during feature extraction, the features captured at different layers often influence one another, leading to issues such as feature aliasing, information loss, and semantic ambiguity [30]. Specifically, in traditional CNNs, shallow and deep features may blend within the same feature descriptor, making it difficult for the model to distinguish information at different levels and resulting in feature aliasing. Certain details may also be discarded in pooling layers, compromising feature integrity and leading to information loss. Furthermore, the mixed features leave the final feature representation unclear in optical and SAR registration tasks, affecting model performance and resulting in semantic ambiguity.
Most importantly, the majority of the above registration methods follow the global registration pipeline of feature detection, feature description, descriptor matching, and transformation estimation [31,32,33], with only the feature extraction or matching computation replaced. Local distortions may cause the same ground object to exhibit inconsistent geometric features at different locations or scales. For example, terrain undulations or occlusions can change the shape, size, or position of the same ground object in satellite images, making the correspondence between control points or feature regions ambiguous or uncertain and thus reducing registration accuracy [34]. Moreover, local distortions increase the complexity and uncertainty of the transformation parameters, making parameter estimation during registration more difficult. A transformation model is typically assumed to describe the geometric relationship between images. For flat terrain, where surface elevation changes can be ignored, affine or projective transformations [35] and other linear models can describe the spatial mapping relationship. However, when terrain elevation varies significantly, images captured under different imaging geometries no longer satisfy a linear mapping relationship. Local distortions may render these assumed transformation models inaccurate, failing to capture the true transformation between images and thus degrading registration accuracy. In optical and SAR image registration in particular, the slant-range imaging mechanism of SAR sensors generates significant local geometric distortions under terrain undulation, such as foreshortening and shadowing, making it difficult to describe the spatial mapping between optical and SAR images through explicit geometric relationships. Moreover, as image spatial resolution increases, even very small surface elevation changes may cause significant local geometric distortions. As shown in Figure 1, after spatial registration using a global geometric transformation, the lower part of the image (blue rectangular annotation) is accurately registered, while the upper part (red rectangular annotation) exhibits noticeable registration errors.
To address the aforementioned issues, we propose a global optical and SAR image registration method based on local distortion division. First, a superpixel-based local distortion division algorithm (SLDD) is introduced. By evaluating the distortion characteristics of different regions in optical and SAR images, the images are partitioned according to their distortion levels, which prevents local distortions from corrupting the global transformation model and improves registration accuracy. We apply a fast multi-scale superpixel segmentation method based on a minimum spanning tree [36] to segment the images and then compute the mean and variance of the gray levels in each superpixel region as its features. The similarity between superpixels in the optical and SAR images is measured with the Mahalanobis distance, and the images are divided according to their distortion levels. Next, inspired by capsule networks, we propose a Multi-Feature Fusion Capsule Network (MFFCN) to extract both shallow salient features and deep detailed features from optical and SAR images. Unlike traditional capsule networks, we set the reconstruction dimension of the capsules to 4, which accelerates the dynamic routing computation and preserves the key features required for control point matching in optical and SAR registration, namely texture, phase, structure, and amplitude. Additionally, a distance metric is defined for the digital capsule feature descriptors to enhance robustness against local distortions, and a loss function based on hard negative sample mining is designed to further improve accuracy on locally distorted images. Finally, the regions divided by SLDD are fed into MFFCN for feature extraction and matching. In contrast to the conventional pipeline of feature extraction, descriptor construction, and matching, we apply region-specific transformation models according to the level of distortion. Furthermore, unlike traditional registration methods (e.g., SIFT, Harris) that depend on handcrafted keypoint extraction and similarity measures, our method leverages superpixel geometric centroids as feature points. Owing to the inherent spatial coherence of superpixels, this yields a more uniform and robust distribution of feature points across the image, capturing comprehensive structural cues and reducing the localization errors commonly induced by the clustering or sparsity of traditional keypoints, thereby enhancing overall registration performance. Specifically, when the distorted regions are categorized into N classes, the image undergoes N region-specific resampling operations, producing N distinct outputs. These regional registration outputs are then fused to obtain a globally aligned result.
The main contributions of this paper are summarized as follows:
- (1)
We propose a local distortion region division algorithm, SLDD, based on superpixel features. This method defines a superpixel region feature and quickly segments optical and SAR images based on the degree of distortion by calculating the Mahalanobis distance between these features.
- (2)
We design an MFFCN that constructs primary capsules by fusing shallow salient features and deep detail features. These capsules are then reconstructed into capsule-based feature descriptors, containing the relevant information required for registration. Additionally, we define a Mahalanobis distance-based similarity metric for the digital capsule feature descriptors and construct a loss function that incorporates hard negative sample mining, enabling the model to achieve the desired training results.
- (3)
We propose a global optical and SAR image registration method based on local distortion division, combining SLDD and MFFCN. This method estimates and segments the distortion in image regions, applies different transformation models to different distortion regions, and ultimately performs global image registration by fusion and reconstruction, thereby overcoming the issue of single registration models being vulnerable to local distortions.
The structure of this paper is as follows: Section 2 introduces the details of the proposed method; Section 3 presents the registration results of our method, along with comparisons to other state-of-the-art methods; finally, discussions and conclusions are given in Section 4 and Section 5.
2. Methodology
In this section, we provide a detailed introduction to the proposed optical and SAR image registration method. Figure 2 illustrates the flowchart of the proposed method, which consists of SLDD and MFFCN and comprises three key steps: local distortion region assessment and partition, feature descriptor extraction and matching, and geometric transformation with fusion reconstruction.
2.1. Superpixel Segmentation
Superpixels aggregate pixels with similar properties in an image: neighboring pixels with similar grayscale values, colors, brightness, textures, and other features are grouped into contiguous regions. Superpixels are generated mainly by clustering similar pixels; unlike pixel-by-pixel processing, superpixel generation algorithms cluster based on local region features, which effectively organizes the image information. As collections of pixels with similar attributes, superpixels share similar feature information internally and can therefore replace pixel-by-pixel processing as a whole in subsequent tasks. Although superpixel algorithms such as Simple Linear Iterative Clustering (SLIC) [37] have achieved good results on natural images, they are difficult to apply directly to SAR image superpixel segmentation [38]. Therefore, we employ the MST-based superpixel segmentation method to segment the reference and sensed images. Compared to conventional superpixel segmentation methods, this approach is better suited to handling the speckle noise and complex texture structures of SAR images.
First, backscatter dissimilarity, edge penalty, and homogeneity dissimilarity are combined to define the overall dissimilarity measure, adapting it to the multiplicative speckle noise characteristic of SAR images. Two 5 × 5 local areas are defined around the central pixels, and the mean amplitude of the samples in each area is computed. Based on the backscatter model, the backscatter dissimilarity is calculated from the ratio of these mean amplitudes, taking into account the number of pixels in the respective regions and the number of observations. Because it is computed from the mean amplitude ratio, this dissimilarity measure exhibits strong robustness against noise.
Edge information is captured by the edge similarity measure (ESM), which is obtained by transforming the mean amplitude ratio between the two windows computed at each pixel. Based on the ESM, an edge penalty function increases the dissimilarity between edge pixels, preventing them from being assigned to the same superpixel.
Additionally, homogeneity information plays an important role in the dissimilarity measurement: the homogeneity dissimilarity is measured by the coefficient of variation (CoV) of each region, i.e., the ratio of the standard deviation to the mean within the region. The three dissimilarity components above are then combined to obtain the final dissimilarity measure.
Finally, the Boruvka algorithm [39] is employed to construct the minimum spanning tree. After obtaining the minimum spanning tree structure, edges with larger weights are cut by setting a dissimilarity threshold, and independent superpixel regions are divided, achieving superpixel segmentation of the image.
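As a rough illustration of this segmentation step, the sketch below builds a pixel graph from precomputed edge dissimilarities, extracts a minimum spanning tree, and cuts heavy edges to form superpixels. It is a minimal sketch assuming the per-edge dissimilarities have already been computed as described above; it uses SciPy's Kruskal-based MST rather than the Boruvka algorithm of [39], and the function and parameter names are our own.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_superpixels(edges, dissimilarities, image_shape, cut_threshold):
    """edges: list of (i, j) pixel-index pairs (e.g., 4-neighborhood);
    dissimilarities: combined dissimilarity value for each edge."""
    n_pixels = image_shape[0] * image_shape[1]
    rows, cols = zip(*edges)
    graph = coo_matrix((dissimilarities, (rows, cols)), shape=(n_pixels, n_pixels))

    # Minimum spanning tree of the pixel graph (SciPy uses Kruskal's algorithm;
    # the edge weights of the resulting tree match those produced by Boruvka's).
    mst = minimum_spanning_tree(graph).tocoo()

    # Cut MST edges whose dissimilarity exceeds the threshold; the connected
    # components that remain are the superpixel regions.
    keep = mst.data <= cut_threshold
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=(n_pixels, n_pixels))
    _, labels = connected_components(pruned, directed=False)
    return labels.reshape(image_shape)
```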
2.2. Superpixel-Based Local Distortion Division (SLDD)
Due to differences in terrain effects, imaging mechanisms, and other factors, the details of optical and SAR images differ in local regions, leading to local distortion. Local distortion makes it challenging for traditional control point selection methods to find robust control points, as these points may exhibit different geometric features in optical and SAR images, degrading registration performance, especially in complex terrain or heterogeneous scenes. This paper proposes an image partitioning method based on local distortion, which calculates the superpixel feature similarity between the reference image and the sensed image and divides each image into local distortion regions and normal regions based on the degree of distortion. Specifically, the mean and variance of the grayscale values are used as the features of superpixels, and the Mahalanobis distance is then employed to measure the similarity between superpixels in optical and SAR images. For a superpixel region $S$ containing $N$ pixels $\{p_1, p_2, \ldots, p_N\}$ with grayscale values $I(p_i)$, the mean and variance of the grayscale values are calculated as
$$\mu_S = \frac{1}{N}\sum_{i=1}^{N} I(p_i), \qquad \sigma_S^2 = \frac{1}{N}\sum_{i=1}^{N} \bigl(I(p_i) - \mu_S\bigr)^2.$$
Mahalanobis distance is used to measure the distance between a point and a distribution, taking into account the covariance structure of the data and the correlations between the data points, which is why it is widely used in multivariate data analysis and pattern recognition. Given two superpixel regions $S_1$ and $S_2$, the feature vector of each superpixel region is represented as
$$F_k = \bigl[\mu_{S_k},\ \sigma_{S_k}^2\bigr]^{T}, \qquad k = 1, 2.$$
To evaluate the similarity between the two superpixel features, the Mahalanobis distance $D_M$ is employed, given by
$$D_M(F_1, F_2) = \sqrt{(F_1 - F_2)^{T}\,\Sigma^{-1}\,(F_1 - F_2)},$$
where $\Sigma$ is the covariance matrix of the feature vectors and $\Sigma^{-1}$ is its inverse matrix. The covariance matrix $\Sigma$ is
$$\Sigma = \begin{bmatrix} \operatorname{Var}(\mu) & \operatorname{Cov}(\mu, \sigma^2) \\ \operatorname{Cov}(\mu, \sigma^2) & \operatorname{Var}(\sigma^2) \end{bmatrix},$$
where $\mu$ is the mean of the grayscale values, $\sigma^2$ is the variance of the grayscale values, and $\operatorname{Cov}(\mu, \sigma^2)$ is the covariance between the mean and variance of the grayscale values.
In summary, given the feature vectors $F_1$ and $F_2$ of two superpixel regions $S_1$ and $S_2$, the superpixel feature value is calculated as
$$V(S_1, S_2) = \sqrt{(F_1 - F_2)^{T}\,\Sigma^{-1}\,(F_1 - F_2)}.$$
We apply the MST algorithm to perform superpixel segmentation on both the reference image and the sensed image. For each superpixel pair, the Mahalanobis distance between the superpixel features is calculated, identifying the position where the feature similarity value $V$ is minimized. This minimum value indicates the highest similarity between the superpixel in the reference image and its counterpart in the sensed image. Based on this similarity measure, we define the criterion for segmenting regions by local distortion in terms of three quantities: $\mu_V$, the average value of the superpixel features in the regions with the smallest feature values; $\sigma_V$, its standard deviation; and $T$, the division threshold.
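One plausible form of this criterion, consistent with the three quantities just defined but stated here only as an illustration rather than the exact rule of the original formulation, is
$$S \in \text{local distortion region} \quad \Longleftrightarrow \quad V_{\min}(S) > \mu_V + T\,\sigma_V,$$
where $V_{\min}(S)$ is the minimum feature similarity value recorded for superpixel $S$; all remaining superpixels are assigned to the normal region.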
The proposed SLDD method proceeds as follows. First, the reference image and the sensed image are segmented into superpixel regions using the MST algorithm. To account for local misalignments between the reference and sensed images, a spatial neighborhood with a radius of 30 pixels around each reference superpixel center is defined. Within this neighborhood in the sensed image, the Mahalanobis distance between the feature vectors of superpixels is computed to find the most similar superpixel, and the resulting minimum similarity value is recorded for each reference superpixel. Finally, an adaptive threshold based on the distribution of all recorded similarity values is used to classify regions into local distortion areas and normal areas.
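For concreteness, the following sketch computes the superpixel features and the Mahalanobis-based similarity search described above, assuming label maps from the MST segmentation are available. The 30-pixel search radius follows the text, while the function and variable names (e.g., `superpixel_features`, `sldd_scores`) and the pooled covariance estimate are our own simplifications.

```python
import numpy as np

def superpixel_features(image, labels):
    """Mean and variance of grayscale values (and centroid) for every superpixel label."""
    feats, centers = {}, {}
    for lab in np.unique(labels):
        mask = labels == lab
        vals = image[mask]
        feats[lab] = np.array([vals.mean(), vals.var()])
        ys, xs = np.nonzero(mask)
        centers[lab] = np.array([ys.mean(), xs.mean()])
    return feats, centers

def sldd_scores(ref_img, ref_labels, sen_img, sen_labels, radius=30):
    """For each reference superpixel, find the most similar sensed superpixel
    within a spatial neighborhood and record the minimum Mahalanobis distance."""
    ref_feats, ref_centers = superpixel_features(ref_img, ref_labels)
    sen_feats, sen_centers = superpixel_features(sen_img, sen_labels)

    # Covariance of the (mean, variance) features, pooled over both images.
    all_feats = np.array(list(ref_feats.values()) + list(sen_feats.values()))
    cov_inv = np.linalg.inv(np.cov(all_feats.T) + 1e-6 * np.eye(2))

    scores = {}
    for lab, f_ref in ref_feats.items():
        best = np.inf
        for lab2, f_sen in sen_feats.items():
            if np.linalg.norm(ref_centers[lab] - sen_centers[lab2]) > radius:
                continue
            d = f_ref - f_sen
            best = min(best, float(np.sqrt(d @ cov_inv @ d)))
        scores[lab] = best
    return scores  # thresholding these scores yields the distortion/normal division
```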
As shown in Figure 3(a1–f1) and Figure 4(a1–f1), the visualization results of the superpixel features from Equation (7) are provided, while Figure 3(a2–f2) and Figure 4(a2–f2) present the heatmaps of the superpixel feature regions of the images to be registered. It can be observed that the proposed superpixel features effectively reflect the image distortion in different superpixel regions, covering typical distortion areas such as field-water boundaries, terrain undulations, building elevation areas, and regions with side-view angle variations. Notably, while most detected distortion regions align with true geometric distortions (e.g., building elevation shifts in urban areas), a small subset of false positives may occur in homogeneous farmland regions. These pseudo-distortion signals primarily originate from cross-modal radiometric differences: optical images exhibit smooth vegetation textures in farmland, whereas SAR images present random speckle patterns due to coherent imaging mechanisms. Thus, we can effectively segment the local distortion regions and normal regions caused by factors such as imaging principles, signal characteristic differences, and terrain effects. As shown in Figure 3(a3–f3) and Figure 4(a3–f3), the red areas indicate local distortion regions, typically including urban areas, terrain, and areas with building elevation variations, while the remaining areas are considered normal regions.
2.3. Multi-Feature Fusion Capsule Network (MFFCN)
To achieve robust and precise matching between optical and SAR images, it is critical to effectively capture spatial features while being resilient to deformations and modality differences. Traditional convolutional neural networks (CNNs) often suffer from feature aliasing, semantic ambiguity, and information loss due to pooling operations. Capsule networks (CapsNets), with their ability to preserve pose information and spatial relationships, offer a promising alternative. In this work, we propose the Multi-Feature Fusion Capsule Network (MFFCN) to specifically address the challenges of multimodal registration. Unlike standard CapsNets, MFFCN is designed to extract structured descriptors aligned with key physical properties, including texture, phase, structure, and amplitude, ensuring better compatibility between optical and SAR features.
CapsNets introduce a dynamic routing mechanism that effectively mitigates information loss and prevents the degradation of feature representations typically caused by pooling operations in conventional CNNs. This mechanism enhances the efficiency of feature aggregation, enabling CapsNets to exhibit improved resistance to noise and local spatial distortions. As a result, the network demonstrates superior generalization capability, allowing for consistent recognition of similar objects across diverse instances. Integrating CapsNets into Siamese network architectures further enhances the expressiveness of feature representations, improves resistance to geometric deformations, and strengthens overall network robustness, while simultaneously reducing information loss. Moreover, the unique dynamic routing mechanism allows CapsNets to capture spatial relationships between features, enhancing overall network performance.
The dynamic routing mechanism, illustrated in Figure 5, is executed through the following iterative procedure. Each capsule is represented by a vector, where the magnitude indicates the probability of a feature’s existence and the direction encodes its attributes. Initially, coupling coefficients between lower-level and higher-level capsules are randomly initialized to facilitate the transmission of feature information. Higher-level capsules generate prediction vectors based on outputs from lower-level capsules as follows:
$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i,$$
where $\hat{\mathbf{u}}_{j|i}$ is the prediction vector for higher-level capsule $j$, $\mathbf{W}_{ij}$ is the connection weight, and $\mathbf{u}_i$ is the output of lower-level capsule $i$.
Dynamic routing updates the connection weights through iterations. During each iteration, outputs from lower capsules are aggregated to form the activation vector of each higher capsule:
$$\mathbf{s}_j = \sum_{i} c_{ij}\,\hat{\mathbf{u}}_{j|i}.$$
The aggregated vector is then transformed via a nonlinear activation function, typically the Squash function:
$$\mathbf{v}_j = \frac{\lVert \mathbf{s}_j \rVert^2}{1 + \lVert \mathbf{s}_j \rVert^2}\,\frac{\mathbf{s}_j}{\lVert \mathbf{s}_j \rVert}.$$
Subsequently, the routing weights are updated based on the consistency between the outputs of higher-level capsules and the prediction vectors:
$$b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i}\cdot\mathbf{v}_j, \qquad c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}.$$
This iterative routing continues until convergence, resulting in final capsule outputs that effectively encode complex feature representations.
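As an illustration of this routing loop, the short NumPy sketch below implements the equations above for a set of precomputed prediction vectors. It is a minimal sketch: the routing logits are initialized to zero here for simplicity, and the function names are ours rather than the paper's.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: prediction vectors, shape (num_lower, num_higher, dim)."""
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))                     # routing logits
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted aggregation
        v = squash(s)                                         # Squash nonlinearity
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v                                                  # higher-level capsule outputs
```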
Although capsule networks offer significant advantages in feature representation and spatial transformation handling, certain challenges remain when applying them to optical and SAR image registration tasks. Typically, digital capsules have higher dimensionality than primary capsules to accommodate complex feature encoding and richer feature combinations. However, this increase in dimensionality complicates the feature representation process, adds difficulty to model design and hyperparameter tuning, and may yield abstract features that do not align with their actual physical meanings. To address this issue, we set the dimensionality of the digital capsules in our proposed MFFCN to 4, corresponding to the key feature information of texture, phase, structure, and amplitude in optical and SAR image registration. Additionally, to maintain dimensional consistency with other feature descriptors, the feature dimension was set to 32 by replacing the original 1 × 128 vector with a 4 × 32 matrix representation. The detailed structure of the proposed MFFCN is shown in Figure 6, with specific network parameters listed in Table 1.
Due to the special form and significance of the feature descriptors extracted by capsule networks, they cannot simply be reshaped into one-dimensional vectors and compared with the classical Euclidean distance. We define a feature descriptor in the optical image as $F_O \in \mathbb{R}^{4 \times 32}$ and its corresponding feature descriptor in the SAR image as $F_S \in \mathbb{R}^{4 \times 32}$.
Unlike the Euclidean distance, which treats all feature dimensions as independent and equally scaled, the Mahalanobis distance accounts for the correlations between different feature components by incorporating their covariance structure. This is particularly important when comparing optical and SAR modalities, where the feature distributions often exhibit anisotropy and varying inter-dimensional dependencies. By considering the intrinsic statistical relationships among features, the Mahalanobis distance provides a more accurate and discriminative measurement of cross-modal similarity, thus enhancing the robustness of feature matching between optical and SAR domains. For the feature descriptor $F_O$ in the optical image and the feature descriptor $F_S$ in the SAR image, with $F_O^{(k)}$ and $F_S^{(k)}$ denoting their $k$-th columns, the Mahalanobis distance $D_M$ is defined as
$$D_M(F_O, F_S) = \sqrt{\sum_{k=1}^{32}\bigl(F_O^{(k)} - F_S^{(k)}\bigr)^{T}\,\Sigma^{-1}\,\bigl(F_O^{(k)} - F_S^{(k)}\bigr)},$$
where $\Sigma$ is the 4 × 4 covariance matrix of the feature descriptors, describing the correlation between features, and $\Sigma^{-1}$ is its inverse.
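A compact NumPy rendering of this descriptor comparison follows. It assumes the column-wise form written above (our reading of how the 4 × 4 covariance enters the distance), with the covariance matrix estimated from a batch of training descriptors; all names are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def descriptor_covariance(descriptors, eps=1e-6):
    """descriptors: array of shape (num_samples, 4, 32); returns a 4 x 4 covariance
    over the texture/phase/structure/amplitude dimensions."""
    rows = descriptors.transpose(1, 0, 2).reshape(4, -1)   # (4, num_samples * 32)
    return np.cov(rows) + eps * np.eye(4)                  # regularized for invertibility

def mahalanobis_descriptor_distance(f_opt, f_sar, cov):
    """f_opt, f_sar: 4 x 32 capsule descriptors; cov: 4 x 4 covariance matrix."""
    diff = f_opt - f_sar                                         # (4, 32)
    cov_inv = np.linalg.inv(cov)
    per_column = np.einsum('dk,de,ek->k', diff, cov_inv, diff)   # quadratic form per column
    return float(np.sqrt(per_column.sum()))
```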
To further enhance model training, we apply Hard Negative Mining, which prioritizes difficult negative examples during optimization to strengthen discriminative learning. Specifically, negative samples close to positive pairs in Mahalanobis distance but belonging to different classes are selected for enhanced supervision. By focusing training on these “hard” samples, the model improves its discriminative ability. The corresponding Mahalanobis distance loss incorporating hard negative mining is formulated as
$$L = \max\bigl(D_M^{pos} - D_M^{neg} + m,\; 0\bigr),$$
where $D_M^{pos}$ is the Mahalanobis distance between positive examples (correctly matched optical and SAR images), $D_M^{neg}$ is the Mahalanobis distance between negative examples (incorrectly matched optical and SAR images), and $m$ is a margin used to control the distance difference between positive and negative examples, ensuring that the distance for positive examples is less than that for negative examples. The operator $\max(\cdot,\,0)$ returns the value $D_M^{pos} - D_M^{neg} + m$ when it is positive and 0 otherwise.
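The sketch below expresses this hinge loss with within-batch hard negative selection in NumPy, assuming a precomputed pairwise Mahalanobis distance matrix between optical and SAR descriptors in a batch. The mining rule (hardest negative per anchor) is one common choice and is our assumption, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def hard_negative_margin_loss(dist_matrix, margin=1.0):
    """dist_matrix[i, j]: Mahalanobis distance between optical descriptor i and
    SAR descriptor j; matched (positive) pairs lie on the diagonal."""
    n = dist_matrix.shape[0]
    d_pos = np.diag(dist_matrix)                       # distances of correct matches

    # For each anchor, the hardest negative is the closest non-matching descriptor.
    off_diag = dist_matrix + np.eye(n) * 1e9           # mask out the positives
    d_neg = off_diag.min(axis=1)

    # Hinge loss: the positive distance should be smaller than the hard-negative
    # distance by at least the margin.
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```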
The MFFCN module is integrated into the overall registration framework as a dedicated feature extraction branch. It replaces the traditional feature extraction path, enabling the network to generate structured descriptors that better capture spatial and semantic attributes. By combining the spatially aware features from capsules with conventional texture-based features, the system achieves improved robustness to geometric deformation, modality gaps, and local distortions. Specifically, the input image patches are first processed through three convolutional modules, which progressively extract local detail features such as texture and edges while preserving the spatial structure. The resulting feature maps are then reshaped and passed into the Primary Capsule layer, where they are transformed into low-level capsule units that encode preliminary pose and structural information. Subsequently, the Digital Capsule layer aggregates these low-level capsules into structured high-dimensional feature descriptors using a dynamic routing mechanism. The final output is represented as a 4 × 32 matrix, where each row corresponds to a distinct physical attribute: texture, phase, structure, and amplitude, respectively. This process is independently conducted on both the reference image and the sensed image using two separate MFFCN branches without shared parameters. The feature descriptors output from the two branches are then compared using the Mahalanobis distance, enabling high-precision feature matching across modalities and under varying geometric distortions.
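To visualize how such a branch could be wired up, here is a schematic PyTorch sketch following the description above (three convolutional modules, a primary capsule layer, and a 4 × 32 digital capsule output). Channel counts, kernel sizes, and the simplified aggregation used in place of full dynamic routing are placeholders of our own; the actual parameters are those listed in Table 1.

```python
import torch
import torch.nn as nn

class MFFCNBranch(nn.Module):
    """Schematic branch: conv modules -> primary capsules -> 4 x 32 descriptor."""
    def __init__(self, in_channels=1, primary_dim=8):
        super().__init__()
        self.convs = nn.Sequential(                        # local detail features (texture, edges)
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.primary = nn.Conv2d(128, 16 * primary_dim, 3, stride=2, padding=1)
        self.primary_dim = primary_dim
        # Transformation weights mapping primary capsules to 32 digital capsules of dim 4.
        self.W = nn.Parameter(0.01 * torch.randn(32, primary_dim, 4))

    def forward(self, x):
        feat = self.primary(self.convs(x))                      # (B, 16*primary_dim, H', W')
        u = feat.view(feat.size(0), -1, self.primary_dim)       # primary capsules (B, N, 8)
        u_hat = torch.einsum('bnd,jdk->bnjk', u, self.W)        # prediction vectors (B, N, 32, 4)
        v = u_hat.mean(dim=1)                                   # placeholder for dynamic routing
        return v.transpose(1, 2)                                # descriptor (B, 4, 32)

# Two separate branches (no shared parameters) process the reference and sensed patches.
optical_branch, sar_branch = MFFCNBranch(), MFFCNBranch()
```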
2.4. Fusion Reconstruction
In conventional image registration processes, a global transformation model is often used to describe the spatial relationship between images. However, due to the inherent differences between optical and SAR images in terms of land cover representation and imaging principles, a single model struggles to handle local distortions effectively. Local distortions induced by various surface types, such as differences in the appearance of vegetation and buildings, often introduce significant deviations in the registration process. Therefore, registering optical and SAR images requires both a global perspective and flexible local models to accommodate regional variations.
In this paper, we propose a global optical and SAR image registration method based on local distortion division. First, the distortion in different regions of the optical and SAR images is evaluated, with SLDD used to estimate image distortions and divide the image into normal and local distortion areas. Then, MFFCN is used to match feature descriptors in each region, and separate transformation models are constructed to obtain registration results for both types of area. The registration result of the local distortion area is denoted conseqL and referred to as the first registration result, while the registration result of the normal area is denoted conseqN and referred to as the second registration result. As shown in Figure 7, the first registration result provides better alignment in locally distorted regions, whereas the second registration result achieves superior performance in normal regions. Finally, based on the transformed images of these two types of regions and their corresponding superpixel segmentation results, the final transformed sensed image is obtained using a superpixel replacement strategy, as illustrated in Figure 8.
Specifically, superpixel segmentation is performed on both the first and second registration results to obtain the first and second superpixel segmentation results, respectively. The target superpixels within the local distortion area are identified based on the first superpixel segmentation result. These target superpixels are then replaced with the corresponding superpixels from the normal area, as determined from the second segmentation result, while the remaining superpixels in the general area are preserved. Through this process of fusion and reconstruction, we integrate the registration results of the two regions to generate the final registered image. It is worth noting that, since both registration results are resampled based on the reference image, there are no visible seams at the boundaries between different region categories.
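A minimal sketch of this superpixel replacement is given below. It assumes the reading suggested by Figure 7, namely that superpixels flagged as locally distorted are taken from the first registration result and all remaining superpixels from the second; array and function names are illustrative.

```python
import numpy as np

def fuse_by_superpixel(conseq_l, conseq_n, labels, distorted_ids):
    """conseq_l / conseq_n: first (distortion-area) and second (normal-area)
    registration results resampled to the reference grid; labels: superpixel
    label map on that grid; distorted_ids: labels classified as locally distorted."""
    fused = conseq_n.copy()                           # start from the normal-area result
    mask = np.isin(labels, list(distorted_ids))       # pixels of distorted superpixels
    fused[mask] = conseq_l[mask]                      # take those regions from the first result
    return fused
```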