Optical and SAR Image Registration Based on Pseudo-SAR Image Generation Strategy

Abstract: The registration of optical and SAR images has always been a challenging task due to the different imaging mechanisms of the corresponding sensors. To mitigate this difference, this paper proposes a registration algorithm based on a pseudo-SAR image generation strategy and an improved deep learning-based network. The method consists of two stages: a pseudo-SAR image generation strategy and an image registration network. In the pseudo-SAR image generation section, an improved Restormer network is used to convert optical images into pseudo-SAR images. An L2 loss function is adopted in the network; this loss function fluctuates less around the optimal point, making it easier for the model to reach the fitting state. In the registration part, the ROEWA operator is used to construct the Harris scale space for the pseudo-SAR and real SAR images, respectively, and each extreme point in the scale space is extracted and added to the keypoint set. The image patches around the keypoints are selected and fed into the network to obtain the feature descriptors. The pseudo-SAR and real SAR images are matched according to the descriptors, and outliers are removed by the RANSAC algorithm to obtain the final registration result. The proposed method is tested on a public dataset. The experimental analysis shows that the average value of NCM surpasses that of similar methods by over 30%, and the average value of RMSE is lower than that of similar methods by more than 0.04. The results demonstrate that the proposed strategy is more robust than other state-of-the-art methods.


Introduction
Optical and synthetic aperture radar (SAR) images are two types of products formed by distinct sensors. Optical images result from the passive reception of naturally reflected light, while SAR images are generated by actively transmitting and receiving radar electromagnetic waves. The quality of optical images is heavily affected by cloud cover and nighttime conditions. In contrast, SAR is capable of observing the earth under various weather conditions and demonstrates strong performance during day and night. Therefore, it is necessary to jointly use the effective information from these two different imaging sensors. Thus, how to match the two kinds of images becomes the top priority. However, SAR images often exhibit unique manifestations, such as shadows, layover, and foreshortening, which are caused by their special imaging mechanisms. These imaging disparities pose challenges to the registration of optical and SAR images.
In recent years, many researchers have proposed diverse approaches for matching optical and SAR images [1]. Presently, three primary frameworks for the image registration have been established [2]: area-based matching [3], feature-based matching [4], and deep learning-based matching methods [5].
The method proposed in this paper first employs a Transformer model for pseudo-SAR image generation; then, it seeks registration between the pseudo-SAR and real SAR images.
To address the registration of homologous SAR images with traditional methods, Schwind et al. [36] combined the Best-Bin-First algorithm with SIFT for SAR image registration. Dellinger et al. proposed the SAR-SIFT [37] method, which replaced the DoG pyramid with the Harris pyramid and utilized the ratio gradient [38,39], achieving good results in SAR image registration. These traditional methods based on SAR image gray-scale variations struggle to detect effective keypoints in SAR images with narrow dynamic ranges, which can sometimes lead to unsatisfactory results. In the field of deep learning, Du et al. [40] introduced FM-CycleGAN to achieve feature-matching consistency. Ye et al. [5] achieved remote sensing image registration by fusing SIFT and CNN feature descriptors. However, as the convolutional layers deepen, detail features in SAR images are gradually lost. Effectively utilizing these detail features has become a research direction in SAR image registration. Yun et al. [23] presented an improved Siamese [19] network, called MatchosNet, to avoid the loss of detail features. In the proposed framework, a refined scheme including MatchosNet is constructed in the registration stage between pseudo-SAR and real SAR images.
Inspired by the Transformer and MatchosNet, this paper presents a novel method for the registration of optical and SAR images, adopting a pseudo-SAR generation strategy and an improved registration network between pseudo-SAR and real SAR images. The overall methodology is illustrated in Figure 1. In the first step, an improved Restormer [41] network is utilized to transform optical images into pseudo-SAR images. This network originates from the Transformer and comprises several encoding and decoding blocks, each equipped with self-attention mechanisms. These mechanisms effectively capture the feature differences present in the local features of optical and SAR images. To enhance performance, we adopt the L2 loss function, which facilitates better estimation of the similarity between the pseudo-SAR and real SAR images, aiding the convergence of the network weights toward an optimal fitting state. In the second step, the registration is conducted between the pseudo-SAR and real SAR images. Initially, the ROEWA [16] operator is applied to construct a multi-scale Harris scale space for both images. Subsequently, extremal points are selected from each scale space and incorporated into the keypoint sets obtained from the pseudo-SAR and real SAR images. The ROEWA operator effectively extracts informative features from SAR images, while the Harris scale space construction facilitates the extraction of keypoints at multiple scales, thereby enhancing the robustness of the extracted keypoints. For each keypoint in the pseudo-SAR and real SAR image keypoint sets, an image patch surrounding it is extracted and fed into the MatchosNet network. The network employs deep feature extraction, resulting in the generation of robust descriptor vectors. MatchosNet utilizes a twin-branch network with shared weights, enabling the maximum utilization of feature information from both pseudo-SAR and real SAR images, thereby producing optimal descriptor matches.
Finally, the RANSAC [42] algorithm is employed to eliminate outlier matching pairs. The contributions of this paper are as follows:

•	In the pseudo-SAR generation strategy, this paper uses the improved Restormer network to eliminate the feature differences between optical and SAR images.
•	For the registration part, a refined keypoint extraction method using the ROEWA operator is designed to construct the Harris scale space and extract the extreme points in each scale.
The remaining sections of this paper are organized as follows. Section 2 provides a detailed description of the proposed method. The results of the registration and ablation experiments are presented in Section 3. Section 4 provides a discussion of the experimental results and research prospects. Conclusions are given in Section 5.

Materials and Methods
As Figure 1 shows, the schematic diagram of our proposed method consists of two main parts. Firstly, this paper adopts an improved Restormer to accomplish the pseudo-SAR generation strategy, transforming optical images into pseudo-SAR images. Secondly, the Harris scale space is constructed for both pseudo-SAR and real SAR images using the ROEWA operator, and extremal points are extracted in each scale space. Then, image patches are selected around the keypoints and input into the MatchosNet network to obtain robust descriptors. Based on the feature descriptors, keypoints are matched between the pseudo-SAR and real SAR images, and the RANSAC algorithm is utilized to eliminate outliers, producing the final matching results.

Pseudo-SAR Image Generation Strategy

Network Architecture
Inspired by the Restormer network, a deep learning-based transformation method is proposed. Specifically, an encoder network is employed to extract feature information from the optical image, and a decoder is used to decode the feature information. Within the encoder-decoder network, the transposed self-attention mechanism is used to enhance the model's robustness to the feature differences between optical and SAR images. Finally, a pseudo-SAR image consistent with the feature imaging of real SAR images is obtained. The specific network structure is illustrated in Figure 2.

As shown in Figure 2, the input to the network is an optical image to be transformed, and the output is a pseudo-SAR image. The input image is first processed by convolution to expand the number of channels and obtain a high-dimensional feature matrix. These features are then transformed into deeper feature maps through a symmetric encoder-decoder at each level. There are a total of four levels of corresponding encoder and decoder blocks. Each level of the encoder and decoder has multiple Transformer blocks. The number of Transformer blocks gradually increases from top to bottom, mainly for deep feature extraction. The attention mechanism establishes a connection between local and global features. The encoder uses downsampling to continuously reduce the spatial size of the input image and increase the feature dimension. The decoder employs upsampling to progressively enlarge the image while compressing the feature dimensions. In order to transfer the features extracted from each downsampling layer, skip connections are added after each downsampling layer, and the feature matrices obtained by upsampling are concatenated in the channel dimension and compressed by convolution. Then, the feature matrix is further refined through several Transformer blocks and a convolution layer. To better learn the differences between the optical and SAR image features, the network introduces an element-wise operation between the original optical image features and the high-level feature matrix to help restore the lost texture and semantic details in the image. Finally, the network outputs the generated pseudo-SAR image.
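To make the pyramid described above concrete, the following sketch tracks feature-map shapes through a four-level symmetric encoder-decoder. The specific numbers (a 128 × 128 input and 48 base channels) are illustrative assumptions, not the paper's actual configuration:

```python
# Shape bookkeeping for a 4-level symmetric encoder-decoder.
# All numbers are illustrative; the actual network configuration may differ.
H, W, C = 128, 128, 48            # feature map after the initial convolution

encoder_shapes = []
h, w, c = H, W, C
for level in range(4):
    encoder_shapes.append((h, w, c))
    if level < 3:                 # downsampling: halve spatial size, double channels
        h, w, c = h // 2, w // 2, c * 2

decoder_shapes = []
for level in range(2, -1, -1):    # upsampling: double spatial size, halve channels
    h, w, c = h * 2, w * 2, c // 2
    # Skip connection: concatenating the encoder features doubles the channel
    # count, after which a 1x1 convolution compresses it back down to c.
    c_after_concat = c + encoder_shapes[level][2]
    decoder_shapes.append((h, w, c))

print(encoder_shapes[-1], decoder_shapes[-1])
```

Under these assumptions, the bottleneck reaches 16 × 16 × 384 and the decoder restores the original 128 × 128 × 48 shape, matching the "reduce space, increase channels" behavior described in the text.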
The structure of the Transformer block is shown in Figure 3, which includes two modules, MDTA [41] and GDFN. The MDTA module is Multi-Dconv Head Transposed Attention. The main network structure of this module is a self-attention mechanism. Firstly, the input feature matrix is normalized, and then the Q, K, and V projections are generated through 1 × 1 point-wise convolution and 3 × 3 depth-wise convolution, respectively. Here, Q is the query projection, K is the key projection, and V is the value projection. Then, the Q and K matrices are multiplied to obtain the Transposed Attention Map. The Transposed Attention Map is multiplied with V, and the resulting matrix is passed through a 1 × 1 point-wise convolution to generate the output matrix of MDTA. The principle formula is as follows:

X̂ = W_p · Attention(Q̂, K̂, V̂) + X
Attention(Q̂, K̂, V̂) = V̂ · Softmax(K̂ · Q̂ / α)

where X and X̂ are the input and output feature maps, respectively, W_p denotes the 1 × 1 point-wise convolution, α is a learnable scaling parameter, and the Q̂ ∈ R^(ĤŴ×Ĉ), K̂ ∈ R^(Ĉ×ĤŴ), and V̂ ∈ R^(ĤŴ×Ĉ) matrices are obtained by reshaping the original matrix in R^(Ĥ×Ŵ×Ĉ). The GDFN module is the Gated Dconv Feedforward Network [41]. As shown in Figure 3, the GDFN structure is divided into two parallel paths. Both paths undergo 1 × 1 convolution and 3 × 3 depth-wise convolution.
One of the paths is nonlinearly activated by the GELU. Finally, the outputs of the two paths are multiplied element-wise and passed through a 1 × 1 convolution layer. The resulting matrix is then added element-wise to the input matrix to obtain the output matrix. The corresponding formula for this module is as follows:

X̂ = W_p^0 · Gating(X) + X
Gating(X) = φ(W_d^1 W_p^1 LN(X)) ⊙ (W_d^2 W_p^2 LN(X))

where ⊙ represents the element-wise multiplication of the vectors, φ represents the nonlinear GELU function, W_p and W_d denote the 1 × 1 point-wise and 3 × 3 depth-wise convolutions, respectively, and LN is a layer normalization operation.
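The key property of MDTA is that attention is computed across channels rather than across pixels, so the attention map is C × C instead of HW × HW. The NumPy sketch below illustrates this on a flattened feature map; the projection matrices stand in for the 1 × 1 and depth-wise convolutions, and multi-head splitting is omitted for brevity (all names here are hypothetical, not from the paper's code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transposed_attention(x, wq, wk, wv, alpha=1.0):
    """Channel-wise ('transposed') self-attention sketch.

    x: (HW, C) flattened feature map; wq/wk/wv: (C, C) matrices standing in
    for the point-wise + depth-wise convolution projections of MDTA.
    """
    q = x @ wq                                # (HW, C) query projection
    k = (x @ wk).T                            # (C, HW) transposed key projection
    v = x @ wv                                # (HW, C) value projection
    attn = softmax(k @ q / alpha, axis=-1)    # (C, C) attention over channels
    return v @ attn                           # (HW, C) output feature map

rng = np.random.default_rng(0)
hw, c = 16, 4
x = rng.standard_normal((hw, c))
w = [rng.standard_normal((c, c)) for _ in range(3)]
out = transposed_attention(x, *w)
```

Because the attention map is only C × C, the cost of this step is independent of the spatial resolution, which is what makes the mechanism practical for full-resolution image restoration.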

Pseudo-SAR Generation Network Loss Function
In the application of Restormer to optical images, an L1 loss function is usually employed. Compared to the L1 loss function, the L2 loss function emphasizes the penalization of erroneous pixels, resulting in smoother fluctuations around the best fit in the model [43,44]. The later experiments will further demonstrate its effectiveness. Therefore, the L2 loss function is used here to calculate the difference between the pseudo-SAR and real SAR images. The formula for the L2 loss function is given below:

L2 = (1/(nm)) Σ_{i=1}^{n} Σ_{j=1}^{m} (y_ij − f(x)_ij)²

where y_ij is the pixel value of the real SAR image at coordinates (i, j), and f(x)_ij is the pixel value of the pseudo-SAR image at coordinates (i, j). n and m represent the size parameters of the images.
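As a quick illustration, the L2 loss above can be computed directly with NumPy (a minimal sketch; `l2_loss` is a hypothetical helper name, not part of the paper's code):

```python
import numpy as np

def l2_loss(real_sar, pseudo_sar):
    """Mean squared pixel difference between the real and generated images."""
    real = np.asarray(real_sar, dtype=np.float64)
    fake = np.asarray(pseudo_sar, dtype=np.float64)
    return np.mean((real - fake) ** 2)

y = np.array([[1.0, 2.0], [3.0, 4.0]])    # real SAR patch
fx = np.array([[1.0, 2.0], [3.0, 2.0]])   # pseudo-SAR patch
print(l2_loss(y, fx))
```

Squaring the residuals is what makes large pixel errors dominate the loss, which is the penalization behavior the text attributes to L2 over L1.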

Pseudo-SAR Generation Performance Evaluation
The performance evaluation of the pseudo-SAR generation strategy is conducted using a combination of subjective visual assessment and objective evaluation metrics. The objective quantitative evaluation indicators used for testing are the Average Gradient (AG), Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS) [45], and Mean Absolute Error (MAE). The pseudo-SAR images and the results of these indicators will also be presented in the ablation experiments.
(1) AG
The Average Gradient index is calculated using the following formula:

AG = (1/((m − 1)(n − 1))) Σ_{i=1}^{m−1} Σ_{j=1}^{n−1} sqrt((ΔI_x² + ΔI_y²) / 2)

where m and n are the size parameters of the image, and ΔI_x and ΔI_y are the differences along the horizontal and vertical coordinates, respectively. The AG reflects the difference in gray-scale near the edges of an image and is used to measure the clarity of the image; a larger value represents a clearer image.
(2) SSIM
The formula for SSIM is given as follows:

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))

where μ_x is the mean of x, μ_y is the mean of y, σ_x² is the variance of x, σ_y² is the variance of y, σ_xy is the covariance of x and y, and c_1 and c_2 are used to maintain stability in the dynamic range of pixel values. For c_1 = (k_1 L)² and c_2 = (k_2 L)², L is the dynamic range of the pixel values, and k_1 and k_2 are hyperparameters, generally set to 0.01 and 0.03. In this evaluation indicator, a value closer to 1 indicates a higher similarity between the pseudo-SAR and the real SAR image.
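A direct transcription of this formula is shown below. Note that it computes a single global SSIM over the whole image, whereas common implementations average SSIM over sliding windows; the global form is used here only to mirror the equation above:

```python
import numpy as np

def ssim_global(x, y, L=255, k1=0.01, k2=0.03):
    """Global SSIM as written in the formula (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

a = np.tile(np.arange(8.0) * 30.0, (8, 1))   # simple gradient test image
print(ssim_global(a, a))                      # identical images score ~1
```

For identical inputs the numerator and denominator coincide, so the score is 1, the maximum of the index.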

(3) PSNR
Regarding PSNR, the formula is given as follows:

MSE = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} (y_ij − f(x)_ij)²
PSNR = 10 · log_10(MAX_I² / MSE)

where MSE refers to the mean square error, m and n represent the dimensions of the image, and i and j represent the positions of the pixels. MAX_I represents the maximum pixel value. Generally, a higher PSNR value represents a better similarity between the pseudo-SAR and real SAR images.
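The PSNR computation can be sketched in a few lines of NumPy (a minimal illustration; `psnr` is a hypothetical helper name):

```python
import numpy as np

def psnr(real_sar, pseudo_sar, max_i=255.0):
    """PSNR from the MSE between the two images (higher = more similar)."""
    diff = np.asarray(real_sar, dtype=np.float64) - np.asarray(pseudo_sar, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")       # identical images: PSNR diverges
    return 10.0 * np.log10(max_i ** 2 / mse)

worst = psnr(np.zeros((4, 4)), np.full((4, 4), 255.0))  # maximally different images
```

Because PSNR is a log-ratio against MAX_I², two images differing everywhere by the full dynamic range score exactly 0 dB, and identical images diverge to infinity.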

(4) LPIPS
In this paper, LPIPS [45] represents the distance between pseudo-SAR and real SAR image features. LPIPS utilizes a CNN network to extract features from images and calculates distances using these features. A smaller value of LPIPS indicates a higher similarity between the two images. The formula is shown as follows:

d(x, x_0) = Σ_l (1/(H_l W_l)) Σ_{h,w} ‖ w_l ⊙ (ŷ^l_hw − ŷ^l_0hw) ‖₂²

where x and x_0 represent the pseudo-SAR and real SAR image, respectively. ŷ^l_hw and ŷ^l_0hw denote the features extracted from the CNN network at the l-th layer. w_l represents the weights of the l-th layer of the CNN network. H and W represent the size parameters of the image.

(5) MAE
MAE represents the difference value between pseudo-SAR and real SAR images and can be expressed using the following formula:

MAE = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} | y_ij − f(x)_ij |

where i and j represent the coordinates of a pixel, and m and n are the dimension parameters of the image. A smaller MAE value indicates a higher similarity between the pseudo-SAR and real SAR image.
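The two remaining pixel-level indicators, AG and MAE, are simple enough to compute directly with finite differences and absolute values. The sketch below follows the formulas above (hypothetical helper names, and AG uses forward differences as one common convention):

```python
import numpy as np

def average_gradient(img):
    """Average Gradient (AG): mean magnitude of the local finite differences."""
    img = np.asarray(img, dtype=np.float64)
    dx = img[1:, :-1] - img[:-1, :-1]   # difference along rows
    dy = img[:-1, 1:] - img[:-1, :-1]   # difference along columns
    return np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0))

def mae(real_sar, pseudo_sar):
    """Mean Absolute Error between real and pseudo-SAR images."""
    diff = np.asarray(real_sar, dtype=np.float64) - np.asarray(pseudo_sar, dtype=np.float64)
    return np.mean(np.abs(diff))

ramp = np.tile(np.arange(4.0), (4, 1))  # unit step per column -> constant gradient
```

On the `ramp` test image every horizontal difference is 1 and every vertical difference is 0, so AG evaluates to 1/√2 at every pixel; a constant image gives AG = 0.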

Parameter Analysis
During the model training process, it is necessary to determine the training iteration value, learning rate, and optimizer. The training iteration is set to an empirical value. As for the learning rate and optimizer settings, this paper follows the strategies mentioned in the literature [41]. The specific setting approach is provided in Section 3.1.2.

Image Registration
A point-matching-based registration framework is adopted in this paper. The framework includes keypoint extraction, feature descriptor construction, feature descriptor matching, and RANSAC to remove outliers. The flow chart of this method is shown in Figure 4. The overall process of the registration scheme is described below. First, keypoints are extracted from the real SAR and pseudo-SAR images. Secondly, image patches are extracted centered around the keypoints and fed into the MatchosNet to extract feature descriptors. Then, the keypoints are matched according to the descriptors. Finally, the outliers are removed by RANSAC [42] to obtain the final registration result.
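The outlier-removal step can be illustrated with a minimal RANSAC loop. For brevity this sketch fits a pure-translation model (one correspondence per hypothesis), whereas the actual pipeline would typically fit an affine or projective transform; all names here are hypothetical:

```python
import numpy as np

def ransac_translation(src, dst, n_iter=200, thresh=1.0, rng=None):
    """Minimal RANSAC sketch for a pure-translation model.

    src, dst: (N, 2) arrays of matched keypoint coordinates.
    Returns the estimated translation and a boolean inlier mask.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        i = rng.integers(len(src))            # 1 correspondence fixes a translation
        t = dst[i] - src[i]
        err = np.linalg.norm(src + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on all inliers for the final estimate
    best_t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return best_t, best_inliers

# Synthetic matches: true shift (5, -3); the last two pairs are outliers.
src = np.array([[0., 0.], [1., 2.], [3., 1.], [4., 4.], [2., 2.], [9., 9.]])
dst = src + np.array([5., -3.])
dst[4] = [40., 40.]
dst[5] = [-7., 1.]
t, inliers = ransac_translation(src, dst)
```

The consensus step is what makes the estimate robust: hypotheses drawn from the two contaminated pairs only ever explain themselves, while any hypothesis drawn from a correct pair explains all four correct pairs at once.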

Keypoint Detection
The registration framework proposed by Yun et al. [23], based on MatchosNet, utilizes the Difference of Gaussians (DoG) operator to extract keypoints from optical and SAR images. However, directly applying the DoG operator to SAR images with significant speckle noise leads to the inability to extract repeatable keypoints, affecting the orientation assignment and descriptor construction and resulting in registration errors [14]. The ROEWA operator proposed by Fjortoft et al. [16] is commonly used for the edge detection of SAR images with good results. It uses gradient by ratio (GR) instead of a differential gradient, which consists of two orthogonal one-dimensional filters that form a two-dimensional separable filter. In this paper, the ROEWA [16] operator is used to extract images at different scales to help establish the Harris scale space. The comparison between our proposed keypoint extraction method and the DoG operator is given in the ablation experiment section.
The calculation process of constructing the Harris scale space using the ROEWA operator is described by Equations (9)-(13). First, a series of scale parameters, denoted as α, is constructed:

α_i = α_0 · c^i    (9)

where i represents the spatial layer of the scale, c is a constant, α denotes the scale space parameter, and α_0 represents the initial value of the scale space parameter.
Hereafter, the ROEWA operators oriented in the horizontal and vertical directions are defined as the ratios of the exponentially weighted averages of the opposite half-windows:

R_h,α(x, y) = ( Σ_{i=−M}^{M} Σ_{j=1}^{N} I(x + i, y + j) · e^(−(|i|+|j|)/α) ) / ( Σ_{i=−M}^{M} Σ_{j=1}^{N} I(x + i, y − j) · e^(−(|i|+|j|)/α) )

R_v,α(x, y) = ( Σ_{i=1}^{M} Σ_{j=−N}^{N} I(x + i, y + j) · e^(−(|i|+|j|)/α) ) / ( Σ_{i=1}^{M} Σ_{j=−N}^{N} I(x − i, y + j) · e^(−(|i|+|j|)/α) )    (10)

where M and N are the size of the sliding processing window, I(x, y) represents the pixel intensity of the image, and x and y represent the coordinates of the center point. R_h,α and R_v,α denote the horizontal and vertical ROEWA operators, respectively.
For the next step, the horizontal and vertical gradients are calculated using R_h,α and R_v,α, respectively:

G_h,α = log(R_h,α),  G_v,α = log(R_v,α)    (11)

where G_h,α and G_v,α represent the horizontal and vertical gradients, respectively. Finally, the Harris scale space is constructed by the following formulas:

C_SH(x, y, α) = g_{√2·α} ∗ [ G_h,α²  G_h,α·G_v,α ; G_h,α·G_v,α  G_v,α² ]    (12)

R_SH(x, y, α) = det(C_SH) − d · tr(C_SH)²    (13)

where g_{√2·α} represents the Gaussian convolution kernel with scale α, ∗ denotes the convolution operation, d is a hyperparameter, and R_SH represents the Harris scale space. Then, in each scale level of the Harris scale space, local extrema are found as candidate keypoints and added to the keypoint set. The flow chart for extracting keypoints is illustrated in Figure 5.
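The steps above can be sketched in NumPy. This is a simplified, loop-based variant written for clarity, not the separable-filter implementation: the half-window averages use 1-D exponential weights, the gradient is the log-ratio of the two sides, and a box filter stands in for the Gaussian kernel of Equation (12). All function names are hypothetical:

```python
import numpy as np

def exp_kernel(alpha, radius):
    """Normalized 1-D exponential weights e^(-|t|/alpha) for t = -radius..radius."""
    t = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-np.abs(t) / alpha)
    return k / k.sum()

def roewa_gradients(img, alpha=2.0, radius=8, eps=1e-8):
    """Gradient by ratio: log-ratio of exponentially weighted means computed
    on opposite half-windows (simplified ROEWA, illustration only)."""
    img = np.asarray(img, dtype=np.float64)
    pad = np.pad(img, radius, mode="edge")
    k = exp_kernel(alpha, radius)
    g_h = np.zeros_like(img)
    g_v = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            win = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rows = win @ k            # weighted along x -> profile over rows
            cols = k @ win            # weighted along y -> profile over columns
            up, down = rows[:radius] @ k[:radius], rows[radius + 1:] @ k[radius + 1:]
            left, right = cols[:radius] @ k[:radius], cols[radius + 1:] @ k[radius + 1:]
            g_h[y, x] = np.log((down + eps) / (up + eps))
            g_v[y, x] = np.log((right + eps) / (left + eps))
    return g_h, g_v

def harris_response(g_h, g_v, d=0.04, radius=2):
    """Harris response det(C) - d * tr(C)^2; a box filter stands in for g."""
    def smooth(a):
        p = np.pad(a, radius, mode="edge")
        out = np.zeros_like(a)
        for y in range(a.shape[0]):
            for x in range(a.shape[1]):
                out[y, x] = p[y:y + 2 * radius + 1, x:x + 2 * radius + 1].mean()
        return out
    a, b, c = smooth(g_h * g_h), smooth(g_v * g_v), smooth(g_h * g_v)
    return a * b - c * c - d * (a + b) ** 2
```

Because the gradient is a ratio of local means rather than a difference, a multiplicative change of the image (the usual model of speckle) scales both half-window averages equally and cancels out, which is why this operator is preferred over differential gradients for SAR imagery.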

Feature Descriptor Construction
For the obtained keypoints, the image patches centered on the keypoint coordinates are selected. Then, we input the extracted image patches to the MatchosNet. The network adopts a twin structure and shares weights. The backbone of MatchosNet is CSP-DenseNet [46]. The network reduces the computation and enhances the learning ability. The network structure is shown in Figure 6.
There are three DenseBlocks with the same structure in the network. The structure of each DenseBlock is shown in Figure 7.


Each DenseBlock consists of 11 layers, including 3 layers of 3 × 3 depth-wise convolutions, 4 layers of 1 × 1 pixel-wise convolutions, 3 layers of channel-wise feature concatenation, and 1 layer of average pooling. The DenseBlock removes redundant layer connections and retains some important layer connections, resulting in improved operation efficiency. Figure 8 illustrates the computation process of the loss function. The figure consists of three components: namely, the descriptors, the distance matrix, and the relative tuples. In the descriptors section, there are a total of n matched descriptor pairs. The distance matrix section represents the distance matrix formed by calculating the L2 distances between all descriptors [21]. The relative tuple section consists of the four-tuple collection computed from the distance matrix. This collection is used for the final calculation of the descriptor loss, following MatchosNet [23]. After extracting feature descriptors using the network, the RANSAC algorithm is used to remove outliers and obtain the point matching result. Finally, the registration results of the pseudo-SAR and SAR images are mapped back to the optical and SAR images.
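The distance-matrix step described above can be sketched as follows. The helper computes all pairwise L2 distances between two descriptor sets, and a mutual-nearest-neighbor rule (one common matching criterion, used here for illustration) keeps only pairs that select each other:

```python
import numpy as np

def l2_distance_matrix(desc_a, desc_b):
    """Pairwise L2 distances between two descriptor sets.

    desc_a: (N, D), desc_b: (M, D)  ->  (N, M) distance matrix.
    """
    sq = (desc_a ** 2).sum(1)[:, None] + (desc_b ** 2).sum(1)[None, :]
    d2 = sq - 2.0 * desc_a @ desc_b.T
    return np.sqrt(np.maximum(d2, 0.0))   # clamp tiny negatives from rounding

def mutual_nearest_matches(dist):
    """Keep index pairs that are each other's nearest neighbour."""
    a2b = dist.argmin(axis=1)
    b2a = dist.argmin(axis=0)
    return [(i, j) for i, j in enumerate(a2b) if b2a[j] == i]

da = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy pseudo-SAR descriptors
db = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy real SAR descriptors
dist = l2_distance_matrix(da, db)
matches = mutual_nearest_matches(dist)
```

In the toy example each descriptor in `da` has an exact counterpart in `db` in swapped order, so the mutual-nearest rule recovers the cross pairing (0, 1) and (1, 0).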


Parameter Analysis
For the registration part, it is necessary to determine the Harris scale space constant c, d, and α 0 . In the registration network, the image patch size, number of training iterations, learning rate, optimizer, and RANSAC threshold need to be determined. The specific parameter values for the c, d, α 0 , image patch size, learning rate, and optimizer are set according to the literature [23,37]. As for the training iteration value and the RANSAC threshold, these rely on empirical values. These values are provided in Section 3.1.2.

Dataset Preparation
The QXS-SAROPT [47] and OSDataset [48] datasets are adopted in this experiment. For the QXS-SAROPT dataset, the SAR image part of the dataset is from the Gaofen-3 satellite with a resolution of 1 m × 1 m, and the optical image part is from Google Earth images. These images cover three port cities: Santiago, Shanghai, and Qingdao, and they contain 20,000 pairs of optical and SAR images. The dataset is split into three parts for training, validation, and testing, with a ratio of 8:1:1. In the training process of the pseudo-SAR generation network, the training and validation sets of this dataset are adopted. The test set is used to test the pseudo-SAR generation network and the registration network. The OSDataset contains 10,692 pairs of optical and SAR images, each with a size of 256 × 256 pixels. These SAR images have the same sensor source and resolution as QXS-SAROPT. The dataset collects scenes from several cities around the world, including Beijing, Shanghai, Suzhou, Wuhan, Sanhe, Yuncheng, Dengfeng, Zhongshan, and Zhuhai in China, Rennes in France, Tucson, Omaha, Guam, and Jacksonville in the United States, and Dehradun and Agra in India. The SAR images from this dataset are used to train the registration network.
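An 8:1:1 split of the 20,000 pairs can be produced by shuffling indices and slicing, as sketched below (illustrative only; the paper does not specify whether the split is random or predefined):

```python
import numpy as np

def split_811(n_pairs, seed=0):
    """Shuffle pair indices and split them 8:1:1 into train/val/test."""
    idx = np.random.default_rng(seed).permutation(n_pairs)
    n_train = int(n_pairs * 0.8)
    n_val = int(n_pairs * 0.1)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_811(20000)
```

Fixing the seed makes the split reproducible across the generation and registration experiments, so both networks are evaluated on the same held-out pairs.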

Parameter Setting
For the pseudo-SAR generation strategy, the total number of iterations for training the original and improved Restormer network is set to 600,000 empirically. Regarding the literature [41], the initial learning rate is set to 3 × 10 −4 and gradually reduced to 1 × 10 −6 using the cosine annealing algorithm. The optimizer is set to Adam.
For constructing the Harris scale space, according to the literature [37], the parameters c, d, and α 0 are set as 2 1/3 , 0.04, and 2 respectively. On the basis of the literature [23], the training optimizer is Adam, the learning rate is set to 1 × 10 −4 , and the size of the image patches is specified to be 64 × 64. Based on the empirical values, the training epoch is set to 100, and the RANSAC threshold is set to 1.
In the ablation experiment, to ensure fairness, the CycleGAN model is trained for 600,000 iterations. The optimizer used is Adam with a learning rate of 3 × 10−4. Additionally, the learning rate is gradually reduced to 1 × 10−6 using the cosine annealing algorithm.
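The cosine annealing schedule used above can be written out explicitly; this is a generic sketch of the schedule (per-step annealing, no warm restarts), not code from the paper:

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=3e-4, lr_min=1e-6):
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

start = cosine_annealed_lr(0, 600_000)        # 3e-4
end = cosine_annealed_lr(600_000, 600_000)    # 1e-6
```

The cosine shape keeps the learning rate near its maximum for the early iterations and decays fastest in the middle of training, which tends to smooth convergence compared with step decay.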

Registration Comparison Method
The comparative experiments used in this paper are as follows.
(1) PSO-SIFT [49]: Building on the existing SIFT method, PSO-SIFT adopts a new gradient definition to eliminate the nonlinear radiation differences between optical and SAR images.
(2) CycleGAN-based registration: This method uses CycleGAN [50] to generate pseudo-optical images from SAR images and then uses SIFT to match the pseudo-optical and optical images to obtain the final registration results. In the ablation experiment, the CycleGAN network and the improved Restormer network are compared within the pseudo-SAR generation strategy. In the registration experiment, we improve this method by converting the optical image into a pseudo-SAR image and replacing SIFT with MatchosNet (denoted CycleGAN + MatchosNet) to better evaluate the registration method proposed in this paper.
(3) MatchosNet: MatchosNet directly registers the optical and SAR images without any pseudo-image generation stage.

Experimental Platform
The platform and environment used in the experiment are shown in Table 1.

(1) Keypoint matching analysis
To evaluate the proposed registration method, this paper compares its performance with the methods discussed in Section 3.1.3. The results of point matching are analyzed in this section. Figure 9 shows the visual results of point matching. In Figure 9(a2), the red box highlights the mismatches produced by PSO-SIFT.
The visual registration results of the rural and road scenes are shown in Figure 9(b1-b4). From the overall comparison of the four methods, it can be observed that our method has a significant advantage in terms of the number of matched points. PSO-SIFT, CycleGAN + MatchosNet, and MatchosNet exhibit fewer matched points. Figure 9(c1-c4) show the registration results of the four methods in an urban scene. It can be noticed that our proposed method still has a significant number of matched points in areas with strong textures. Both CycleGAN + MatchosNet and MatchosNet show more matched points than PSO-SIFT in the urban scene.
Referring to Figure 9, the visual results of the above five scenes show that deep learning-based registration methods demonstrate superior point-matching performance compared to traditional methods, especially in areas with strong textures. This is because deep learning networks can effectively learn the similarity between features of heterogeneous images. The combination of the Transformer-based pseudo-SAR generation strategy and deep learning registration mitigates the majority of feature differences in the pseudo-SAR generation stage; this strategy significantly enhances the robustness of the registration network. Table 2 presents the quantitative evaluation of the registration results. Two quantitative metrics are used here, namely, the number of correctly matched points (NCM) and the root mean squared error (RMSE), to assess the effectiveness of our registration. A larger NCM and a smaller RMSE indicate a better matching effect. The data in Table 2 indicate that the proposed method outperforms the other three methods in terms of the NCM metric across all scenes. Regarding the RMSE metric, our method is slightly inferior to the MatchosNet method only in the rural and road scene, and it achieves a lower RMSE than the other methods in the remaining four scenes. However, in the rural and road scene, MatchosNet achieves only 4 matched point pairs, while our method achieves 84 matched point pairs.
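Given putative matches and a ground-truth transformation, the two metrics above can be computed directly; a minimal sketch, assuming a homography ground truth and a hypothetical inlier threshold `thresh`:

```python
import numpy as np

def ncm_and_rmse(pts_src, pts_dst, H, thresh=3.0):
    """NCM and RMSE for putative matches under a ground-truth homography H.

    pts_src, pts_dst: (N, 2) matched keypoint coordinates. A match counts
    as correct when the transformed source point lands within `thresh`
    pixels of its destination point; RMSE is taken over correct matches.
    """
    pts = np.hstack([pts_src, np.ones((len(pts_src), 1))])
    proj = pts @ H.T
    proj = proj[:, :2] / proj[:, 2:3]          # dehomogenize
    err = np.linalg.norm(proj - pts_dst, axis=1)
    inliers = err < thresh
    ncm = int(inliers.sum())
    rmse = float(np.sqrt(np.mean(err[inliers] ** 2))) if ncm else float('inf')
    return ncm, rmse
```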

(2) Checkerboard image experiment analysis
The visual appearance of the checkerboard pattern is also an essential evaluation criterion for registration results. To further prove the effectiveness of the proposed method, checkerboard image experiments are added on top of the point-matching results, and the experimental results are shown in Figure 10. In Figure 10(a1-a4), the PSO-SIFT method completely fails to register, while MatchosNet exhibits a matching error in the red-boxed region. As shown in Figure 10(b1-b4), the PSO-SIFT method generates incorrect matches in the red-boxed region. In the urban scene, only our method successfully matches the images; the other three methods exhibit mismatches in the red-boxed region, as depicted in Figure 10(c1-c4). From Figure 10(d1-d4), it can be observed that in the farmland scene, the other three methods also exhibit matching errors in the red-boxed region, while the proposed method still achieves successful registration. According to Figure 10(e1-e4), in the mountain scene, the PSO-SIFT method fails to achieve a complete registration, whereas the proposed method accomplishes the registration successfully. These results indicate that deep learning-based methods have advantages in registration, and the proposed Transformer-based pseudo-SAR generation strategy further improves the registration performance between optical and SAR images.
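A checkerboard mosaic interleaves tiles from the two registered images, so any misalignment shows up as broken edges at the tile seams. A minimal sketch of generating such a mosaic (the tile size is illustrative):

```python
import numpy as np

def checkerboard(img_a, img_b, tile=64):
    """Interleave two registered images in a checkerboard pattern.

    Structures crossing the tile seams stay continuous only when the
    registration is accurate, which is why the mosaic is a common
    visual check of registration quality.
    """
    assert img_a.shape == img_b.shape
    h, w = img_a.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    mask = ((yy // tile + xx // tile) % 2).astype(bool)
    out = img_a.copy()
    out[mask] = img_b[mask]
    return out
```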

Ablation Experiment
This section presents a series of ablation experiments, including a validity analysis of the pseudo-SAR generation strategy, a validity analysis of the pseudo-SAR generation strategy for registration, and a validity analysis of the keypoint extraction strategy.

(1) Pseudo-SAR generation strategy validity analysis
In this experiment, the average gradient (AG) is used to evaluate the pseudo-SAR image, and SSIM, PSNR, LPIPS, and MAE are calculated between the pseudo-SAR and real SAR images. The objective evaluation indicator values for the generation results are shown in Table 3. Bold font indicates optimal values.
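Of the metrics listed, AG, PSNR, and MAE have simple closed forms (SSIM and LPIPS require dedicated implementations and are omitted here). A sketch of the three simple ones, assuming 8-bit intensity images:

```python
import numpy as np

def mae(a, b):
    # Mean absolute error between two images
    return float(np.mean(np.abs(a.astype(float) - b.astype(float))))

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio in dB; infinite for identical images
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return float('inf') if mse == 0 else float(10 * np.log10(peak ** 2 / mse))

def average_gradient(img):
    # AG: mean magnitude of horizontal/vertical finite differences,
    # a no-reference sharpness measure (larger means more detail)
    gx = np.diff(img.astype(float), axis=1)[:-1, :]
    gy = np.diff(img.astype(float), axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))
```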
The experimental results show that the improved Restormer outperforms the original Restormer and CycleGAN in terms of the AG, SSIM, and PSNR metrics, and it achieves the best performance in most scenes based on the LPIPS and MAE metrics. Therefore, the experimental results indicate that the improved Restormer outperforms CycleGAN and the original Restormer, and the L2 loss function is superior to the L1 loss function in the pseudo-SAR generation strategy.
The subjective evaluation of this experiment is based on visual inspection. Figure 11 presents a comparison of five different scenes: forest and lake, rural and road, urban, farmland, and mountain. Each scene comparison comprises five images, arranged from left to right: a real optical image, a pseudo-SAR image generated by CycleGAN, a pseudo-SAR image generated by the original Restormer, a pseudo-SAR image generated by the improved Restormer, and a real SAR image. As indicated by the red boxes in Figure 11(a2-e2), the CycleGAN method only renders the style of the SAR image onto the optical image in the rural and road, urban, farmland, and mountain scenes without fully eliminating the feature differences of the targets. The original Restormer and the improved Restormer exhibit better visual effects than CycleGAN. As shown in Figure 11(a3-e3,a4-e4), Restormer produces pseudo-SAR images that more closely resemble real SAR images. Specifically, the improved Restormer provides clearer textures in the generated images than the original Restormer, as depicted in Figure 11(b4,d4,e4), while the original Restormer generates relatively blurred texture details in some areas of the pseudo-SAR images, as shown in the red boxes in Figure 11(b3,d3,e3). In conclusion, in the pseudo-SAR generation strategy, the original Restormer outperforms similar methods, and the improved Restormer further improves the generation effect. These conclusions also show that the L2 loss function has advantages in the field of pseudo-SAR image generation.

Figure 11. Comparison of pseudo-SAR generation strategy. (a1-a5) represent the forest and lake scenes, (b1-b5) represent the rural and road scenes, (c1-c5) represent the urban scenes, (d1-d5) represent the farmland scenes, and (e1-e5) represent the mountain scenes. (a1-e1) are the real optical images, (a2-e2) are pseudo-SAR images generated by CycleGAN, (a3-e3) are pseudo-SAR images generated by the original Restormer, (a4-e4) are pseudo-SAR images generated by the improved Restormer, and (a5-e5) are the real SAR images.
(2) The validity analysis of the pseudo-SAR generation strategy for registration
Figure 12 illustrates a comparison of the registration results between the proposed method, the original Restormer, CycleGAN + MatchosNet, and MatchosNet. Among them, MatchosNet directly registers optical and SAR images, while the proposed method, the original Restormer, and CycleGAN + MatchosNet are two-stage registration methods that involve pseudo-SAR generation and registration. From the results in Figure 12, it can be observed that the proposed method exhibits a more even distribution and a higher number of matching points. The registration results based on the Restormer pseudo-SAR generation strategy outperform the direct registration of optical and SAR images. In the registration methods based on CycleGAN, one scene shows a matching error, as indicated by the red box in Figure 12(d3).
Figure 13a represents the NCM for the four methods; it can be observed that the proposed method achieves the highest NCM value. The original two-stage registration method using Restormer performs well and obtains the second-highest NCM values in most scenes. Figure 13b shows the RMSE for the four methods; it is evident that the proposed method achieves the lowest RMSE in most scenes. These experimental results provide substantial evidence of the effectiveness of the proposed pseudo-SAR generation strategy for image registration.
(3) Keypoint extraction strategy validity analysis
Figure 14 presents the comparison of keypoint extraction strategies for pseudo-SAR and real SAR images. It can be observed that the proposed strategy yields a higher number of keypoint matches without any matching errors. In contrast, the DoG and FAST extraction strategies exhibit fewer matching points, and in some scenes they produce erroneous keypoint matches, as indicated by the red boxes in Figure 14(c2,d2,d3).
In Figure 15, a comparison of three keypoint extraction methods is presented. In Figure 15a, the number of keypoints extracted by the three methods is compared. Figure 15b illustrates the NCM for the three methods.
It can be observed that the proposed keypoint extraction strategy outperforms the other methods in terms of both the number of keypoints selected and the final NCM. The FAST operator extracts fewer keypoints than DoG in most cases but more than DoG in urban areas, which shows that urban areas contain more corners and that FAST extracts keypoints there more easily than DoG. However, in weak-texture areas, the corner feature is not significant, so FAST extracts fewer keypoints. The proposed keypoint extraction strategy takes both shallow and deep features into consideration; it can extract a large number of keypoints in both weak- and strong-texture regions, and it produces more final matched keypoints than FAST and DoG. This provides substantial evidence for the robustness of our keypoint extraction strategy.
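The "extreme points from every layer" idea behind the proposed strategy can be sketched as a per-layer local-maximum scan over a response stack; the 3 × 3 in-layer neighbourhood below is a simplification, and the paper's exact neighbourhood and response function may differ:

```python
import numpy as np

def scale_space_extrema(stack, border=1):
    """Collect per-layer local maxima of a response stack of shape (L, H, W).

    A pixel is kept when it is the strict maximum of its 3x3 neighbourhood
    within its own layer, so every scale-space layer contributes keypoints.
    """
    keypoints = []
    L, H, W = stack.shape
    for l in range(L):
        layer = stack[l]
        for y in range(border, H - border):
            for x in range(border, W - border):
                patch = layer[y - 1:y + 2, x - 1:x + 2]
                # strict maximum: the peak value occurs exactly once
                if layer[y, x] == patch.max() and (patch == layer[y, x]).sum() == 1:
                    keypoints.append((l, y, x))
    return keypoints
```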


Discussion
The registration experimental results from Section 3.2 indicate that deep learning-based methods are more robust compared to traditional point matching methods. Comparing the results of the proposed method with CycleGAN + MatchosNet and MatchosNet methods further demonstrates that the registration method based on the Restormer's pseudo-SAR generation strategy improves the accuracy of deep learning models in the registration process. The registration scheme based on the pseudo-SAR generation strategy can avoid the feature differences between heterogeneous images, making the registration network easier to train.
In the conducted ablation experiments, this paper investigated the effectiveness of Restormer's pseudo-SAR generation and the Harris scale space keypoint extraction strategy. The experimental results demonstrate that both of these strategies outperform similar methods. Specifically, compared to similar methods, the proposed Restormer pseudo-SAR generation strategy exhibits a smaller RMSE and a larger NCM. In contrast to the original method, the L2 loss function is used instead of the L1 loss function, and this improvement achieves better results in the experiments. The proposed keypoint extraction strategy shows a higher number of extracted and matched keypoints. Therefore, relative to other deep learning-based methods, the proposed method has clear advantages.
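The claimed advantage of the L2 loss near the optimum can be illustrated numerically: the L2 gradient shrinks with the residual, so update steps decay as the fit improves, whereas the L1 gradient keeps unit magnitude and continues to oscillate around the optimum:

```python
import numpy as np

residuals = np.array([1.0, 0.1, 0.01, 0.001])
l2_grad = 2 * residuals        # d/dr of r^2: shrinks as the fit improves
l1_grad = np.sign(residuals)   # d/dr of |r|: constant-magnitude steps

for r, g2, g1 in zip(residuals, l2_grad, l1_grad):
    print(f"residual={r:<6} |L2 grad|={g2:<6} |L1 grad|={g1}")
```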
However, in some weak-texture scenes, generating pseudo-SAR images may be challenging, which could be a direction for future research in the field of optical and SAR image registration based on a pseudo-SAR generation strategy. In other fields of research, such as underwater acoustics and sonar [51,52], Transformer-based simulation may be explored for pseudo-SAS (synthetic aperture sonar) imagery generation.

Conclusions
This paper proposes a registration method based on a pseudo-SAR generation strategy. In this approach, Restormer is used to transform an optical image into a pseudo-SAR image. During the training of Restormer, the original loss function is replaced with the L2 loss so that the loss fluctuates less near the optimum, making the model easier to fit. In the registration process, the DoG operator is replaced with the ROEWA operator, which is used to construct the Harris scale space for the pseudo-SAR and real SAR images; this strategy increases both the number of extracted and matched keypoints. The extreme points are extracted in each layer of the Harris scale space and added to the keypoint set. The image patches around the keypoints are extracted and fed to MatchosNet to obtain feature descriptors for initial matching, and the RANSAC algorithm is used to remove outliers to obtain the final matching results. The feasibility and robustness of this method are demonstrated by experiments in comparison with similar methods.