A Robust 3D Density Descriptor Based on Histogram of Oriented Primary Edge Structure for SAR and Optical Image Co-Registration

Abstract: The co-registration of SAR and optical images is a challenging task because of the speckle noise of SAR and the nonlinear radiation distortions (NRD), particularly in the one-look case. In this paper, we propose a novel density descriptor based on the histogram of oriented primary edge structure (HOPES) for the co-registration of SAR and optical images, aiming to describe the shape structure of patches more robustly. To extract the primary edge structure, we develop a novel multi-scale sigmoid Gabor (MSG) detector and a primary edge fusion algorithm, and build the co-registration method on HOPES. To obtain stable and uniform keypoints, the non-maximum-suppressed SAR-Harris (NMS-SAR-Harris) and dividing-grids methods are used; NMS-SSD fast template matching and the fast sample consensus (FSC) algorithm then complete and optimize the matching. On two one-look simulated SAR images, the signal-to-noise ratio (SNR) of MSG is more than 10 dB higher than that of other state-of-the-art detectors, and the binary edge maps and F-scores show that MSG localizes edges more accurately. Compared with other state-of-the-art co-registration methods on seven pairs of test images, the correct match rate (CMR) and the root mean squared error (RMSE) improve by more than 25% and 15% on average, respectively. The experiments demonstrate that HOPES is robust against speckle noise and NRD and can effectively improve the matching success rate and accuracy.


Introduction
Synthetic aperture radar (SAR), as an active microwave imaging system, can acquire images regardless of time of day or cloud cover. However, its imaging characteristics do not conform to human vision, so SAR images cannot be interpreted as easily as optical images. The complementary information of optical and SAR images plays a significant role in GCP extraction [1,2], image fusion [3], change detection [4], etc., and the co-registration accuracy of SAR and optical images directly affects these applications. However, due to the speckle noise, the nonlinear radiometric distortion (NRD), the angular differences, and the radiometric differences between SAR and optical images, co-registration is challenging [5][6][7].
Existing image matching methods divide into area-based and feature-based methods. Area-based methods mainly include normalized cross-correlation (NCC) [8,9], mutual information (MI) [10][11][12], frequency-domain methods [13], etc. They register images using pixel values and a similarity measure. Because they are sensitive to NRD and noise, they are better suited to same-modality images than to multi-sensor images. Feature-based methods aim at mining common features to form descriptors, mainly including point features [14], line features [15], edge features [16], and contour features [17]. Feature-based methods are more robust and can reduce the effects of NRD and noise to a certain extent, so they are better suited to the co-registration of SAR and optical images.
In the traditional image matching field, the scale-invariant feature transform (SIFT) usually performs well [18], and numerous improved algorithms have emerged to further enhance its performance [19][20][21]. However, due to the speckle noise and NRD, the traditional SIFT algorithm does not work well on SAR images. To address this problem, F. Dellinger et al. proposed SAR-SIFT, redefining the gradient extraction part of SIFT using the ratio of exponentially weighted averages (ROEWA) to improve the robustness to speckle noise; it shows high performance in SAR image co-registration [22]. For SAR and optical image co-registration, Xiang et al. proposed OS-SIFT, which uses multi-scale ROEWA for SAR and multi-scale Sobel for optical images in gradient extraction [23]. Yu et al. combined spatial feature detection with local frequency-domain description and proposed an improved nonlinear SIFT-based co-registration framework [24]. All of these methods improve the robustness of SIFT-like algorithms on SAR images.
In recent years, robust feature descriptors have been widely used in SAR and optical image co-registration. These methods incorporated several feature descriptions to improve the robustness against speckle noise and NRD, and significantly improved the accuracy and success rate of co-registration. Fan et al. proposed the uniform nonlinear diffusion-based Harris corner point extraction (UND-Harris) and phase consistent structure descriptors (PCSD) to match SAR and optical images [25]. Ye et al. proposed the histogram of orientation phase consistency (HOPC), which combines the advantages of histogram of oriented gradient (HOG) and phase consistency (PC) to achieve multimodal image matching [26]. They subsequently integrated the MMPC-lap feature detector with the local histogram of oriented phase congruency (LHOPC) to further increase performance [27]. Xiang et al. investigated the energy minimization method and the higher-order singular value decomposition method in the PC algorithm, improved the gradient extraction method of the traditional PC method, and further improved the robustness of the PC method in SAR and optical matching [28]. Xiong et al. proposed the rank-based local self-similarity (RLSS) to describe the local shape of an image and used RLSS descriptors of multiple local regions to construct dense-RLSS to complete the matching [29]. Li et al. constructed a maximum index map (MIM) with Log-Gabor to describe features and achieved rotation invariance by multiple MIM, and they refer to this as the radiation-invariant feature transform (RIFT) [30].
With the development of deep learning techniques, many researchers have tried to construct feature descriptions by using deep learning. He et al. proposed a remote sensing image matching technique based on a Siamese convolutional neural network to learn features and similarity metrics [31]. Hughes et al. constructed SAR and optical image matching networks using semi-supervised learning, aiming to overcome dataset limitations [32]. Zhang et al. constructed a general workflow for multimodal remote sensing image co-registration based on a Siamese fully convolutional network, adopting the strategy of maximizing the feature distance between positive and negative samples [33]. They then used the Siamese convolutional network to learn pixelwise deep dense features to further improve the robustness of the network [34]. Dou et al. designed a generic matching network (GMN) based on the generative adversarial network (GAN) to increase the amount of training data and improve the matching performance [35]. Hughes et al. used GAN networks to generate very high resolution (VHR) SAR image blocks to achieve VHR SAR and optical matching [36]. Song et al. constructed MAP-Net to extract high-level semantic information by using spatial pyramid aggregated pooling (SPAP) and an attention mechanism to complete cross-modal image matching [37].
However, although deep learning methods are more likely to achieve leading performance, they are limited by datasets and find it difficult to break through in generalization ability due to factors such as resolution, shooting angle, and polarization. Therefore, traditional algorithms are still more widely used at this stage. The SIFT-like algorithms are somewhat limited by area features and are inefficient on large images. Algorithms based on robust feature descriptors usually achieve better results. However, in the case of one-look SAR images, the strong speckle noise causes an offset in edge localization, which degrades co-registration performance. It is difficult to balance the edge localization ability against the noise suppression ability, so co-registration performance is limited on full-resolution images.
In order to balance the edge localization ability against the noise suppression ability in one-look SAR images and improve co-registration performance, in this paper, we propose a novel 3D density feature descriptor based on primary structure information to achieve SAR and optical co-registration. The primary edge structure is defined as the edge response where the contrast changes significantly, which usually manifests as the edge structure of an object that differs markedly from the surrounding ground objects. The primary edge reflects the structural features of nearby ground objects and has a high similarity between the optical and SAR images. Correspondingly, a weak edge has a low edge response value, usually generated by noise or inconspicuous object edges, and exhibits significant differences between SAR and optical images. The proposed method is feature-based and consists of three steps: keypoint detection, feature extraction, and feature matching. Compared with other algorithms, the main contributions of the proposed method are summarized as follows:
• The primary edge extraction operator, multi-scale sigmoid Gabor (MSG), is proposed to improve noise rejection while maintaining edge localization. MSG is compared against the traditional ratio of exponentially weighted averages (ROEWA) operator [38] and the state-of-the-art ratio-based edge detector (RBED) [39] and unbiased difference ratio (UDR) detector [40]. Compared with these detectors, MSG achieves a higher signal-to-noise ratio (SNR) and better edge positioning performance.
• Based on the primary edge structure information, we propose a novel 3D density feature descriptor, the histogram of oriented primary edge structure (HOPES). It encodes robust multi-angle structural information of SAR and optical images. Compared with the traditional method MI and the state-of-the-art methods CFOG and HOPC, HOPES obtains a sharper similarity measure map.
• In the co-registration process, the non-maximum-suppressed SAR-Harris (NMS-SAR-Harris) and dividing-grids methods are used to obtain stable and uniform keypoints; the HOPES descriptor extracts the features of the keypoints; the NMS-SSD fast template matching method matches the features; and the fast sample consensus (FSC) algorithm [41] removes outliers.
The remainder of this paper is organized as follows: In Section 2, we firstly propose the multi-scale sigmoid Gabor filter and the primary edge fusion algorithm; and then, we introduce the HOPES in detail; finally, we introduce the co-registration process, including keypoint extraction, and NMS-SSD fast matching. In Section 3, the performance of MSG, the accuracy and robustness of HOPES matching, and the effect of parameters on performance are evaluated and discussed experimentally. Finally, conclusions are provided in Section 4.

Multi-Scale Sigmoid Gabor Filter (MSG)
Due to the noise and NRD, using edge structure features is a more robust co-registration approach for SAR and optical images. However, extracting edge features robustly and accurately is a great challenge. Noise and low-contrast areas produce weak edges, which differ between SAR and optical images and are unsuitable as matching features. In contrast, the primary edges reflect the edge structure of strongly contrasted regions and have higher similarity with the optical image. In order to obtain primary edges and suppress the influence of speckle noise, a multi-scale sigmoid Gabor filter (MSG) is designed.
Mehrotra et al. demonstrated that the odd-symmetric part of the Gabor filter is an efficient and robust edge detection operator [42], and the two-dimensional Gabor filter at angle θ is defined as: where ω is the frequency of the sine function and σ is the scale of the Gaussian function. Figure 1 gives the two-dimensional Gabor filters at θ = 0 and θ = π/4. Two adjacent, non-overlapping Gabor windows are given by the following equation: The local means µ1 and µ2 of both windows are obtained by convolving each filter with the grayscale of the original image: Then we obtain the result of the single-scale Gabor edge detector: In the Gabor filter defined by Equation (1), σ reflects the scale property of the filter. We build a scale space of n scales S_n = {σ1, · · · , σn}, where the relationship between two adjacent scale parameters is σi+1/σi = k. Thus, the scale-edge strength map (S-ESM) and scale-edge direction map (S-EDM) can be obtained: where ESM_n and EDM_n are given by: The ESM at different scales is shown in Figure 2. At a small scale, the Gabor filter localizes edges better but is more sensitive to noise; at a large scale, it localizes edges less precisely but suppresses noise more effectively. Therefore, a fusion method is needed that suppresses noise while maintaining edges. A low response of the Gabor filter is generally attributed to weak edges, which are usually caused by noise or unsharp edges and need to be suppressed. We therefore introduce a sigmoid function to build the multi-scale sigmoid Gabor filter (MSG), which reduces the output where the filter response is low. A measure of filter response spread at a single scale can be generated by dividing the sum of responses by the highest response. Normalizing by the number of scales being used then yields a fractional measure of spread that varies between 0 and 1.
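The single-scale detector described above can be sketched in a few lines of NumPy/SciPy. This is an illustrative reconstruction, not the authors' implementation: the exact half-window definition, the normalization, and the way µ1 and µ2 are combined into an edge response (here a ROEWA-style ratio) are our assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def odd_gabor_halves(sigma, omega, theta, half=9):
    """Split the odd-symmetric 2D Gabor filter at angle theta into two
    adjacent, non-overlapping smoothing windows (one per side of the edge).
    The split-by-sign construction is a sketch; the paper's exact window
    definition may differ."""
    ax = np.arange(-half, half + 1)
    x, y = np.meshgrid(ax, ax)
    # rotate coordinates to the filter orientation
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.sin(omega * xr)
    w1 = np.where(g > 0, g, 0.0)   # one side of the edge
    w2 = np.where(g < 0, -g, 0.0)  # the other side
    return w1 / w1.sum(), w2 / w2.sum()

def single_scale_edge_strength(img, sigma, omega, theta):
    """Local means mu1, mu2 on both sides of the oriented window,
    combined as a ratio-style edge response in [0, 1]."""
    w1, w2 = odd_gabor_halves(sigma, omega, theta)
    mu1 = convolve(img.astype(float), w1)
    mu2 = convolve(img.astype(float), w2)
    r = np.minimum(mu1 / (mu2 + 1e-9), mu2 / (mu1 + 1e-9))
    return 1.0 - r  # high where the two local means differ strongly
```

A vertical step edge then yields a strong response along the discontinuity and a near-zero response in flat regions, which is the behavior the S-ESM relies on.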
This spread is given by: where N is the total number of scales, A_max(x) is the maximum value of the filter response at that scale, and ε avoids division by zero. The weight at this scale can then be expressed as: where c is the cutoff value of the filter; edge responses below it are penalized to a certain extent. γ is the gain factor that controls the sharpness of the cutoff: increasing γ strengthens the suppression of weak edges and highlights primary edges, but an excessively large γ tends to lose edge features, as shown in Figure 3. The weights in the scale space are therefore expressed as: The primary edge fusion algorithm in the scale space is described as follows. For true edge points, the weights in the scale space are: While the weights at weak edges decrease significantly as the scale increases, the weights at primary edges remain relatively constant across scales. We therefore choose the minimum-weight strategy to suppress noise and refine edges. Figure 4 shows the effect of different weight strategies; the minimum-weight strategy obtains the least noise and the best edge localization.
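The sigmoid weighting and minimum-weight fusion can be illustrated with a minimal NumPy sketch. The per-scale spread (response relative to the maximum over scales) and the max-over-scales fused response are our simplifications; the paper's exact equations for s(x), the weights, and the fused ESM may differ in detail.

```python
import numpy as np

def per_scale_weights(esm_stack, c=0.4, gamma=6.0, eps=1e-9):
    """esm_stack: (N, H, W) edge strength maps over N scales.
    Each scale's response is compared to the per-pixel maximum over
    scales, then passed through the sigmoid 1/(1+exp(gamma*(c-s))):
    responses below the cutoff c are penalized (gain gamma)."""
    a_max = esm_stack.max(axis=0)            # per-pixel max over scales
    s = esm_stack / (a_max + eps)            # per-scale relative response
    return 1.0 / (1.0 + np.exp(gamma * (c - s)))

def fuse_min_weight(esm_stack, weights):
    """Minimum-weight fusion: primary edges keep near-constant weights
    across scales, while weak/noisy edges drop at large scales, so
    taking the minimum weight suppresses them."""
    return np.min(weights, axis=0) * esm_stack.max(axis=0)
```

A pixel that responds strongly at every scale (a primary edge) keeps a fused value close to its raw response, while a pixel that responds only at the finest scale (noise) is strongly attenuated.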

The fused ESM is obtained from the following equation:

Histogram of Oriented Primary Edge Structure
Inspired by the histogram of oriented gradient (HOG), we develop a 3D density descriptor called the histogram of oriented primary edge structure (HOPES) using the primary edge structure produced via MSG. Since the images have been coarsely adjusted, the same point coordinates can be used to extract the template window and search window separately. Figure 5 gives the main flow of constructing HOPES 3D density descriptor.
where ESM_n^k denotes the ESM of the k-th orientation at the n-th scale. Performing the primary edge fusion algorithm on the S-ESM at each orientation, the primary edge feature map under the orientation space (O-PEFM) is obtained as: where ESM_fusion^k denotes the primary edge obtained by the primary edge fusion algorithm at the k-th orientation. Convolving the O-PEFM with a 3D Gaussian-like kernel g yields the 3D density structure descriptor HOPES. This 3D Gaussian-like kernel consists of a two-dimensional Gaussian kernel in the X and Y directions and a one-dimensional Gaussian kernel in the Z direction. The convolution in the X and Y directions reduces noise, while the convolution in the Z direction smooths the oriented gradient, reducing the directional distortion produced by local geometric and intensity distortion. To demonstrate the role of the Z-direction Gaussian convolution, we add multiplicative noise with a mean of 0 and a variance of 0.06 to a high-resolution optical image to simulate the noise distribution of a SAR image, and add distortion to simulate its directional distortion. We compare the gradient direction histograms of HOPES without Z-direction convolution, with the [1, 2, 1]^T convolution kernel [43], and with Gaussian kernels of σ = 1, size = 5 and σ = 3, size = 11 (see Figure 6); their statistical variances are given in Table 1. The gradient direction histogram obtained using a Gaussian kernel for the Z-direction convolution is the most similar to the original one. Note that a larger Gaussian kernel and σ achieve stronger smoothing but are more likely to over-smooth and lose the original gradient direction histogram features. Therefore, in practical applications, we choose a Gaussian kernel with σ = 1, size = 5.
Finally, L2 regularization is applied along the Z-direction to overcome the grayscale difference between SAR and optical images, yielding the final HOPES descriptor. HOPES is constructed using all pixels, so it can be considered a density descriptor. To generate the structure descriptor, HOPES computes a histogram of primary edges over M directions centered on each pixel. To visualize the descriptor (Figure 7a), we divide it into n × n subregions, accumulate the gradient directions within each subregion, count the gradient values of all directions in all subregions, and normalize them; taking the magnitudes as the sizes of the gradient vectors and the directions as their orientations gives the visualized HOPES descriptor. Figure 7b shows that HOPES obtains distinct gradient directional features at edges, while the gradient direction is more balanced at noise, which has no significant directionality.
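Since the 3D Gaussian-like kernel is separable, the descriptor smoothing and the final L2 normalization can be sketched with three 1D Gaussian passes. The wrap boundary on the orientation axis (bin 0 and bin M−1 being adjacent angles) is our assumption; the σ = 1, size = 5 choice for the Z-direction follows the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_and_normalize(opefm, sigma_xy=1.0, sigma_z=1.0, eps=1e-9):
    """opefm: (H, W, M) oriented primary-edge feature map (M orientation
    bins). X/Y smoothing reduces noise; Z smoothing over orientation bins
    reduces directional distortion; per-pixel L2 normalization along Z
    overcomes the SAR/optical grayscale difference."""
    d = gaussian_filter1d(opefm.astype(float), sigma_xy, axis=0)
    d = gaussian_filter1d(d, sigma_xy, axis=1)
    # orientation axis is circular, hence mode='wrap' (our assumption)
    d = gaussian_filter1d(d, sigma_z, axis=2, mode='wrap')
    # L2 normalization along the orientation (Z) axis, per pixel
    norm = np.sqrt((d**2).sum(axis=2, keepdims=True)) + eps
    return d / norm
```

After this step every pixel's M-bin orientation histogram has unit L2 norm, so the SSD comparison in the matching stage is insensitive to a global gain difference between the two modalities.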

Co-Registration Algorithm
The flowchart of the proposed algorithm is shown in Figure 8. The algorithm consists of three steps: keypoint detection, feature extraction, and feature matching. The feature extraction step has been introduced above, so we next introduce keypoint extraction (step 1) and feature matching (step 3).

Keypoints Extraction
Firstly, geographic information is used to coarsely register the SAR and optical images. The initial registration is crucial to the entire registration process, and the incidence angle affects the registration results. We use two GF-3 images of the same region with different orbital directions and incidence angles to show that increasing the incidence angle decreases the registration success rate and accuracy. This is because a larger incidence angle shifts the feature information and increases the structural feature difference; an excessively large incidence angle may also cause objects to be obscured. The input SAR image is therefore an orthoimage coarsely adjusted by the RD or RPC model with a digital elevation model (DEM); the DEM maintains the stability of the primary structure. The coarse orthoimage eliminates most of the displacement, rotation, and distortion. Next, keypoints are extracted on the SAR image by the dividing-grids or NMS-SAR-Harris algorithm. The dividing-grids method aims at evenly distributed points: we divide the m × n image uniformly along the row and column coordinates every x pixels to obtain the grid points. NMS-SAR-Harris applies non-maximum suppression on top of SAR-Harris: it selects the point with the most significant response in a region as the keypoint, which effectively avoids duplicated points and improves the matching efficiency, as shown in Figure 9.
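Both keypoint strategies are simple to sketch. The response map below is assumed to come from a SAR-Harris corner detector (not implemented here), and the window-maximum formulation of the suppression is our simplification of NMS-SAR-Harris.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def grid_keypoints(height, width, step, margin=0):
    """Dividing-grids keypoint extraction: uniformly spaced points every
    `step` pixels on rows and columns; `margin` keeps template windows
    inside the image."""
    ys = np.arange(margin, height - margin, step)
    xs = np.arange(margin, width - margin, step)
    return [(int(y), int(x)) for y in ys for x in xs]

def nms_keypoints(response, radius):
    """Non-maximum suppression of a corner response map (e.g. SAR-Harris):
    keep a point only if it is the strict maximum of its (2r+1)^2 window
    and has a positive response."""
    local_max = maximum_filter(response, size=2 * radius + 1)
    mask = (response == local_max) & (response > 0)
    ys, xs = np.nonzero(mask)
    return list(zip(ys.tolist(), xs.tolist()))
```

The grid variant guarantees spatial uniformity; the NMS variant guarantees that no two keypoints fall within the same suppression window, which removes duplicated points.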

Feature Matching
The sum of squared differences (SSD) can be used as a similarity measure. Due to the complexity of the 3D density descriptor, we transform the spatial convolution of the SSD into point multiplication in the frequency domain via the FFT to reduce the computational complexity [43]. Non-maximum suppression is also introduced to obtain matching results with higher confidence. Suppose the HOPES descriptors of the reference template window and the search window are D_l and D_r; then the SSD of D_l and D_r is defined as: Expanding the above equation: The templates achieve optimal matching when the SSD metric is minimized. In the expansion, ∑(x,y) D_l^2(x, y) and ∑(x,y) D_r^2(x, y) are constant terms, so we instead maximize ∑(x,y) D_l(x, y)D_r((x, y) − (x', y')): where (x', y') denotes the translation between the two windows. To speed up the computation, the correlation D_l(x, y)D_r((x, y) − (x', y')) is transferred to the frequency domain by the Fourier transform: where F and F^{−1} denote the Fourier transform and its inverse, and * denotes the complex conjugate. Using the above method, the translation (x', y') can be calculated. Inevitably, the matching results may exhibit multiple peaks, as in Figure 10a.
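The frequency-domain correlation can be sketched directly with NumPy's FFT. This is an illustrative reconstruction under our assumptions (zero-padded template, per-orientation-channel correlation summed over channels); the peak of the returned map gives the translation (x', y').

```python
import numpy as np

def ssd_similarity_map(d_l, d_r):
    """d_l: (h, w, M) template descriptor; d_r: (H, W, M) search-window
    descriptor. Because the squared terms of the SSD are constant,
    minimizing SSD reduces to maximizing the cross-correlation, computed
    here per orientation channel as F^-1( F(search) * conj(F(template)) )
    and summed over channels."""
    H, W, M = d_r.shape
    sim = np.zeros((H, W))
    for k in range(M):
        tpl = np.zeros((H, W))
        tpl[:d_l.shape[0], :d_l.shape[1]] = d_l[:, :, k]  # zero-pad
        sim += np.real(np.fft.ifft2(np.fft.fft2(d_r[:, :, k]) *
                                    np.conj(np.fft.fft2(tpl))))
    return sim

# usage: dy, dx = np.unravel_index(np.argmax(sim), sim.shape)
```

This replaces an O(H·W·h·w) spatial search with M FFT pairs of O(H·W·log(H·W)), which is where the speed-up over naive SSD comes from.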
In order to improve the matching accuracy and robustness, non-maximum suppression is used to select the maximum similarity point; we call this NMS-SSD. Firstly, the pixel values of the similarity measure map are sorted in decreasing order, and the first N extreme points are selected as seed points. Each coordinate (x_i, y_i) is taken as the upper-left point of a window, (x_i^tl, y_i^tl), and the pixel value is recorded as the matching score. Given the search window size, the lower-right corner of the window is (x_i^rb, y_i^rb). The window of each seed point is compared with the window of the highest matching score (the main peak): if the overlapping area ratio is larger than the discriminant threshold N_t, as in Figure 10b, the seed is removed; if it is smaller than N_t, as in Figure 10c, it is kept as a sub-peak. If sub-peaks exist and the ratio of the primary to the secondary peak is greater than the threshold t, or only the primary peak exists, the primary peak is used as the matching point. NMS-SSD thus further improves the matching confidence.
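The peak-screening logic can be sketched as follows; the function name and default parameter values are ours, and the square-window overlap ratio is a simplification of the paper's window comparison.

```python
import numpy as np

def nms_ssd_accept(sim, win, n_seeds=5, overlap_thresh=0.5, peak_ratio=1.2):
    """Take the top-N extrema of the similarity map as seeds, drop seeds
    whose square window (side `win`) overlaps the main peak's window by
    more than overlap_thresh, and accept the match only if the main peak
    dominates the surviving sub-peaks by peak_ratio (or none survive).
    Returns (main_peak_coords, accepted)."""
    flat = np.argsort(sim.ravel())[::-1][:n_seeds]
    seeds = [np.unravel_index(i, sim.shape) for i in flat]
    scores = [float(sim[p]) for p in seeds]
    main = seeds[0]

    def overlap(p, q):
        dy = max(0, win - abs(p[0] - q[0]))
        dx = max(0, win - abs(p[1] - q[1]))
        return (dy * dx) / float(win * win)

    # sub-peaks: seeds whose window does NOT mostly overlap the main peak
    sub_scores = [s for p, s in zip(seeds[1:], scores[1:])
                  if overlap(main, p) <= overlap_thresh]
    if not sub_scores:
        return main, True                      # only the primary peak
    ok = scores[0] / (max(sub_scores) + 1e-9) > peak_ratio
    return main, ok
```

A match with a dominant, isolated main peak is accepted, while a multi-peaked map whose secondary peak is too close in score is rejected, which is how the ambiguous cases of Figure 10a are filtered out.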

Experimental Results and Discussion
In this section, we evaluate the edge extraction performance of the MSG operator using two simulated SAR images with added multiplicative noise. We also compare the similarity measure map of HOPES with those of the traditional method MI and two state-of-the-art methods, CFOG and HOPC. Then, seven pairs of SAR and optical images are used to test HOPES and analyze the results. The co-registration performance is evaluated both objectively, via the evaluation criteria, and subjectively, via a chessboard mosaic image and enlarged submaps. Finally, the influence of different parameters on the performance is experimentally evaluated and analyzed.

Comparison of Edge Extraction
In the proposed matching algorithm, the accurate acquisition of edge information is the key to successful matching. Therefore, in this section, we compare the proposed MSG with existing popular SAR edge detection algorithms: the traditional detector ROEWA [38] and two state-of-the-art detectors, RBED [39] and UDR [40].

Datasets and Parameters Settings
Firstly, two sets of one-look simulated images are used for the experiment. To simulate SAR images, we added speckle noise so that they have a noise distribution similar to a true SAR image. Figure 11a is a simulated one-look SAR image of size 1000 × 600. Figure 11c is a simulated one-look SAR image of size 1500 × 1500; in this image, detection is considerably harder because of the transition-zone size and edge contrast. The ground truth (GT) maps are shown in Figure 11b,d, which clearly indicate the edge, non-edge, and transition regions. In the GT map, the black line represents the true location of the edge; the eight-neighborhood of the true edge is the transition region (the white area); the remaining black region is the non-edge region. The detectors ROEWA, RBED, and UDR are denoted h_ROEWA(x, y), h_RBED(x, y), and h_UDR(x, y), respectively: In the following experiments, all possible combinations of parameters in the parameter space suggested by the original algorithms are tried to obtain the best results. Furthermore, the parameter space of MSG is:
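One-look multiplicative speckle is commonly simulated with unit-mean gamma noise; a minimal sketch is below. The gamma model (L = 1 gives exponentially distributed noise, the hardest one-look case) is a standard choice, but the paper's exact noise generator is not specified, so this is an assumption.

```python
import numpy as np

def add_speckle(img, looks=1, seed=0):
    """Multiplicative speckle for an L-look intensity image: gamma noise
    with shape L and mean 1. looks=1 simulates the one-look case used in
    the paper's experiments (generator details are our assumption)."""
    rng = np.random.default_rng(seed)
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=img.shape)
    return img * noise
```

Because the noise has unit mean, the expected intensity of each region is preserved while the per-pixel variance matches the heavy speckle of a one-look image.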

Evaluation Criteria
Signal-to-noise ratio (SNR): We define the edge and transition regions in Figure 11b,d as the signal S and the other regions as the noise N, and use Equation (24) to calculate the SNR: SNR = 10 log10(S/N) (24). F-score: To compare the edge localization performance of the different algorithms, we use NSHT post-processing to obtain binary edges. Edge pixels are counted as true positives (TP) if the detector extracts them in the edge or transition regions, and as false positives (FP) if they are reported in the non-edge regions. Non-edge pixels reported in the edge region are counted as false negatives (FN), and non-edge pixels reported in the non-edge region as true negatives (TN). Equation (25) is then used to calculate the performance of the detection operator.
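Both criteria are straightforward to compute from the GT masks; a sketch follows. Equation (24) is implemented as stated; for Equation (25) we assume the standard F-score, i.e. the harmonic mean of precision and recall.

```python
import numpy as np

def snr_db(esm, signal_mask):
    """Equation (24): energy of the edge strength map inside the
    edge+transition regions (signal) vs the rest (noise), in dB."""
    s = esm[signal_mask].sum()
    n = esm[~signal_mask].sum()
    return 10.0 * np.log10(s / n)

def f_score(tp, fp, fn):
    """F-score from the TP/FP/FN counts defined above (standard
    harmonic-mean form, assumed for Equation (25))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

With these definitions, an edge strength map that concentrates its energy in the edge and transition regions scores a high SNR, and a binary edge map that is both complete and free of spurious detections scores a high F-score.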

Results and Discussion
The edge strength maps obtained by ROEWA, RBED, UDR, and MSG are shown in Figure 12, and their SNRs in Figure 13. Figure 14 gives the binary edge images after the NSHT algorithm, and Table 2 gives the F-scores of the four detectors. In the two sets of test data, ROEWA, RBED, and UDR can all extract edges; however, producing a sharp edge strength map leaves them inevitably affected by noise, while adjusting the parameters to suppress the noise thickens the edges, which harms the subsequent registration task. The SNR of MSG is significantly better than that of the other algorithms, showing stronger edge response and noise suppression. An even higher SNR can be obtained by increasing γ_MSG, but this loses some edge features through over-suppression of low-contrast edges. The F-score results show that MSG localizes edges better than the other operators on one-look simulated SAR images. Among the four algorithms, MSG detects the most complete binary edges, as shown in Figure 14 and Table 2.

Comparison of the Feature Descriptors
We use the traditional algorithm MI [10], two state-of-the-art methods, HOPC [26] and CFOG [27], and HOPES to compare the four feature descriptors. Figure 15 shows the similarity measure maps of CFOG, HOPC, MI, and HOPES; the same template size and search radius are used for each group. HOPES achieves correct matching on all four sets of data: it produces sharper and more distinct peaks at the matching point and is smoother in the non-matching areas. Data 1 is a water pond; because of the obvious temporal difference between the SAR and optical images, the template image and the search image differ, but HOPES uses the primary edge features of the template area to avoid the effect of the temporal difference and thus achieves the co-registration. CFOG and HOPC have a higher matching success rate than MI because they use the structural information of the images, but they exhibit multiple peaks. In comparison, HOPES is more robust.

Comparison of SAR and Optical Co-Registration
In this subsection, we test seven sets of SAR and optical images with HOPC, CFOG, SAR-SIFT, OS-SIFT, and our proposed algorithm. SAR-SIFT and OS-SIFT are SIFT-like algorithms; HOPC and CFOG represent histograms of phase congruency and of gradients, respectively. Subjective and objective criteria are used to evaluate the performance of the co-registration algorithms.

Datasets and Parameters Settings
In our experiment, to compare the co-registration performance of the different algorithms, we select seven groups of SAR and optical images differing in region, resolution, and time, as shown in Figure 16; Table 3 lists the information of each image pair. The SAR images are from the GF-3 satellite; the optical images are collected from Google Earth, mosaicked from different satellite images. Pair A is a mountainous area with an average elevation of about 1500 m; apart from generally similar mountain-range trends, the images vary widely and pose a challenge for co-registration. Pair B is an urban area with large radiometric differences. The surface features of Pair C and Pair F are fish ponds and saltworks, which contain numerous repetitive texture features that can interfere with co-registration. Pair D is a tropical river area with large radiometric differences. Pair E contains a large lake area, with an urban area in the upper right. Pair G is a plateau area with an average altitude of more than 4500 m, a resolution of 1 m, and large image differences, which is a big challenge for co-registration. All these images contain some temporal differences.
We compare our algorithm with CFOG, HOPC, SAR-SIFT, and OS-SIFT. We apply both the NMS-SAR-Harris and dividing-grids methods to extract keypoints in the HOPES matching algorithm, in order to compare the keypoint extraction methods, and apply the NMS-SAR-Harris method to extract keypoints in CFOG and HOPC. We keep the relevant NMS-SAR-Harris parameter settings consistent across these three algorithms to ensure the same number of keypoints. As SIFT-like algorithms, SAR-SIFT and OS-SIFT follow the settings recommended by their original authors. All algorithms use the FSC algorithm for post-processing to eliminate outliers; the model is set to an affine transformation with an RMSE threshold of 3. The parameters of HOPES are set to σ_MSG = 1.4, k_MSG = 1.4, γ_MSG = 6, scale parameter N_Scale = 3, and orientation M = 8.

Evaluation Criteria
As subjective evaluation criteria, the checkerboard mosaic image and enlarged subimages are displayed to observe the effect and details of the image co-registration. We also use the following criteria to analyze the performance objectively and quantitatively.
Root mean squared error (RMSE): We select 10~20 pairs of corresponding points manually to estimate the affine transformation matrix H; the RMSE is then calculated as in [44], Equation (26). Correct match rate (CMR): This counts the rate of successful matches. It should be noted that SAR-SIFT and OS-SIFT differ from the HOPES, CFOG, and HOPC algorithms: the SIFT-like algorithms adopt a global search strategy and their initial keypoints differ from the others, so we do not count CMR for them. CMR is defined by Equation (27), where N_1 and N_2 are the numbers of keypoints extracted from the two images; in our algorithm, N_1 = N_2.
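Both criteria can be computed with a few lines of NumPy. The RMSE form below (mean squared residual of the manually checked correspondences under the estimated 2×3 affine matrix H) is a standard formulation assumed for Equation (26); the denominator min(N_1, N_2) in the CMR is likewise our assumption for Equation (27).

```python
import numpy as np

def rmse(h_affine, pts_src, pts_dst):
    """RMSE of correspondences under a 2x3 affine matrix H:
    sqrt( mean ||H [x; y; 1] - (x_dst, y_dst)||^2 )."""
    src = np.hstack([pts_src, np.ones((len(pts_src), 1))])
    proj = src @ h_affine.T
    err = np.linalg.norm(proj - pts_dst, axis=1)
    return float(np.sqrt(np.mean(err**2)))

def cmr(n_correct, n1, n2):
    """Correct match rate: correct matches over the keypoint count
    (N1 = N2 in the proposed algorithm; min() is our assumption)."""
    return n_correct / min(n1, n2)
```

A lower RMSE indicates a more accurate affine estimate, and a higher CMR indicates that a larger fraction of the extracted keypoints survived matching and outlier removal.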

Results and Discussion
The co-registration results are shown in Figure 17, and the statistics of the six co-registration strategies are shown in Table 4. Our HOPES achieves the best CMR on all seven data sets and the best RMSE on six. There is little difference between the NMS-SAR-Harris and dividing-grids strategies. The two SIFT-like algorithms, SAR-SIFT and OS-SIFT, achieve the worst results. As a co-registration algorithm designed for SAR images, SAR-SIFT can rarely complete the co-registration between optical and SAR images. OS-SIFT improves the gradient extraction of SAR-SIFT, making it more competent for optical and SAR co-registration: it can complete the co-registration in textured scenes such as lakes and ponds, but the registration accuracy is not high. Both algorithms are also time-consuming on images of larger size and richer texture because of their global search. To visually compare the co-registration results, Figures 18-24 give the mosaic board map after HOPES matching and local zooms of the board maps obtained by each successful co-registration algorithm. Next, we analyze and discuss the co-registration results for each pair of images.
In Pair A, due to the terrain and the squint of SAR, the imaging results are very different from the optical image. At the same time, in areas with large topographic changes, the DEM-based orthophoto has smear, which interferes with feature extraction. SAR-SIFT and OS-SIFT fail to register. CFOG (Figure 18c) and HOPC (Figure 18d) show a certain offset, and the offset of CFOG is larger. HOPES (Figure 18b), which uses primary edges to construct primary structural features, has the highest accuracy and the smallest offset, consistent with the RMSE results in Table 4. In Pair B, SAR-SIFT and OS-SIFT fail to register, the offset of CFOG (Figure 19c) is the largest, and there is little difference between HOPC (Figure 19d) and HOPES (Figure 19b); HOPC achieves the best RMSE in Table 4. In Pair C, due to the rich texture details, all algorithms complete the co-registration.

To compare the efficiency of the above algorithms, we test them on Pair C, Pair E and Pair F and count the running time of each algorithm, as listed in Table 5. The experiment is conducted on a laptop with an Intel Core i7-10875H processor and 32 GB of memory. Among the three data sets, the CFOG algorithm has the best running time, because CFOG computes only single-scale gradient information with relatively low operational complexity; however, it has the lowest co-registration accuracy among the first three algorithms. HOPC has a higher computational complexity due to the phase congruency algorithm. HOPES adopts a multi-scale fusion algorithm, which increases the computational complexity but also brings a higher co-registration accuracy and success rate. The SIFT-like algorithms adopt a global search strategy and use high-dimensional descriptors, so their computational complexity is the largest. It should be noted that the number of feature points extracted by the SIFT-like algorithms is not the same as for the first three algorithms, so they are not strictly comparable.
Counting the average running time per keypoint for the first three algorithms shows an approximately linear relationship between the running time of each algorithm and the number of keypoints. Extracting deeper feature information is the key to improving the co-registration accuracy and success rate, but it also increases the computational complexity, which is a tradeoff.
To summarize, the SIFT-like algorithms such as SAR-SIFT and OS-SIFT are more sensitive to nonlinear radiation differences. They can successfully register images with obvious boundaries and texture (such as lakes, rivers, etc.), but they are not suitable for the significant nonlinear radiation differences caused by radiation distortion. HOPC uses the phase congruency model, with its illumination and contrast invariance, to construct the histogram of phase congruency. As an improved algorithm of HOPC, CFOG uses gradient information to construct the histogram of gradient direction. Experiments show that both are more robust to nonlinear radiation differences than the SIFT-like algorithms, but their structural feature extraction methods are not optimized for SAR images. In practical applications, especially for one-look SAR images, the speckle noise of SAR greatly affects the success rate and accuracy of co-registration. Compared with the above four algorithms, our HOPES adopts a multi-scale primary edge fusion algorithm, constructs more robust and higher-SNR primary edge structure features, and retains the edge positioning ability. Therefore, HOPES achieves better RMSE and NCM and is more robust across different scenes.

Comparison of Parameter Settings
In order to compare the effects of σ MSG , γ MSG and N Scale on the HOPES co-registration performance, we choose Pair C for testing. First, the scale factor is set to k MSG = 1.4, the number of scales to N Scale = 4, the number of orientations to M = 8, γ MSG = 1, 3, 5, ..., 15, and σ MSG = 1.4, 1.6, 1.8, 2.0; the statistics of RMSE and NCM are shown in Figure 25a,b. When σ MSG and γ MSG are small, both the RMSE and NCM are poor due to the influence of SAR speckle noise. As γ MSG increases, the primary edge fusion function comes into play, providing a higher edge localization ability and suppressing the speckle noise, so that the RMSE gradually decreases and the NCM gradually increases. However, as γ MSG increases further, some edges are excessively suppressed, resulting in the loss of structural features, which makes the RMSE increase and the NCM decrease. There is also a tradeoff in the choice of σ MSG : with a too small σ MSG , the influence of noise cannot be satisfactorily reduced; in contrast, a too large σ MSG over-smooths the edges and degrades the localization accuracy. As for N Scale , a smaller number of scales results in a larger noise impact, and a larger number of scales produces a coarser edge structure, both of which are detrimental to co-registration.
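The parameter study above amounts to a grid search over (σ MSG , γ MSG ) with RMSE and NCM as the objectives. A schematic loop (the `register_pair` evaluation function is a placeholder for running the full HOPES pipeline on a test pair; it is not part of the paper):

```python
import itertools

def parameter_sweep(register_pair, sigmas, gammas):
    """Evaluate co-registration quality over a (sigma, gamma) grid.

    register_pair(sigma, gamma) is assumed to return (rmse, ncm) for the
    test pair. Returns the best setting (lowest RMSE, then highest NCM)
    together with the full result grid.
    """
    results = {}
    for sigma, gamma in itertools.product(sigmas, gammas):
        results[(sigma, gamma)] = register_pair(sigma, gamma)
    # Prefer small RMSE; break ties by large NCM (hence the negation).
    best = min(results, key=lambda k: (results[k][0], -results[k][1]))
    return best, results
```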

Conclusions
In this study, we propose a primary structure extraction algorithm to extract the primary edges of SAR images; based on this, we develop a SAR and optical co-registration algorithm called HOPES to overcome the difficulties caused by strong speckle noise and complex NRD.
We design a primary structure extraction algorithm, including the multi-scale sigmoid Gabor (MSG) filter, the primary edge fusion algorithm and the minimum weight strategy, to suppress speckle noise and obtain the primary edge structure information in SAR images, especially one-look SAR images. The test results on simulated SAR images show that the primary structure extraction algorithm obtains an edge strength map with higher SNR and better edge positioning accuracy. The proposed co-registration method is composed of NMS-SAR-Harris and dividing grids for keypoint extraction, the HOPES structural feature descriptor, NMS-SSD fast template matching, and FSC outlier removal. The NMS-SAR-Harris and dividing-grids methods obtain keypoints with obvious features and uniformly distributed keypoints, respectively. HOPES is a histogram of primary structure based on MSG; it is a 3D density structural feature that reflects the structure of the matching region. NMS-SSD fast template matching and the FSC algorithm further improve the confidence of the matching results. Image co-registration experiments show that the proposed co-registration method is robust to the speckle noise and NRD between SAR and optical images, and it effectively improves the success rate and accuracy of matching.
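As a rough illustration of the matching stage, SSD template matching over a descriptor map can be written as an exhaustive search for the offset with minimum sum of squared differences. The sketch below is a naive single-channel stand-in: the actual NMS-SSD method matches 3D HOPES descriptor volumes and accelerates the SSD computation, which is not reproduced here.

```python
import numpy as np

def ssd_match(search: np.ndarray, template: np.ndarray):
    """Exhaustive SSD template matching on a single-channel map.

    Returns the top-left (row, col) offset minimizing the SSD and the SSD
    value itself. A simplified stand-in for NMS-SSD fast template matching.
    """
    th, tw = template.shape
    sh, sw = search.shape
    best, best_pos = np.inf, (0, 0)
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            ssd = np.sum((search[y:y + th, x:x + tw] - template) ** 2)
            if ssd < best:
                best, best_pos = ssd, (y, x)
    return best_pos, best
```

In practice this exhaustive form is quadratic in the search area; frequency-domain or integral-image formulations give the same minimum far faster, which is the role of the fast matching step in the pipeline.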