Exploiting High Geopositioning Accuracy of SAR Data to Obtain Accurate Geometric Orientation of Optical Satellite Images

: Accurate geopositioning of optical satellite imagery is a fundamental step for many photogrammetric applications. Considering the imaging principle and data processing manner, SAR satellites can achieve high geopositioning accuracy. Therefore, SAR data can be a reliable source for providing control information in the orientation of optical satellite images. This paper proposes a practical solution for an accurate orientation of optical satellite images using SAR reference images to take advantage of the merits of SAR data. Firstly, we propose an accurate and robust multimodal image matching method to match the SAR and optical satellite images. This approach includes the development of a new structural-based multimodal applicable feature descriptor that employs angle-weighted oriented gradients (AWOGs) and the utilization of a three-dimensional phase correlation similarity measure. Secondly, we put forward a general optical satellite imagery orientation framework based on multiple SAR reference images, which uses the matches of the SAR and optical satellite images as virtual control points. A large number of experiments not only demonstrate the superiority of the proposed matching method compared to the state-of-the-art methods but also prove the effectiveness of the proposed orientation framework. In particular, the matching performance is improved by about 17% compared with the latest multimodal image matching method, namely, CFOG, and the geopositioning accuracy of optical satellite images is improved, from more than 200 to around 8 m.


Introduction
Benefiting from the fast development of space transportation technology and sensor manufacturing technology, the acquisition capacity of remote sensing data has been largely enhanced [1]. Currently, it has formed a complete air-space-ground remote sensing data acquisition system which has the characteristics of being efficient and diverse (multi-platform, multisensor, multiscale) and having a high resolution (spectrum, spatial, time), providing massive unstructured remote sensing data [2]. Different types of sensors have different imaging principles and properties; therefore, different sensor data usually reflect different aspects of the ground objects and are good in different application areas. For example, optical imagery is widely applied in surveying and mapping disaster management [3], and in change detection [4]; SAR imagery is used for surface subsidence monitoring [5], DEM/DSM generation [6], and target recognition [7]; and infrared imagery is applied in soil moisture inversion [8] and forest fire monitoring [9]. Although the current processing and application technology for a single type of sensor data has reached a high level, a single type of sensor only encodes the electromagnetic wave information of a specific band. SAR orthophoto and the optical satellite image, and these tie points are taken as GCPs to improve the orientation accuracy of optical satellite images [39,40]. Reinartz et al. [41] used adaptive mutual information to match the SAR and optical satellite images, and the matches were employed to optimize the original RPCs. The results showed that the geometric positioning accuracy of the optical satellite is largely improved and reaches within 10 m. However, this method does not consider the complex terrain and cannot work with multiple SAR reference images. Merkle et al. [42] used a large number of pre-registered TerraSAR-X and PRISM image datasets to train a Siamese neural network and used the network model to extract the corresponding points between the SAR and optical satellite images. The matching points were employed to optimize the sensor model, enhancing the geo-localization accuracy. However, this method cannot handle optical satellite and SAR images with large offsets and cannot work on large optical satellite images due to the limited GPU memory.
The significant difference in acquisition principles, viewing perspectives, and wavelength responses between SAR and optical satellite images makes finding reliable correspondences when matching optical and SAR images challenging [42]. More specifically, SAR sensors acquire data in a sideways-looking manner, which may cause typical geometric distortions such as foreshortening, occlusion, and shadowing, especially for the areas with large height changes. In contrast, optical satellite images are acquired in the nadir view or small fixed slanted angles to provide high geometric accuracy. Consequently, the same object may have a completely different appearance on SAR and optical satellite images. Moreover, these two types of sensors receive and apply electromagnetic waves in different bands. The radar signal wavelength is at the level of centimeters, while that of optical signals is in nanometers. The different lengths of waves encode different aspects of object properties, therefore leading to significant intensity differences between SAR and optical satellite images for the same object. Additionally, the signal polarization, roughness, and reflectance properties of the object surface may also influence the wavelength responses greatly. In addition, the speckle effect on SAR images further increases the matching difficulty of optical satellite images. Figure 1 shows the difference between an optical and a high-resolution SAR image for a selected scene containing manufactured structures and vegetation. In this paper, we propose an improved multimodal image matching descriptor angleweighted oriented gradient (AWOG) for SAR and optical satellite images, obtaining a robust and accurate matching performance. To describe the image features, we calculated a multi-dimensional feature vector lying on multiple uniform distribution orientations based on the horizontal and vertical gradient maps g x and g y for each pixel, as with the recent multimodal image matching method CFOG [43]. Note that the range of [0 • , 180 • ) was divided into several evenly distributed orientations, and we calculated the gradient direction and magnitude based on g x and g y , found the two orientations nearest to the image gradient direction, and distributed the gradient magnitude to these two orientations based on their weights. In this way, the image gradient value is only allocated to the two most related orientations rather than all orientations, increasing the distinguishability of feature vectors. During the process, we opened a window image on the whole image and assigned the two weighted amplitudes of each pixel in the window image to the corresponding orientation elements of the feature vector. After all the gradients of pixels in the window image were assigned, the feature vector of the target pixel was obtained and normalized to reduce the influence of the nonlinear intensity difference. In addition, we introduced a quick lookup table named the feature orientation index (FOI), which converts the inefficient calculation of each pixel into matrix operations and helps to assign the weighted values to their corresponding elements of the feature vector, significantly improving the computation efficiency. Additionally, in terms of metric similarity, we utilized phase correlation (PC) [44] rather than the metrics in the spatial domain to speed up the matching process, but we omitted the normalization operation of PC since the descriptor was normalized during the feature vector generation.
In addition, we propose a precise geometric orientation framework for raw optical satellite images based on SAR orthophotos. Specially, we first collected all available SAR images of the study area and performed overlap detection between the SAR images and the target optical satellite image. Only the SAR images with large overlaps were retained as the reference image sets. Then, we created image pyramids for the optical satellite images and all the reference SAR images and detected feature points using the SAR-Harris feature detector [45]. To ensure the uniformity and high discrimination of the obtained feature points simultaneously, we introduced an image block strategy. After that, we transferred the point features into their corresponding ground points and reprojected the ground points onto the optical satellite images. The reprojected locations on the satellite image were taken as the initial correspondences for the feature points of SAR images, which were optimized using AWOG. We further conducted error elimination to remove the unreliable matches. We deemed the retained matches to be correct and took them as virtual control points (VCPs). Finally, we took the VCPs and the optical satellite image as input and performed RFM-based block adjustment to achieve an accurate image orientation of the optical satellite images.
In summary, the main contributions of our work are as follows: 1.
We propose a robust feature descriptor AWOG to match SAR and optical satellite images and introduce a PC-based matching strategy, obtaining stable and reliable matching point pairs with higher efficiency.

2.
We put forward a framework for an accurate geometric orientation of optical images using the VCPs provided by SAR images.

3.
Various experiments on SAR and optical satellite image datasets verify the superiority of AWOG to the other state-of-the-art multimodal image matching descriptors. Compared with CFOG, the correct matching ratio is improved by about 17%, and the RMSE of location errors is reduced by about 0.1 pixels. Additionally, the efficiency of the proposed method is comparable to CFOG and about 80% faster than the other state-of-the-art methods such as HOPC, DLSS, and MI.

4.
We further prove the effectiveness of the proposed method for the geometric orientation of optical satellite images using multiple SAR and optical satellite image pairs. Significantly, the geopositioning accuracy of optical satellite images is improved, from more than 200 to around 8 m.
This paper is structured as follows: Section 2 introduces some related works on SAR and optical image matching. Section 3 presents the proposed matching method for SAR and optical satellite images. Section 4 introduces a general orientation framework of optical satellite images using SAR images. Section 5 presents the experiments and analysis concerning the proposed method. Section 6 discusses several vital details. Finally, we conclude our work and discuss future research directions in Section 7.

Related Work
Current SAR and optical satellite image matching studies can be roughly divided into feature-based and area-based methods [46].
Feature-based methods detect salient features on images and match them based on similarity. The features can be seen as a simple representation of the whole image, thus being robust to geometric and radiometric changes, and can be classified into corner features [47,48], line features [49,50], and surface features [51]. Towards matching SAR and optical satellite images, Fan et al. [52] proposed the SIFT-M algorithm by extracting SIFT [53] features on both images, constructing distinct feature descriptors from multiple support regions, and introducing a spatially consistent matching strategy to match these feature points. Compared to SIFT, SIFT-M obtains more matches and has higher efficiency and accuracy. Based on SIFT-M, Xiang et al. [54] further proposed the OS-SIFT algorithm. They detected feature points in a Harris-scale space built on the consistent gradients of both SAR and optical satellite images and computed the histograms of multiple image patches to enhance the distinguishability of feature descriptors, increasing the robustness and performance of SIFT-M. Salehpour et al. [55] proposed a hierarchical method based on the BRISK operator [56]. They used an adaptive and elliptical bilateral filter to remove the speckle noise on SAR images and introduced a hierarchical approach to utilize the local and global geometrical relationship of BRISK features. This method has high accuracy, but the obtained matches are few and not well distributed over the full image. To improve the robustness against the significant nonlinear radiation distortions of SAR and optical images, Li et al. [57] proposed a radiation-variation insensitive feature transform (RIFT) method. This approach first detects feature points on the phase congruency map [58], which considers the repeatability and number, and then constructs a maximum index map (MIM) based on the log-Gabor convolution sequence. Moreover, they achieved rotation invariance by analyzing the value changes of the MIM. The results showed that RIFT performs well and is stable on multimodal image matching. Based on RIFT, Cui et al. [59] further proposed SRIFT. They first established a multiscale space by using a nonlinear diffusion technique. Then, a local orientation and scale phase congruency (LOSPC) algorithm was used to obtain sufficiently stable key points. Finally, a rotation-invariant coordinate (RIC) system was employed to achieve rotation invariance. Compared to RIFT, SRIFT obtains more matches and strengthens the robustness against scale change. Wang et al. [60] proposed a phase congruency-based method to register optical and SAR images. This method consists of employing a uniform Harris feature detection method and a novel local feature descriptor based on the histogram of the phase congruency orientation. The results showed that the ROS-PC is robust to nonlinear radiometric differences between optical and SAR images and can tolerate some rotation and scale changes. To increase the discriminability of the classical local self-similarity (LSS) descriptor [61], Sedaghat et al. [62] proposed a distinctive order-based self-similarity descriptor (DOBSS) based on the separable sequence. To fulfill multimodal image matching, they first extracted UR-SIFT [63] feature points from both images, constructed DOBSS descriptors, and finally utilized cross-matching and consistency checks in the projective transformation model to find the correct correspondences. This method not only obtains more matches but also shows better robustness and accuracy than SIFT and LSS. Beyond the methods based on point features, Sui et al. [64] proposed a multimodal image matching method by combing iterative line feature extraction and Voronoi integrated spectral point matching, increasing the matching reliability. However, this method requires a large number of lines in the scene. Hnatushenko et al. [65] proposed a variational approach based on contour features for co-registration of SAR and optical images in agricultural areas. They performed a special pre-processing of SAR and optical images and achieved co-registration by transforming it into a constrained optimization problem. This method is quite efficient and fully automatic. Xu et al. [66] proposed a surface feature-based method that employs iterative level set-based area feature segmentation to take advantage of the surface features. However, this method can only be applied to images containing significant regional boundaries, such as lakes and farmland. Even though the features have the advantage of flexibility and robustness on multimodal images, the repeatability is relatively low due to the large geometric deformations, and nonlinear intensity changes [67,68]. For example, the state-of-the-art method RIFT only has an inlier ratio of about 15%, decreasing the performance of feature-based methods in matching SAR and optical satellite images.
The area-based methods realize image matching by opening a matching window on the original images and utilizing a similarity metric to find the matches [69]. Normalized cross-correlation (NCC) [70] and mutual information (MI) [71] are two popularly used metrics in the matching of optical satellite image pairs. However, NCC cannot handle the matching of SAR and optical images in most cases due to the nonlinear intensity deformation. On the contrary, MI describes statistical information of the image and has good robustness to nonlinear change. However, the MI method does not consider the influence of neighborhood pixels, making it easily fall into the local extremum. Moreover, MI is sensitive to the image window size and requires a large amount of calculation. Suri and Reinartz [72] analyzed the performance of MI on the registration of TerraSAR-X and Ikonos imagery over urban areas and found that the MI-based method can successfully eliminate large global shifts. Apart from MI, which works on multimodal image matching, another method is to reduce the multimodal images into a uniform form. The phase correlation (PC) algorithm developed by Kuglin and Hines [44] can be used to transform both SAR and optical satellite images into the frequency domain. This approach shows some robustness against nonlinear intensity distortion and has high efficiency. To explore the inherent structure information of SAR and optical satellite images, Xiang et al. proposed the OS-PC algorithm [73] by combining robust feature representations and three-dimensional PC, which exhibits high robustness to radiometric and geometric differences.
Some researchers noticed that even though large nonlinear intensity differences exist between SAR and optical satellite images, the morphological region features remain [74,75]. Therefore, these researchers first extracted the image conjugated structure and then conducted matching of the structural image, increasing the matching performance. Li et al. [76] used the histogram of oriented gradients (HOG) descriptor [77] to express the image structure and used NCC as the measured metric, obtaining good matching results. Based on HOG, Ye et al. [78] proposed the histogram of oriented phase congruency (HOPC) feature descriptor by replacing the gradient feature with a phase congruency feature, obtaining good performance on the matching of SAR and optical satellite images. Zhang et al. [79] applied HOPC to a combined block adjustment framework for optical satellite stereo imagery which is assisted by spaceborne SAR images and laser altimetry data. Ye et al. [80] further proposed a similarity metric, DLSC, using the NCC of a novel dense local selfsimilarity (DLSS) descriptor which is based on shape properties. Since DLSS captures the self-similarities within images, it is robust to large nonlinear intensity distortions. However, HOG, HOPC, and DLSS only construct descriptors for a few silent features. Thus, they are both sparse representations and omit some available and important structure information. Aiming at solving this issue, Ye et al. [43] further proposed the channel features of oriented gradients (CFOG). CFOG computes the adjacent structure feature pixel by pixel, enhancing the ability to describe the structure details. In terms of the similarity metric, they applied the convolution theorem and Fourier transform to deduce the sum of squared differences (SSD) measure from the spatial domain to the frequency domain, achieving a fast calculation. Ye et al. [81] used CFOG to fulfill the co-registration of Sentinel-1 SAR and Sentinel-2 optical images and evaluated the performance of various geometric transformation models, including polynomials, projective models, and RFMs. However, CFOG only computes the gradient maps in the horizontal and vertical directions and uses them to interpolate the gradients in all the other orientations, which brings ambiguity and decreases the accuracy.
With the development in the field of deep learning, some researchers have explored the application of deep learning-based techniques to fulfill the matching of SAR and optical satellite images. Merkle et al. [82] investigated the possibility of cGANs (conditional generative adversarial networks) in this task. They first trained conditional generative adversarial networks to generate SAR-like image patches from optical satellite images and then achieved the matching of SAR and optical images using NCC, SIFT, and BRISK under the same image model. Following the same thought, Zhang et al. [83] employed a deep learning-based image transfer method to eliminate the difference between SAR and optical images and applied a traditional method to match the multimodal remote sensing images. Ma et al. [84] proposed a two-step registration method to register homologous and multimodal remote sensing images. They first computed the approximate relationship with a deep convolutional neural network, VGG-16 [85], and then adopted a robust local feature-based matching strategy to refine the initial results. Similarly, Li et al. [86] applied a two-step process to estimate the rigid body rotation between SAR and optical images. They first utilized a deep learning neural network called RotNET to coarsely predict the rotation matrix between the two images. After that, a better matching result was obtained using a novel local feature descriptor developed from the Gaussian pyramid. Zhang et al. [87] set up a Siamese fully convolutional network and trained it with a loss function that maximizes the feature distance between the positive and negative samples. Mainly, they built a universal pipeline for multimodal remote sensing image registration. Hughes et al. [88] developed a threefold approach for the matching of SAR-optical images. They first found the most suitable matching regions using a goodness network, then built a correspondence heatmap with a correspondence network, and finally eliminated the outliers by an outlier reduction network. Even though these methods obtained better results than the competing traditional geometric-based methods in their experimental results, current deep learningbased methods still suffer from the following problems. Firstly, there is no multimodal remote sensing dataset large enough to train a deep learning neural network for general use. Secondly, the deep learning-based methods bring challenges to the computer resources, especially the GPU, and the efficiency is relatively low under most computer setups. Finally, the current learning-based methods cannot be applied to full-scale satellite remote sensing images. These demerits restrict the practical application of deep learning techniques in the matching or registering of multimodal remote sensing images.

AWOG Descriptor
Since the gradient information can describe the image structure well and is robust to nonlinear intensity differences, we also constructed our multimodal image descriptor AWOG based on image gradients as with the other state-of-the-art image descriptors. However, unlike those descriptors that build sparse feature descriptors only for silent features such as HOPC [78] and DLSS [80], AWOG constructs dense descriptors for each pixel, increasing the description ability of image details. Even though CFOG [43] is also based on dense descriptors, it uses the horizontal and vertical gradients to interpolate the gradients in multiple other directions, which decreases the distinctiveness of the features. On the contrary, AWOG uses an angle weighting strategy to distribute the gradient value only into the two most related orientations, largely increasing the distinguishability of the feature descriptors and further improving the matching reliability of SAR and optical satellite images.
AWOG first calculates the image gradients as the basis for the subsequent calculation. As displayed in Figure 2, the horizontal gradient g x and vertical gradient g y are firstly calculated using the Sobel operator [89], and then the gradient magnitude and gradient direction of the image are calculated based on Equations (1) and (2). Since there may be an intensity inversion between SAR and optical satellite images, we conducted gradient direc-tion modification [43,[90][91][92] to improve the resistance against such intensity appearance. Significantly, the gradient direction whose values are greater than 180 is subtracted by 180, as in Equation (3).
where GM and GD are the gradient magnitude and direction, respectively. In order to make full use of the gradient information, we first divided the range of [0 • , 180 • ) into n subranges evenly by a few discrete and continuous orientation values (as shown in Figure 3); the number of values was the same as the size of the feature vector; and the index numbers of these values corresponded to those of the feature orientations used for building the feature vector. Then, we introduced a feature orientation index table (FOI) to speed up the following calculation process. For a single pixel, its gradient direction value must lie in a certain subrange, and we named the indexes of the lower and upper bounds of the subrange for a pixel as ILB and IUB, respectively. Note that IUB is always one greater than ILB. The FOI table can be obtained from collecting the ILB of all pixels, which can be calculated from Equation (4).
where the symbol · means to take an integer not greater than the value in it. Note that the feature vector was initialized as 0, and we utilized the angle difference of the gradient direction with the two feature orientations corresponding to the orientation indexes ILB and IUB of each pixel as weights and assigned the gradient magnitude onto the two feature orientations based on the calculated weights. The smaller the angle difference between the gradient direction and a predefined feature orientation, the larger the gradient magnitude value assigned to that feature orientation.
Furthermore, we utilized adjacent information to reduce the influence of nonlinear intensity distortions between SAR and optical satellite images. To improve the calculation efficiency, we introduced two weighted gradient magnitude images, WGML and WGMU, which can be obtained by collecting the weighed values of the gradient magnitude distributed to the lower and upper feature orientations corresponding to ILB and IUB of each pixel, respectively. The calculation can be expressed as the following equation: where W L and WU are the weighted values, and WGML and WGMU are the weighted gradient magnitude images.
During processing, we took the target pixel as the center to simultaneously open up statistical windows of m × m pixels on WGML and WGMU and assign the values of each pixel in the windows to the corresponding elements of the feature vector with the FOI. The statistical process is similar to voting, where all the feature vector elements are used as ballot boxes, and the values of WGML and WGMU are used as the vote value. The statistical process is shown in Figure 4. For a specific element x i of the feature vector v of the target pixel, it can be seen as the sum of two parts. The first part x L i is the sum of all pixel values with ILB equal to i in the WGML statistical window, and the second part x U i is the sum of all pixel values with IUB equal to i in the WGMU statistical window, which can be expressed by Equation (6). The feature descriptor is obtained after finishing the calculation for all predefined feature orientations of the feature vector. If we arrange the feature vector at each pixel along the z-axis direction perpendicular to the image plane, a 3D image cube can be formed, which is easy to be displayed.
where x i means a specific element of the feature vector, and x L i and x U i are calculated by statistical processing. To reduce the influence of orientation distortions caused by local deformations, we conducted convolution on the feature 3D cube with a kernel of d z = [1, 3,1] T in the z direction [43]. After that, we further normalized the feature vector using the L 2 norm with Equation (7).
In summary, the whole pipeline of AWOG is shown in Figure 5.

Phase Correlation Matching
Traditional similarity measures in the spatial domain such as NCC are very timeconsuming for high-dimensional feature vectors. Thus, we utilized the 3D phase correlation (PC) [73] based on the frequency to match the image features, improving the calculation efficiency. Since AWOG conducts normalization during construction, we no longer need to normalize the cross-power spectrum as the traditional PC does.
The core idea of PC is the Fourier shift property which converts the offset of image coordinates into a linear phase difference in the frequency domain [93]. Generally, PC first transfers the image into the frequency domain by the Fourier transform and calculates the cross-power spectrum. If A 1 (x, y, z) and A 2 (x, y, z) are the AWOG descriptors of the reference image and the input image, respectively, then Given that F(u, v, w) and G(u, v, w) are obtained by conducting 3D fast Fourier transform of A 1 (x, y, z) and A 2 (x, y, z), Equation (9) can be calculated according to the Fourier shift theorem.
where → d is the 3D unit vector. By conducting some deformations, the cross-power spectrum can be calculated as follows: where * indicates the complex conjugate. Through analyzing the inverse Fourier transform of the cross-power spectrum Q(u, v, w), a correlation function centered at (x 0 , y 0 ) can be obtained. Additionally, the peak value of the function always appears at (x 0 , y 0 ); thus, the corresponding point can be found by locating the maximum value, which can be expressed as follows: where F −1 (·) denotes the inverse transform of the 3D fast Fourier transform, and δ(x − x 0 , y − y 0 ) denotes the correlation function, which is centered at (x 0 , y 0 ). In order to testify the performance of PC, we conducted experiments on a pair of high-resolution optical satellite and SAR images that cover the same urban area and have a large nonlinear intensity difference and compared our results with those of NCC. As shown in Figure 6, we can see that the results of PC and NCC with the AWOG descriptor both only have one single peak, which means they all find the correct matches. However, the running time for PC is 0.01 s, while that of NCC is 3.2 s. Therefore, the use of PC as a similarity measure not only keeps the high accuracy of NCC but also largely improves the efficiency.

A General Orientation Framework of Optical Satellite Images Using SAR Images
In this section, we put forward a general framework for the automatic orientation of optical satellite images based on SAR orthophotos. Firstly, we perform overlap detection between the optical satellite image and SAR images to find the qualified images as the reference SAR image set. Then, the image pyramids are constructed for all the SAR and optical satellite images. The SAR-Harris feature point detector, combined with an image dividing strategy [94], is utilized on the pyramid images to retain enough high-quality features across the entire image. After that, the SAR image and DEM data are used for rough geometric correction to coarsely eliminate the rotation and scale differences, and AWOG is applied on the rectified images to obtain accurate corresponding points, which are taken as virtual control points (VCPs). Finally, RFM-based block adjustment [95] is carried out for the optical satellite image by using the VCPs, and the accurate orientation parameters are output. The specific steps are as follows: (1) The available SAR images of the target study area are collected, and the helpful reference SAR image set is built through overlap detection. We check the corresponding ground cover of SAR images and the optical satellite image, and only the SAR images with overlapping ground areas are retained as reference images. Specially, we read the metadata of the SAR reference image to obtain the coordinates of the four image corners and other affiliated information and project the four corners onto the ground. In this way, the longitude and latitude ranges of the ground covering of the SAR reference image are obtained. For the optical satellite image, we can directly obtain the coordinates of the ground covering by reading the metadata of the optical satellite image. If there are overlaps between the two ground coverings, we deem these two images overlapped; otherwise, there is no overlap between these two images.
(2) Image pyramids are constructed for the optical satellite and reference SAR images, and high-quality feature points are detected on the SAR images. If the resolution difference between the reference image and the optical satellite image is not more than double, the image pyramid layer and the zoom ratio between different levels are set as 4 and 2, respectively. Suppose the resolution difference between the images is large. In that case, a certain number of pyramid image layers are added to the higher-resolution image according to the multiple of the resolution difference, and a lookup table is established to match the images between the levels with a smaller resolution difference. Then, the SAR-Harris corner detector [45] is employed to detect feature points on each layer of the pyramid image of the reference SAR image. Additionally, a block strategy [94] is also applied to ensure the uniformity of the point distribution. All the subsequent steps start with the top layer pyramid image.
(3) There are always large geometric deformations between the optical satellite and SAR images. Therefore, we first perform image reshaping (as shown in Figure 7) to coarsely eliminate the rotation and scale difference and then use AWOG to match the images. First, the template window is determined on the reference SAR image with the obtained feature point as the center, and the ground area corresponding to the template window is acquired based on the approximate DEM of the study area, which can be provided by SRTM. To obtain the rough matching area, we reproject the ground area onto the optical satellite image with the RFM. Furthermore, an affine transformation model between the searching area of the optical satellite image and the template window of the reference image is established. The searching area of the optical satellite image is resampled using the bilinear interpolation method according to the established transformation model in order to eliminate the rotation and scale difference. At last, AWOG is applied to match the roughly geo-rectified image with the reference SAR image, and the RANSAC algorithm is utilized to eliminate unreliable matches. (4) The RFM-based affine transformation compensation model [95] is performed to improve the orientation accuracy of the original RPCs, and this process is completed using block adjustment with multiple reference SAR images. The affine transformation model has been widely applied to compensate the deviation of RPCs for high-resolution optical satellite images, which can be expressed as follows: where x 0 and y 0 are the column and row of tie points in the optical satellite image, and a i and b i (i = 0, 1, 2) are the unknown affine model parameters. In particular, a 0 and b 0 compensate the row direction and column direction errors on the image caused by the attitude and position inaccuracy in the satellite moving and scanning directions of the CCD linear sensor, respectively; a 2 and b 1 compensate the error caused by the drifts of the inertial navigation systems and GPS, respectively; and a 1 and b 2 compensate the error caused by an inaccurate image interior orientation. In order to obtain the corresponding initial positions of the image points in the SAR reference image on the optical satellite image, we first need to obtain the geographic coordinates (λ p , φ p ) of the image point in the SAR reference image, which can be calculated as follows: After obtaining the corresponding ground points of the feature points in the reference image, the back-projection location of these ground points on the optical satellite image can be calculated using the RPCs and DEM of the target area, which are provided by the SRTM in this study.
where x p oc and y p oc are the normalized image coordinates of the tie point p computed by the forward rational functions of the optical satellite image. λ p , φ p , h p are the normalized longitude, latitude, and altitude of the tie point p.
For each tie point p on the optical image, through combing Equations (12) and (14), the block adjustment can be described as Finally, the optimal estimations of affine transformation parameters a i and b i are obtained by iterative least squares adjustment.
(5) The processing status is checked frequently. If the bottom layer of the pyramid image has completed the calculation, the final affine correction transformation model together with the original RPCs is output as the accurate orientation result. Otherwise, we transmit the calculation results of the current layer of the pyramid to the next layer until the bottom layer. The flowchart of this orientation framework is shown in Figure 8.

Experiments and Analysis
In this section, we first verify the superiority of our proposed image matching method over state-of-the-art multimodal image matching methods and then prove the effectiveness and robustness of the proposed orientation framework of satellite optical images using SAR reference images.

Experimental Datasets
To testify the performance of AWOG, six optical-to-SAR image pairs covering the same scene were unitized. Considering the significant image property difference between LiDAR and optical images, we also used two LiDAR-to-optical image pairs to verify the robustness of our method. These image pairs were acquired at different imaging times and cover rich image contents. Figure 9 presents all the experimental image pairs. All the image pairs were pre-rectified to the same spatial resolution, and there are no large scale, rotation, and translation differences. However, there are still large geometric deformation and noticeable nonlinear intensity changes between the image pairs. The detailed information of the experimental images is shown in Table 1.

. Evaluation Criteria
Three commonly used criteria were employed to indicate the matching performance, which are the correct matching ratio (CMR), root mean square error (RMSE), and average running time (t). The correct matching ratio describes the matching success rate, which can be calculated by dividing the number of correct matches by all obtained matches. To determine the correct matches, we first manually selected 50 obvious points from both images as checkpoints and calculated the transformation between images. Then, the matches whose location errors are smaller than 1.5 pixels were deemed correct matches. RMSE represents the matching accuracy of correct matches. The smaller the value of RMSE, the higher the accuracy. t is the average running time on all experimental image pairs, which reflects the matching efficiency.

Comparison with the State-of-the-Art Multimodal Image Matching Methods
To verify the effectiveness and superiority of the proposed method, we compared our method with three state-of-the-art methods, namely, DLSS [80], HOPC [78], and CFOG [43]. In addition, MI [71] was also utilized as a baseline considering its good robustness against nonlinear intensity changes. AWOG and CFOG calculate the descriptor pixel by pixel; therefore, they can be matched by using PC or SSD. Meanwhile, HOPC and DLSS only calculate the descriptor in a sparse sampled grid, meaning they can only be matched with the similarity measures in the spatial domain such as SSD; PC does not work on them. To evaluate the descriptors comprehensively and fairly, we matched AWOG and CFOG with PC and used SSD as the similarity measure for all four descriptors. For convenience, we named the combinations of AWOG and PC, CFOG and PC, AWOG and SSD, CFOG and SSD, HOPC and SSD, and DLSS and SSD as AWOG PC , CFOG PC , AWOG SSD , CFOG SSD , HOPC SSD , and DLSS SSD , respectively.
Before matching, 200 SAR-Harris feature points were firstly detected on the reference image, and then the points were described and matched with different approaches. For MI, DLSS SSD , HOPC SSD , CFOG SSD , and AWOG SSD , a template window image with sizes from 25 × 25 to 91 × 91 pixels was opened on the reference image, and the correspondences were found through moving the template window in searching areas from 46 × 46 to 111 × 111 pixels. For AWOG PC and CFOG PC , they were directly matched with the template window image without sliding. Figure 10 presents the results of all comparative methods with respect to the CMR. Overall, we can see that the proposed method AWOG PC performs the best, while MI obtains the worst results, and the reason could be that MI does not consider the adjacent information during matching, which may fall into a local optimum.
The results of DLSS SSD , HOPC SSD , CFOG SSD , and AWOG SSD provide a fair comparison of the feature descriptors employed directly and fairly, considering that they all utilize the same similarity measure, SSD. DLSS SSD is easily affected by the properties of the image content, and it performs worse on image pair 6 with mountain coverage and image pair 8 that contains a textureless LiDAR depth map. HOPC SSD and CFOG SSD both obtain a better performance than DLSS SSD , showing good resistance against nonlinear intensity changes. AWOG SSD obtains the highest CMR among all comparative methods on all datasets, proving the superiority and robustness of the proposed descriptor, AWOG.
Moreover, CFOG PC and AWOG PC perform better than CFOG SSD and AWOG SSD , respectively, which proves the effectiveness of the similarity measure PC when applied to multimodal image matching. Compared with CFOG PC , AWOG PC achieves better results than CFOG PC , benefiting from the stronger gradient description ability. In particular, AWOG PC is less sensitive to the size of the template window image. For example, the CMR of AWOG PC is 17.1% higher than that of CFOG PC on average under the template window with a size of 31 × 31 pixels.   Table 2. Again, MI shows the worst result. AWOG SSD obtains the smallest RMSE compared to all the other methods that also utilize SSD. CFOG PC and AWOG PC perform better than CFOG SSD and AWOG SSD , respectively. Furthermore, the RMSE of AWOG PC is smaller than CFOG PC , displaying the highest accuracy among all comparative approaches. In particular, the average value of RMSE on all datasets of AWOG PC is 0.135, 0.146, 0.192, 0.214, 0.269, and 0.414 pixels better than CFOG PC , AWOG SSD , CFPG SSD , HOPC SSD , DLSS SSD , and MI on average, respectively.   Figure 12 presents the average running time of all methods under different sizes of template window images on all datasets, and the corresponding values are presented in Table 3. For the methods using SSD, the computational efficiency of HOPC SSD and CFOG SSD is at the same level. At the same time, DLSS SSD achieves the highest calculation efficiency, and AWOG SSD has the lowest calculation efficiency due to the relatively complex computation of a complete descriptor for all pixels. Generally, the methods using PC are faster than the corresponding methods under the same window size. In detail, CFOG PC and AWOG PC are faster than CFOG SSD and AWOG SSD , respectively. Moreover, the running time of CFOG PC and AWOG PC increases much slower than that of CFOG SSD and AWOG SSD with the increase in the size of the matching image window. In detail, when the size of the template window is 25 × 25 pixels, the running times of CFOG PC , AWOG PC , CFOG SSD , and AWOG SSD are 0.510 s, 1.279 s, 1.428 s, and 3.019 s, respectively; the running times become 0.888 s, 1.652 s, 10.307 s, and 15.866 s when the size of the template window is 91 × 91 pixels. We can see that AWOG PC is slightly slower than CFOG PC , but much faster than CFOG SSD and AWOG SSD , showing great advantages in practical application. Overall, AWOG PC shows a good balance between accuracy and efficiency, achieving the best accuracy and a relatively good efficiency among all competing methods. Therefore, AWOG PC can be well applied to the matching work of multimodal images. Figure 13 provides the visual matching results of AWOG PC on all experimental image pairs. From the result, we can see that a large number of evenly distributed matches with high accuracy are obtained, proving the effectiveness of the proposed method.

Study Area and Datasets
To evaluate the performance and effectiveness of the automatic orientation of optical satellite images using SAR reference images, several experiments were conducted on the location of Guangzhou city and its surrounding areas in China. As shown in Figure 14, the length of the study area is about 170 km from north to south, and the width of the area is about 140 km from east to west. The western and southwestern parts of the study area are relatively flat, the southern part is the estuary of the Zhujiang River, and the rest is mountainous areas where the elevation of the central mountainous area is around 200~500 m. The largest elevation difference in the northeastern mountainous area is close to 1200 m. Figure 14. The study area (framed with the yellow rectangle) and the areas (marked by the red polygon) covered by SAR reference images.
The TerraSAR-X orthoimages were taken as the reference images, whose spatial resolution is 5 m. Their plane positioning accuracy is better than 5 m which has been testified by field-measured checkpoints. The optical satellite images to be oriented are the panchromatic images from GF-1, GF-2, and ZY-3 satellites, which have a location difference of 99.6~260.4 m from the reference images measured manually. The amount and detailed information of the experimental images are provided in Table 4. Moreover, we used SRTM-30 to provide the elevation information for the object points. Note: the size of the images acquired by the same sensor may be slightly different, and therefore we use the average size in the table.
The distribution of the reference SAR images and optical satellite images is displayed in Figure 15. We can see that the coverage of the SAR reference images mainly locates in the north-central region of the study area, and there is a specific range of vacancy in the southeast region. The four GF-1 satellite images cover all available reference images in the study area, of which a small part of the GF-1-1 and GF-1-3 images and about one third of the GF-1-2 and GF-1-4 images have no reference information. All four GF-2 images fall into the coverage of the reference images. The three ZY-3 images locate in the midwest of the coverage of the reference images, of which about one third of the ZY-3-1 images and a small part of the ZY-3-2 and ZY-3-3 images do not have reference information.

Geometric Orientation Performance
Before starting the orientation process, there are several parameters required to be set. Based on the analysis of AWOG PC in Section 5.1, we set the values of m, n, and the size of the template window image as 3, 8, and 61 × 61 pixels. Additionally, for the parameters used in the orientation process, we set the size of the template window image on the top layer of the image pyramid as 121 × 121 pixels, and the number of iterations and the gross error threshold of RANSAC as 2000 and 3 pixels, respectively. Figure 16 displays the obtained matches between the optical satellite image GF-1-3 and all available reference SAR images of it. Figure 16a-g show the matching points at each SAR image, and Figure 16h shows the gathering of all corresponding matching points on GF-1-3, where the color of a point indicates the images to which it is matched. From Figure 16, we can see that the matching points are not only large in number but also evenly distributed throughout the image, enabling them to be good virtual control points (VCPs).   Table 5 presents the detailed quantitative results. From the index of N O , we can see that each optical satellite image has a large number of VCPs. There are still 9046 VCPs obtained for the GF-1-3 image, whose coverage is mainly mountains. Additionally, after orientation, the differences in the coordinates of the corresponding locations between the SAR reference images and optical satellite images are between 0.81 and 1.68 pixels, and from 4.1 to 8.4 m when transferring to the object space. Considering the high geopositioning accuracy of SAR images, the geometric orientation of the optical satellite images is significantly improved. Moreover, the proposed method has high efficiency. The processing time of the GF-2-1 images is only 33.15 s. Generally, the running time increases with the number of reference images. For example, the processing time of GF-1-1 is less than that of GF-1-4, even though more matches are found for GF-1-1 considering the reference images of GF-1-1 total six, while those of GF-1-4 total seven. In particular, the running time becomes 98.18 s when there are seven reference images, as in the experiment of GF-1-3, which is still relatively short.  Figure 17 displays three registration checkerboards of the optical satellite image and reference SAR image. We can see that the location difference between the optical satellite image and SAR image is large before orientation, which reaches more than 200 m for the GF-2 images, while there is barely a positioning difference between the two images after orientation.   The calculation of a dense high-dimensional descriptor for each pixel is very timeconsuming. Therefore, we introduced the FOI to accelerate the calculation process. Figure 18 presents the running time used for calculating the descriptor with or without the FOI on the datasets used in Section 5.1.1. We can see that the efficiency improves by about 30% for all image pairs. The reasons are mainly attributed to two aspects. Firstly, the weights and weighted values used for generating the elements of the feature descriptor on each orientation can be calculated together in the form of a matrix rather than pixel by pixel. Secondly, when integrating the adjacent information, the weighted values for the pixels falling in the statistical window can be assigned to the corresponding gradient orientation directly with the help of the FOI. In this way, we do not need to count the contribution and add the contributions from each pixel as the traditional method does. Thus, the use of the FOI dramatically improves the efficiency, increasing the practicability of the proposed descriptor.

Parameter Study
The number of subranges n divided by the whole gradient direction range, and the size m of the window image used for computing the descriptor may significantly affect the matching accuracy and efficiency. A proper setting of the value of m and n is crucial to fully take advantage of our method's strength. Theoretically, a small value of n will reduce the dimensionality of the descriptor, which may reduce the image matching accuracy. In contrast, a large value of n will increase the dimensionality of the descriptor and improve the distinguishability but may suffer from low efficiency. For the parameter m, generally, the larger the m, the worse the feature vector distinguishability, and the larger the amount of calculation. To find the optimal value of m and n, we conducted lots of experiments and present the experimental results of image pairs 1 and 2 in Tables 6 and 7. From the results of Table 6, we can see that with the increase in the value of n, the CMR increases, RMSE reduces gradually, and t increases gradually. When the value of n reaches 8, the values of RMSE and CMR are 0.606 pixels and 96.5%, respectively. After that, the performances of RMSE and CMR improve slightly, but the running time increases a lot. Therefore, we set the value of n as 8 in this paper. From the results of Table 7, we can see that the performances of RMSE and CMR both decrease with the increase in m. The reason is that the larger the value of m, the more adjacent information applied, which reduces the distinctiveness of the feature descriptor and introduces match confusion. Based on the analysis, we set the value of m as 3.

Conclusions
This paper proposed an accurate geometric orientation method of optical satellite images using SAR images as reference data. Firstly, we developed a multimodal imageapplicable dense descriptor AWOG through proposing an angle-weighted oriented gra-dient calculation method and introducing a feature orientation index table. To speed up the matching process of the calculated feature descriptors, the similarity measure 3D phase correlation was employed. Experiments on multiple multimodal image pairs proved that the proposed method is superior to the competing state-of-the-art multimodal image matching methods in terms of accuracy and efficiency, which can be well applied to fulfill the matching of optical satellite images and SAR images. In particular, the correct matching ratio increases by about 17%, and the matching accuracy improves by more than 0.1 pixels compared with the latest multimodal image matching method, CFOG. In terms of efficiency, the running time of the proposed method is only about 20% of HOPC, DLSS, and MI.
In addition, we proposed a general framework for the precise orientation of optical satellite images based on multiple SAR reference images. Significantly, the matches obtained from the optical satellite images and SAR images were taken as virtual control points to optimize the initial RPC parameters, improving the geometric positioning accuracy. Taking 12 TerraSAR-X images as the reference data, 4 GF-1 images, 4 GF-2 images, and 3 ZY-3 images were oriented with the proposed framework. The experimental results reveal the superior effectiveness and robustness of our framework. The VCPs obtained from the SAR reference images were sufficient and evenly distributed. For the GF-1 image with a large image size, more than 9000 VCPs were obtained; for the GF-2 image with a relatively small size, there were still 897 well-distributed VCPs obtained. As a result, the positioning accuracy of all used optical satellite images improved, from more than 200 to within 10 m. In addition, the proposed method has very high computation efficiency. The shortest time used for processing an image is about 33 s, and the longest time is still less than 100 s, which renders it well applicable to practical applications.
Since AWOG is not invariant to rotation and scale change, we eliminated the rotation and scale differences between SAR and optical satellite images before matching with the help of DEM and RPC data. In detail, a corresponding searching area of the matching window on the SAR reference image was determined on the optical satellite image to provide the initial location for template matching. However, when the DEM or RPC parameters of the target area cannot be obtained, our method is not applicable. In addition, our method relies on the image structural features. If the structural features in an image are not rich enough, such as the image containing a large forest area, the number of effective GCPs will reduce, and the distribution will worsen.
In the future, we would like to enable our method to be invariant to scale and rotation change. To solve this problem, we need to first consider fulfilling coarse image matching and obtaining the initial image transformation matrix by using image features and further refining the previous result with the method proposed in this study. For images with weak structural features, an image feature based on the image textures can be introduced to help with image matching. Additionally, we are willing to keep improving the matching performance of SAR and optical satellite images and conduct joint bundle adjustment of SAR images and optical satellite images to further enhance the applicability of using SAR images to automatically aid the orientation of optical satellite images. Moreover, deep learning is a very fast-growing technique. It is promising to apply it to real multimodal image processing applications in the short term. Data Availability Statement: Not applicable.