A CNN-Based High-Accuracy Registration for Remote Sensing Images

In this paper, a convolutional neural network (CNN)-based registration framework is proposed to improve the registration accuracy between two remote-sensed images acquired at different times and from different viewpoints. The proposed framework consists of four stages. In the first stage, key-points are extracted from the two input images, a reference image and a sensed image, and a patch is constructed at each key-point. The second stage consists of three processes for patch matching: candidate patch pair list generation, one-to-one matched label selection, and geometric distortion compensation. One-to-one matched patch pairs between the two images are found, and the exact matching is refined by compensating for geometric distortions in the matched patch pairs. In the third stage, a global geometric affine parameter set is computed using the random sample consensus (RANSAC) algorithm. Finally, a registered image is generated by warping the input sensed image with the affine parameter set. The proposed high-accuracy registration framework is evaluated on the KOMPSAT-3 dataset against conventional machine-learning-based and deep-learning-based frameworks. The proposed framework obtains the lowest root-mean-square error of 34.922 over all control points and achieves a 68.4% increase in matching accuracy compared with the conventional registration framework.


Introduction
Image registration is the process of geometric synchronization between a reference image and a current image of the same area. These images are acquired at different times and from different viewpoints by different sensors [1]. Thus, image registration is an essential preprocessing step in many remote sensing applications, because the main process, which may include change detection, image fusion, image mosaicking, environment monitoring, and map updating, can be drastically influenced by these differences [1,2]. Many types of image registration techniques have been developed in remote sensing over the past few decades. Registration frameworks can be classified into two categories: area-based frameworks and feature-based frameworks [1].
We introduce conventional image registration frameworks for the two categories: area-based frameworks and feature-based frameworks. In area-based frameworks, the registration problem is transformed into an optimization problem in which the similarity between the reference and sensed images is maximized. Conventional area-based registration frameworks find correspondences at multiple key-points between the input and reference images using similarity measures such as mutual information (MI) [3,4] or normalized cross-correlation (NCC) [5]. The detected correspondences are used to estimate the global geometric transform. However, these measures are sensitive to illumination changes and noise [1]. Liang et al. proposed spatial and mutual information (SMI) as the similarity metric for searching similar local regions using ant colony optimization [3]. Patel and Thakar employed MI based on

In this paper, we propose a CNN-based registration framework for remote sensing that can improve the registration accuracy between two remote-sensed images acquired at different times and from different viewpoints. The framework can be summarized as follows: First, multiple key-points and their patches are extracted from the two input images using scale-space extrema detection; each patch contains one key-point at its center. With a conventional matching network, the corresponding patch pairs found in the matching step exhibit geometric distortions, such as translation, scale, and shearing, because learning an invariant mapping function is difficult. For an accurate local patch matching process, we adopt the geometric CNN proposed in [24], hereafter called GMatchNet, to compensate for the geometric distortion of each matched patch pair. A local geometric transformation is estimated from each matched patch pair, and the corresponding center coordinate of each input patch is finely adjusted using this local transform. Then, we compute the global geometric affine parameter set from all the adjusted coordinates using the random sample consensus (RANSAC) algorithm. Finally, a registered image is generated by warping the input sensed image with the global affine parameter set. The proposed framework is evaluated on the KOMPSAT-3 dataset against conventional machine-learning-based and deep-learning-based frameworks. We perform registration of images in which magnetic north is aligned with the universal transverse Mercator coordinate system.
It is shown that the proposed high-accuracy registration framework improves the accuracy of image registration by compensating for the geometric distortion between matched patch pairs, and that it can be applied to other patch-based registration frameworks.
The remainder of this paper is structured as follows: Section 2 introduces related work on image registration, deep learning, and patch matching. Section 3 details the proposed registration framework that uses the estimated geometric transformation in the corresponding patch pairs. Section 4 discusses the experimental results, and, finally, Section 5 summarizes the conclusions of the study.

High-Accuracy Registration Framework
The proposed framework consists of two different CNNs, MatchNet [21] and GMatchNet [24], as shown in Figure 2. First, multiple key-points and their patches were extracted from the reference image and the sensed image; each patch (64 × 64 pixels) includes one key-point at its center. The next stage consists of three distinct processes for patch matching. For each reference patch of the reference image, a list of candidate patches was selected from the sensed image by MatchNet. Then, a one-to-one matched label for each reference patch was determined from its candidate list based on cross-correlation. A local affine parameter set was estimated by GMatchNet for each matched patch pair output by the matched label selection, and the coordinate of the matched patch was finely adjusted using this local transformation. Then, the global geometric affine parameter set was computed from all the adjusted reference coordinates using the RANSAC algorithm. Finally, the warping process was performed to geometrically synchronize the reference and sensed images.

Patch Extraction Based on Scale-Space Extrema
In the first stage of key-point detection, the locations and scales that can be repeatably assigned under different views of the same object were identified. Locations that are invariant to a change in the scale of the image can be detected by searching for stable features across all possible scales using a continuous function of scale known as the scale-space. Subsequently, the Laplacian of Gaussian (LoG) of the image was computed for various standard deviation (σ) values. The LoG operates as a blob detector that detects blobs of various sizes as σ changes. However, the LoG incurs a relatively heavy computational load. Therefore, the proposed framework adopts the difference of Gaussians (DoG), which approximates the LoG. The DoG is the difference between two Gaussian blurrings of an image with standard deviations σ and kσ. When the DoG is generated, the local extrema are retrieved from the image, which yields the key-points. Lowe proposed an empirical parameter set: the number of octaves set to 4, the number of scale levels set to 5, the initial σ set to 1.6, and k set to √2 [6]. In the second step, image patches of 64 × 64 pixels were extracted with the detected key-points as their central points. Here, we assume that the reference image and the sensed image are I_1 and I_2, respectively. If I_1 has m key-points, its patches are P_1 = {p_1^1, p_1^2, ..., p_1^m}; if I_2 has n key-points, its patches are P_2 = {p_2^1, p_2^2, ..., p_2^n}. Thus, we can form the image patch pairs (p_1^i, p_2^j) by combining the patches of I_1 and I_2, where i = 1, 2, ..., m and j = 1, 2, ..., n.
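The DoG extrema detection and 64 × 64 patch extraction described above can be sketched as follows. This is a minimal single-octave numpy illustration, not the authors' implementation: the function names, the detection threshold, and the brute-force 3 × 3 × 3 extremum search are all illustrative simplifications of Lowe's full algorithm.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D Gaussian kernel, normalized to sum to 1
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    # Separable Gaussian blur via two 1-D convolutions
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def dog_keypoints(img, sigma=1.6, k=np.sqrt(2), levels=5, thresh=0.03):
    # Difference of Gaussians over one octave: D_i = G(k^{i+1} sigma) - G(k^i sigma);
    # key-points are the local extrema of the 3x3x3 scale-space neighborhood
    sigmas = [sigma * k**i for i in range(levels)]
    gauss = np.stack([blur(img, s) for s in sigmas])
    dog = gauss[1:] - gauss[:-1]
    pts = []
    for s in range(1, dog.shape[0] - 1):
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                cube = dog[s-1:s+2, y-1:y+2, x-1:x+2]
                v = dog[s, y, x]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    pts.append((y, x))
    return pts

def extract_patch(img, y, x, size=64):
    # 64 x 64 patch centered on the key-point; None if it falls off the image
    h = size // 2
    if y - h < 0 or x - h < 0 or y + h > img.shape[0] or x + h > img.shape[1]:
        return None
    return img[y - h:y + h, x - h:x + h]
```

In practice, a full implementation would add octave downsampling, subpixel refinement, and edge-response rejection; the sketch only conveys the detect-then-crop structure of the first stage.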

Training Method for Matched Candidate List Generation
MatchNet is a deep network architecture that determines the correspondence of two images by analyzing the similarity of their features. The structure of MatchNet is illustrated in Figure 3, and its layer parameters are listed in Table 1. To compare the similarity of two patches, they are first passed through the same feature network. The performance of MatchNet strongly depends on a sufficient training dataset for optimizing its parameters. However, it is difficult to obtain a labeled remote sensing image dataset. Thus, we adopt augmentation to construct a training dataset: the augmented dataset consists of remote sensing images transformed by a set of rotation matrices R_θ. Let P_i and M be the i-th image patch and the number of image patches, respectively. Then, P_i can be transformed into an image set R_θ(P_i). The patch size of MatchNet is 64 × 64. The matched patch pairs are (P_i, R_θ(P_j)) with i = j and θ = 0°, and the unmatched patch pairs are (P_i, R_θ(P_j)) with i = j and θ ≠ 0°, or with i ≠ j, where i, j = 1, 2, ..., M. Therefore, a training sample has the structure (P_i, R_θ(P_j), y_ij^θ). Figure 4 illustrates examples of training patch pairs. The feature and metric networks were jointly trained in a supervised setting using the Siamese structure. The training dataset was constructed with a 1:1 ratio of matched to unmatched patch pairs using the sampling method of [21]. The cross-entropy error was minimized over a training set of n patch pairs using SGD with momentum. The cross-entropy is defined by E = -(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ], where y_i is the ground-truth label of the i-th pair (1 for matched, 0 for unmatched) and ŷ_i is the predicted matching probability.
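The pair construction and the cross-entropy loss above can be sketched in a few lines of numpy. This is a hedged illustration: the paper uses arbitrary rotation matrices R_θ, while the sketch substitutes 90° grid rotations (np.rot90) so that the 64 × 64 patch grid is preserved without interpolation; the function names are our own.

```python
import numpy as np

def rotate90(patch, times):
    # Stand-in for R_theta: 90-degree rotations keep the 64x64 grid intact
    return np.rot90(patch, times)

def build_pairs(patches):
    # Matched pair:   (P_i, R_0(P_i))        -> label 1 (i == j, theta = 0)
    # Unmatched pair: (P_i, R_90(P_i))       -> label 0 (i == j, theta != 0)
    # Unmatched pair: (P_i, P_{i+1 mod M})   -> label 0 (i != j)
    pairs = []
    m = len(patches)
    for i in range(m):
        pairs.append((patches[i], rotate90(patches[i], 0), 1))
        pairs.append((patches[i], rotate90(patches[i], 1), 0))
        pairs.append((patches[i], patches[(i + 1) % m], 0))
    return pairs

def cross_entropy(y, y_hat, eps=1e-12):
    # Binary cross-entropy averaged over n pairs, as minimized with SGD
    y = np.asarray(y, float)
    y_hat = np.clip(np.asarray(y_hat, float), eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

The clipping in cross_entropy avoids log(0) when the network saturates; a training loop would feed these labeled pairs to the Siamese feature and metric networks.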

Matched Label Selection
In the matched label selection of the proposed framework, all image patch pairs (p_1^i, p_2^j) from the reference image I_1 and the sensed image I_2 were fed to the trained CNN to predict multiple candidate lists. These lists were generated from patches with matched label sets. Owing to the remote sensing imaging mechanism and the small patch size, MatchNet can find more than one similar image patch between I_1 and I_2. This one-to-many matching leads to an ill-posed problem, which can be a major cause of an inaccurate geometric affine parameter set. We therefore adopted a local constraint using the NCC to select one matching pair among the patches from the multiple candidate lists.
The NCC measures the similarity of two patches based on pixel intensity and serves as the local constraint. In this study, we selected only the matched patch pair with the maximum NCC. The NCC of a patch pair (p_1^i, p_2^j) was computed as NCC(p_1^i, p_2^j) = Σ_{x,y} (p_1^i(x, y) − p̄_1^i)(p_2^j(x, y) − p̄_2^j) / √( Σ_{x,y} (p_1^i(x, y) − p̄_1^i)² Σ_{x,y} (p_2^j(x, y) − p̄_2^j)² ), where p_1^i(x, y) and p_2^j(x, y) are the gray values of image patches p_1^i and p_2^j at location (x, y), respectively, and p̄_1^i and p̄_2^j are their average gray values. The patch with the highest NCC value among the candidates was selected as the matched label.
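The NCC formula and the maximum-NCC selection rule can be sketched directly in numpy; the function names below are illustrative, not from the paper.

```python
import numpy as np

def ncc(p1, p2):
    # Normalized cross-correlation of two equal-size patches:
    # subtract each patch's mean, then correlate and normalize
    a = p1 - p1.mean()
    b = p2 - p2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def select_matched_label(ref_patch, candidates):
    # Pick the candidate with the maximum NCC as the one-to-one matched label
    scores = [ncc(ref_patch, c) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]
```

NCC is bounded in [-1, 1]; an identical patch scores 1, which is why the arg-max resolves the one-to-many ambiguity left by MatchNet's candidate lists.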

Matched Patch Compensation with Local Geometric Transformation
As learning an invariant mapping function is difficult, geometric distortions, such as translation, scale, and shearing, appear between the matched patch pairs, and it is necessary to correct them. Figure 5a,b illustrate a matched patch pair in which the two patches exhibit geometric distortions. To compensate for the geometric distortion, we adopted a pre-trained GMatchNet, which has been proposed for determining correspondences between two images in agreement with a geometric model, such as the geometric affine parameter set. Figure 6 shows a diagram of the GMatchNet architecture. GMatchNet proceeds in four steps. First, input patches P_1 and P_2 are passed through a Siamese architecture consisting of convolutional layers, extracting feature maps F_1 and F_2. Second, the feature maps of the two images are matched into a tentative correspondence map F_12. Third, a regression CNN directly outputs the geometric affine parameter set θ̂. Finally, the network generates a new transformed image, P_2^t, by applying the transform T_θ̂ to the image P_2. We calculated the central coordinate of the newly generated image P_2^t and used it to adjust the key-point position.
In the case of GMatchNet, pre-trained weights were publicly available and could be used without any fine-tuning, since satisfactory performance was achieved when those pre-trained weights were applied to our framework.
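The final step above, turning the regressed affine parameter set θ̂ into an adjusted key-point coordinate, can be sketched as follows. This is a minimal sketch under our own assumptions: we assume θ̂ is given as the six entries of a 2 × 3 affine matrix acting on patch-local (x, y) coordinates, and that the key-point sits at the patch center (32, 32); the helper names are hypothetical.

```python
import numpy as np

def adjust_center(theta, center=(32.0, 32.0)):
    # theta = [a11, a12, tx, a21, a22, ty]: a 2x3 affine T_theta (assumed layout).
    # Map the 64x64 patch's center coordinate through the local transform.
    a = np.array(theta, float).reshape(2, 3)
    x, y = center
    return a @ np.array([x, y, 1.0])  # new (x, y) inside the patch

def refine_keypoint(kp_xy, theta, patch_half=32.0):
    # Convert the patch-local adjustment into an image-coordinate key-point update
    cx, cy = adjust_center(theta)
    dx, dy = cx - patch_half, cy - patch_half
    return kp_xy[0] + dx, kp_xy[1] + dy
```

With the identity parameters the key-point is unchanged; a pure translation in θ̂ shifts the matched key-point by the same offset, which is exactly the fine adjustment fed to the global RANSAC stage.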

Global Constraints and Warping
The RANSAC algorithm estimates a model from a set of observed data through a random sampling and voting scheme, often interpreted as an outlier detection method, which can further remove falsely matched points globally in addition to the local constraints. Using the compensated matching labels from the previous step, we calculated the global geometric affine parameter set W with the RANSAC algorithm. Finally, we warped the sensed image using W, generating the registered image.

Results
In this study, we constructed datasets for both patch matching and registration using multispectral red, green, and blue images of cities around Seoul, South Korea, captured by the KOMPSAT-3 satellite at a resolution of 2.8 m. Regions in Seoul are densely populated, and their landscape changes frequently with the emergence of new skyscrapers. By contrast, the areas around Seoul are agricultural, with colors that vary with the seasonal conditions. The experiment was performed on a computer powered by an Intel(R) Core i7-8700K 3.40 GHz CPU with an NVIDIA GeForce GTX 1080 Ti GPU. In the following sections, we discuss the training and validation methods for patch matching via MatchNet and the evaluation metrics, and evaluate the performance of each remote sensing image registration framework.
We also explain the details of the dataset used for MatchNet. The training and validation sets for patch matching consisted of images of Suwon City. Patches of 64 × 64 pixels were extracted around key-points detected by scale-space extrema detection [6]. The resulting dataset was divided into 130k patch pairs for training and 50k for validation. We used a sampler to generate an equal number of matched and unmatched patch pairs in each batch so that the network would not be overly biased toward the unmatched decision [25].
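The balanced batch sampling mentioned above can be sketched as follows; this is our own illustrative sampler (names and batch size are assumptions), not the one from [25].

```python
import numpy as np

def balanced_batches(pairs, labels, batch_size=32, seed=0):
    # Yield batches with a 1:1 ratio of matched (label 1) to unmatched (label 0)
    # pairs, so training is not biased toward the majority unmatched decision
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    rng.shuffle(pos)
    rng.shuffle(neg)
    half = batch_size // 2
    for k in range(min(len(pos), len(neg)) // half):
        idx = np.concatenate([pos[k*half:(k+1)*half], neg[k*half:(k+1)*half]])
        rng.shuffle(idx)  # mix matched and unmatched within the batch
        yield [pairs[i] for i in idx], labels[idx]
```

Sampling stops when the rarer class is exhausted, which caps the number of batches at the matched-pair count divided by half the batch size.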

Evaluation Datasets and Metrics for Remote Sensing Image Registration Frameworks
The datasets for the evaluation of remote sensing image registration consisted of images of Seoul and its surroundings acquired at different times: three areas in the city and one area around it. Table 2 lists the detailed information of those images; for each area, the upper row represents the reference image and the lower row the sensed image. All satellite images were divided into 500 × 500 images, and each pair consisted of images of the same area captured at different times. The characteristics of each area are as follows: the Area 1 dataset consists of images of residential areas, the Area 2 dataset of residential and green lung areas, the Area 3 dataset of industrial facilities, and the Area 4 dataset of skyscrapers. The metrics from [26] were employed in this study to objectively evaluate the proposed high-accuracy registration framework, as follows: the number of control points (N_red); the root-mean-square error (RMSE) based on all control points and normalized to the pixel size (RMS_all); the RMSE computed from the control point residuals by the leave-one-out method (RMS_loo); the statistical evaluation of the residual distribution across quadrants (P_quad); the bad point proportion with a norm greater than 1.0 (BPP(1.0)); the statistical evaluation of the presence of a preference axis in the residual scatter plot (Skew); the statistical evaluation of the goodness of the control point distribution across the image (Scat); and the cost function (φ), a weighted sum of the above seven measures. Smaller values indicate better performance for the six metrics other than N_red. The cost function was used as an objective tool to evaluate the different control points for a pair of images. The registration accuracy was measured in terms of RMS_all and RMS_loo.
The quantity and quality of matching points were measured in terms of N_red and φ, respectively; lower values are better for all metrics except N_red, for which a larger value is better. We observe that both RMS_all and RMS_loo reach or approach subpixel error, which is a significant registration result. N_red measures the number of correctly matched points; a larger N_red together with a smaller RMSE implies more accurate point matching.
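The two accuracy metrics, RMS_all and the leave-one-out RMS_loo, can be sketched as follows. This is a hedged numpy illustration of the metric definitions (an affine model is assumed for the leave-one-out fit; the exact model in [26] may differ).

```python
import numpy as np

def fit_affine(src, dst):
    # Least-squares 2x3 affine mapping src control points to dst
    A = np.hstack([src, np.ones((len(src), 1))])
    return np.linalg.lstsq(A, dst, rcond=None)[0].T

def rms_all(residuals):
    # RMSE over all control-point residuals (n, 2), in pixel units
    r = np.asarray(residuals, float)
    return float(np.sqrt((r ** 2).sum(axis=1).mean()))

def rms_loo(src, dst):
    # Leave-one-out RMSE: each control point is scored by an affine
    # fitted on all the remaining points
    errs = []
    n = len(src)
    for i in range(n):
        keep = np.arange(n) != i
        W = fit_affine(src[keep], dst[keep])
        pred = W @ np.append(src[i], 1.0)
        errs.append(((pred - dst[i]) ** 2).sum())
    return float(np.sqrt(np.mean(errs)))
```

RMS_loo is the stricter of the two, since each point is evaluated by a model that never saw it; on perfectly affine-consistent control points both metrics vanish.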

Evaluation of Remote Sensing Image Registration Framework
The proposed frameworks were compared with the conventional feature-based image registration framework, SIFT, and a state-of-the-art deep learning-based image registration framework; for the latter, we used the DBN network structure proposed by Wang et al. [19]. The deep learning-based frameworks were tested with two training methods: the conventional method and the proposed training method. We defined the improved accuracy IA_φ of φ as IA_φ = (SIFT_φ − DNN_φ) / SIFT_φ × 100 (%), where SIFT_φ and DNN_φ are the φ values of the SIFT-based framework and each DNN-based framework, respectively. N_red measures the number of correct corresponding points; a larger N_red and a smaller RMS_all imply more accurate point matching. Table 3 summarizes the experimental results using the eight metrics on the four evaluation datasets; its last line gives the results averaged over all areas. In the DBN-based framework of Wang et al. [19], although the number of control points (N_red) was large, it contained mismatched points and therefore an increased RMS_all value. For qualitative assessment, we used checkerboard mosaic images, which demonstrate the subjective quality well in terms of edge continuity and region overlapping.
In Table 3, on the one hand, the DBN-based framework generated a large N_red, but the RMS values increased owing to mismatched control points. On the other hand, the proposed framework on the Area 1 dataset had a smaller N_red but the lowest RMS value, representing the pixel error. In addition, the smallest φ of 0.855 was obtained for the quality of matching points. The performance of the proposed framework was 40.2% better than that of the SIFT-based framework in Area 1. Figure 7a,b illustrate the pair of images from Area 1, which were acquired by the KOMPSAT-3 satellite in March 2014 and October 2015; the green boxes indicate the same region in the three images and show smooth edges.
In Area 2, on the one hand, the DBN-based framework increased the RMS values, representing the quality of the matching points, because the points did not match; thus, its performance dropped by 79.54%. On the other hand, the proposed framework on the Area 2 dataset had a relatively large N_red and the lowest RMS value. In addition, the smallest φ of 1.478 was obtained for the quality of matching points. The performance of the proposed framework was 68.98% better than that of the SIFT-based framework in Area 2. Figure 8 shows the corresponding image pair from Area 2.
In Area 3, on the one hand, the DBN-based framework increased the RMS values because the points did not match; thus, its performance dropped by 315.32%. On the other hand, the proposed framework on the Area 3 dataset produced the largest N_red and the lowest RMS value. In addition, the smallest φ of 3.093 was obtained for the quality of matching points. The proposed framework performed 85.88% better than the SIFT-based framework in Area 3; this was the greatest performance improvement, observed in the industrial facility areas. Figure 9 shows the corresponding image pair from Area 3.
In Area 4, on the one hand, the DBN-based framework generated a large N_red along with an increased RMS. On the other hand, the proposed framework produced the second-largest N_red but the lowest RMS, representing the pixel error. In addition, the smallest φ of 29.187 was obtained for the quality of matching points. The proposed framework performed 78.63% better than the SIFT-based framework in Area 4. Figure 10a,b illustrate the pair of images from Area 4, acquired in December 2014 and October 2015. The changes observed in Figure 10 are large owing to the differences in skyscrapers and viewpoints. Figure 10c,d show the registration results of the SIFT-based framework and the DBN-based framework, respectively; both failed to register the images, whereas the proposed framework registered them successfully.

In the KOMPSAT-3 image datasets, the DBN-based framework generated the largest N_red but also a larger RMS value, representing the matching point quality, because the points did not match. The DBN-based framework yielded an RMS_all value, representing the registration accuracy, of 165.786, whereas the proposed framework significantly reduced it to 34.922. Likewise, the DBN-based framework yielded a φ value, representing the matching point quality, of 41.904, whereas the proposed framework reduced it to 8.653. Overall, the proposed framework achieved a performance improvement of 68.4%. The most remarkable improvement was observed in areas with high-rise buildings, where the image changes substantially as the viewpoint shifts.

Conclusions
In this study, we proposed a CNN-based registration framework for remote sensing that can improve the image registration accuracy between two remote-sensed images acquired at different times and from different viewpoints. The matching step often produces geometric distortions, such as translation, scale, and shearing, between the matched patch pairs, given that the invariant mapping function is difficult to learn. To correct these distortions, we adopted a geometric CNN with a stronger invariance property to find a local affine parameter set for each matched patch pair. Accordingly, we constructed multiple candidate lists, from which we estimated the local geometric transform. The proposed framework was evaluated on the KOMPSAT-3 dataset by comparing conventional machine-learning-based frameworks and the proposed deep-learning-based framework. The proposed framework obtained the smallest RMSE of 34.922 based on all control points and achieved a 68.4% increase in matching accuracy compared with the conventional registration framework. As the proposed framework is composed of two different networks, it incurs extra computational complexity owing to the redundancy of the two feature networks; a unified network that alleviates this complexity is a future direction of this research.