A Two-Stage Deep Learning Registration Method for Remote Sensing Images Based on Sub-Image Matching

Abstract: The registration of multi-temporal remote sensing images with abundant information and complex changes is an important preprocessing step for subsequent applications. This paper presents a novel two-stage deep learning registration method based on sub-image matching. Unlike the conventional registration framework, the proposed network learns the mapping between matched sub-images and the geometric transformation parameters directly. In the first stage, the matching of sub-images (MSI), sub-images cropped from the images are matched through the corresponding heatmaps, which are made of the predicted similarities of the sub-image pairs. In the second stage, the estimation of transformation parameters (ETP), a network with a weight structure and position embedding estimates the global transformation parameters from the matched pairs. The network can deal with an uncertain number of matched sub-image inputs and reduce the impact of outliers. Furthermore, a sample-sharing training strategy and an augmentation based on the bounding rectangle are introduced. We evaluated our method against conventional and deep learning methods, qualitatively and quantitatively, on the Google Earth, ISPRS, and WHU Building datasets. The experiments showed that our method obtained a probability of correct keypoints (PCK) of over 99% at α = 0.05 (α: the normalized distance threshold) and achieved a maximum increase of 16.8% at α = 0.01 compared with the latest method. The results demonstrate that our method is robust and improves the precision of registration of optical remote sensing images with great variation.


Introduction
Image registration is the process of eliminating the relative position deviation of corresponding pixels by calculating the transformation between multiple images of the same scene taken at different times, from different viewpoints, or by different sensors. It is one of the essential steps of remote sensing applications. In recent years, remote sensing images have gradually developed toward high spatial, spectral, and temporal resolution. There are an increasing number of application scenarios for the registration of high-resolution aerial and satellite remote sensing images, such as land analysis, urban development assessment, and geographical change assessment. The robustness and precision of remote sensing image registration have an important influence on follow-up tasks such as change detection and image fusion. However, multi-temporal optical remote sensing imagery with high resolution often suffers from complex effects, such as illumination variation, occlusion, changes in land and buildings, and complex geometric distortions. These complications are caused by sunlight, changes in weather, natural disasters, human activities, and photographing drastic topographic relief and high-rise buildings at a low viewpoint, all of which make registration difficult.

In end-to-end regression-based deep learning methods, the image pairs are sent into the network, and the transformation model is predicted directly. These methods share the characteristics of strong robustness, fast speed, simple training, and convenient usage. However, as remote sensing images are much larger than general images, they need considerable down-sampling, which may lead to coarse feature extraction and matching, resulting in poor precision or failure, especially in images without salient contours.
To tackle the problem of local feature-based methods, which fail easily, and to improve the precision of regression-based deep learning methods, we propose a novel two-stage deep learning registration method based on sub-image matching, aiming to register multi-temporal optical remote sensing images with high resolution and great differences. First, a convolutional neural network (CNN) is applied to search for the corresponding sub-image patches. Then, the parameters of the geometric transformation model are predicted directly from the sub-images. Our method aims to improve precision while retaining the robustness of end-to-end methods. The contributions of this paper are summarized as follows:
•
To match large patches, we propose a matching network called ScoreCNN, which contains an inner product structure for the matching of sub-images (MSI). It estimates the similarities between image patches to generate the corresponding heatmaps, and a filtering algorithm is designed to sort out high-quality matched pairs from the candidates.

•
A regression network with weight structure and position embedding is proposed in the estimation of transformation parameters (ETP). It can directly estimate the parameters of a transformation model with an uncertain number of matched sub-images. The weight structure can learn to evaluate the quality of inputs and mitigate the impact of the inferior ones.

•
We introduce the training method of the two-stage registration model, including the strategy of sharing training samples and the augmentation of random shifting based on bounding rectangles. Experiments showed that our method improved the robustness and precision of registration in the images of various terrain.

Materials and Methods
The proposed method mainly consists of two modules, MSI and ETP, containing CNNs for matching and regression. The pipeline and the definition of the main components are shown in Figure 1. First, matched sub-image pairs are acquired by the network and the filtering algorithm in MSI from the patches cropped from the source image and target image. Second, the global transformation matrix between the two images is predicted from the matched sub-images in ETP. Finally, the source image is warped by the transformation result.

Figure 1. Pipeline of the proposed method. A_θ is the affine transformation matrix from the source image to the target image. The orange boxes are the sub-images selected from the source image, and the blue boxes represent the sub-images searched from the target image to be matched with each source sub-image. p^k_ij represents the similarity between the kth sub-image from the source image and the sub-image located at (i, j) in the target image. The maximum of p^k_ij indicates the location of the best-matched pair. (x_s,m, y_s,m) and (x_t,m, y_t,m) represent the coordinates of the mth pair of sub-images from the source and target images, respectively.

Matching of Sub-Images
The first part of the registration system aims to obtain corresponding image pairs of moderate size, with corresponding features, from the source image and the target image for the following estimation. The reason for cropping image patches instead of using down-sampled images as inputs is that the details of the images can be retained to the maximum extent, and the coordinate information attached to the image patches can reduce the negative impact of drastic geometric distortion between images. Also, invalid areas with great differences can be easily excluded when matching, which improves the precision of registration. Unlike conventional local features, the candidate sub-images are not neighborhoods centered on certain points, so they do not require high locating precision; they need only to cover the corresponding features.
To obtain correct corresponding pairs, the proposed convolutional network, namely ScoreCNN, correlates the features of two sub-images and predicts the similarity between them to find the matched pairs. Sub-images of the same size are cropped from the source images and target images. This section describes in detail ScoreCNN's architecture, the fast filtering algorithm, and the MSI workflow they comprise.

Architecture of ScoreCNN
ScoreCNN is a Siamese network consisting of three parts: feature extraction, feature correlation, and the head, as shown in Figure 2. Feature extraction, with a shared-weights backbone network, extracts the three-dimensional feature maps f_A, f_B ∈ R^(d×h×w) from the sub-images. The feature maps can be regarded as d-dimensional dense local feature descriptors of the input images. We adopt the correlation layer used in the regression task in [45] as feature correlation here, as the way it computes similarities in regression can be extended to the matching task. Feature correlation outputs correlation maps C_AB ∈ R^((h×w)×h×w) made up of scalar products of the feature descriptors v ∈ R^d at each position in a pair of feature maps. The head turns the correlation maps into the probability p, which is the output of ScoreCNN.
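As a concrete illustration of the correlation layer, the scalar products of all descriptor pairs can be computed with a single matrix product (a minimal NumPy sketch, not the paper's implementation; shapes follow the definitions above):

```python
import numpy as np

def correlation_map(f_a: np.ndarray, f_b: np.ndarray) -> np.ndarray:
    """Dense feature correlation: scalar products of the d-dimensional
    descriptors at every pair of positions in two feature maps.

    f_a, f_b: feature maps of shape (d, h, w).
    Returns C_AB of shape (h*w, h, w), as in the text.
    """
    d, h, w = f_a.shape
    va = f_a.reshape(d, h * w)   # columns = descriptors of image A
    vb = f_b.reshape(d, h * w)   # columns = descriptors of image B
    c = va.T @ vb                # (h*w, h*w): all pairwise dot products
    return c.reshape(h * w, h, w)
```

For identical inputs, the "diagonal" entries are the squared norms of the descriptors, which is a quick sanity check on the indexing.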

The input image size of ScoreCNN is 240 × 240, from which feature maps of size d × h × w are extracted, where the dimension d varies with the backbone. The head of ScoreCNN is composed of 3 × 3 convolutional layers, rectified linear units (ReLU), pooling layers, a fully connected layer, and a sigmoid function for logistic regression. The detailed setups are shown in Figure 3. As ScoreCNN can be treated as a classification network during training, we consider VGG-16 [46] or ResNet-18 [47], which are both widely used and capable of good performance in image classification, as the backbone of feature extraction. To select the superior backbone, we preliminarily train ScoreCNN with each of them and evaluate with the indices mentioned in Section 3.1. To unify the output size of the feature maps in the H and W dimensions, we only embed the first four layers of VGG-16 and the first three layers of ResNet-18 in ScoreCNN, with output sizes of 512 × 15 × 15 and 256 × 15 × 15, respectively. We adopt the binary cross-entropy function as the loss function. The comparison results in Table 1 show that the performances of the two backbones are close, but the total number of parameters in ScoreCNN with the ResNet backbone is at least 50% less than with the VGG backbone, and its computational complexity is only 10% of VGG-16's. Thus, ResNet-18 is adopted as the final backbone of ScoreCNN, and the following experiments in this paper were carried out based on it.
The similarities estimated by ScoreCNN at different positions are taken as the metric for matching pairs and are positively associated with the matching probability. The best-matched sub-image pairs of high quality are obtained based on the rule of a unique significant peak value. Specifically, several non-overlapping image patches, I_s, are evenly cropped from the source image. Each source patch I_s^k ∈ I_s forms a set of sub-image pairs with the patches I_t^m ∈ I_t selected by sliding windows at the interval of s_t in the target image. The similarity is predicted at each position of the target image and makes up a heatmap, M_k. The position of the maximum, loc_m, in the heatmap is, in theory, the coordinate of the target patch best matched to I_s^k. The ideal heatmap has only one peak whose value is significantly different from those at other positions, as shown in Figure 4a; it indicates a high confidence of a correct match. On the contrary, heatmaps with a low maximum value or multiple peaks, as shown in Figure 4b, indicate low confidence and need to be eliminated.

In general, the correct matches are photographed at the same geographical area containing similar features, while the unmatched pairs are not (Figure 5b). However, in practice, some pairs that contain few similar or prominent features are difficult to classify even if they come from the same geographical region, and should be regarded as unmatched pairs. For instance, the ground surface changes greatly with passing time in the pair shown in Figure 5c, and the pair in Figure 5d represents weak-texture images without prominent features. Weak-texture images are generally vegetation or water. Their similarity scores are inevitably higher than those of other unmatched pairs, though they are from different regions, because feature correlation essentially estimates the similarity by a scalar product, which cannot distinguish whether the features are salient.
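The unique-significant-peak rule can be sketched as follows (illustrative thresholds; the paper's exact values for the score threshold l and the significance margin are not reproduced here):

```python
import numpy as np

def is_confident_match(heatmap: np.ndarray, l: float = 0.5,
                       margin: float = 0.1, r: int = 2):
    """Accept a heatmap only if its maximum is high enough AND clearly
    above the second peak found outside a radius-r neighborhood of the
    maximum. Returns (accepted, (row, col) of the peak)."""
    loc = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    m_max = heatmap[loc]
    if m_max < l:                      # low maximum: low confidence
        return False, loc
    # Exclude the sphere around the peak, then search for the second peak.
    masked = heatmap.copy()
    r0, c0 = loc
    masked[max(0, r0 - r):r0 + r + 1, max(0, c0 - r):c0 + r + 1] = -np.inf
    m_2nd = masked.max()
    return bool(m_max - m_2nd >= margin), loc
```

A single strong peak is accepted; two comparable peaks, or a uniformly low map, are rejected as low-confidence candidates.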
Accordingly, we propose a simple strategy to filter out weak-texture images directly, using the statistical characteristics of the intensity, before the training and registration processes. Considering that the differences between pixel values in the green and blue color channels of vegetation and water images are small, the filtering rules are as follows:

σ_G < TH, σ_B < TH, (1)

where σ_G and σ_B are the standard deviations (SD) of the intensities in the green and blue channels, respectively, and TH is the threshold of the maximum SD. The sub-image pair is excluded when it satisfies at least one of the conditions in (1).
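The rule in (1) amounts to a few lines of code (a sketch; the RGB channel order and the threshold value TH = 10 are assumptions for illustration, not values from the paper):

```python
import numpy as np

def is_weak_texture(patch_rgb: np.ndarray, th: float = 10.0) -> bool:
    """Return True if the patch should be excluded by the rules in (1):
    the SD of the green OR the blue channel falls below the threshold TH.
    patch_rgb: (H, W, 3) array, channels assumed to be in RGB order."""
    sigma_g = patch_rgb[..., 1].std()   # SD of green-channel intensities
    sigma_b = patch_rgb[..., 2].std()   # SD of blue-channel intensities
    return bool(sigma_g < th or sigma_b < th)
```

A flat (water- or vegetation-like) patch is rejected, while a patch with substantial intensity variation passes.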



Workflow of MSI
The calculation procedure of MSI, combining ScoreCNN and matching-pair filtering, is detailed in Algorithm 1. A set of sub-images is cropped from the source image, and the heatmap of each source sub-image is generated by ScoreCNN as described in Section 2.1.1. Then, a set of high-quality matched pairs is obtained according to Section 2.1.2. Lines 13-14 in Algorithm 1 indicate that the sphere around the peak where the maximum locates is excluded when searching for the second peak, which simplifies the calculation. Here, r is the radius of the sphere and is related to the sliding interval s_t; for example, r = D·s_t for a constant D.

Algorithm 1 Matching of Sub-Images
Input: source image I_s and target image I_t; trained ScoreCNN model N
Output: matched sub-images
1: Cut sub-images I_s from I_s
2: for each I_s^k ∈ I_s do
3:   # Forward propagation
4:   Generate heatmap M_k through model N
5:   # Find best-matched sub-image from target image
6:   …
8:   # Filter outliers
9:   if (I_s^k, I_t^m) satisfy Equation (1) or M_max < l then
10:    continue
11:  else
12:    S_max ← neighborhood of radius r centered on loc_m
13:    …

Estimation of Transformation Parameters
ETP outputs the global transformation matrix between the source image and the target image from the input set of matched sub-image pairs. As common parameter-estimation methods such as Random Sample Consensus (RANSAC) [48] are not applicable here, we first discuss two possible approaches for further suppressing low-quality inputs before detailing ETP.
One is to regress the parameters of each pair separately and assemble them as the final result; the other is to fuse the features and regress the parameters directly with the network. The former inputs one pair at a time and forward propagates several times, while the latter inputs multiple pairs and forward propagates once. Preliminary experiments showed that the errors of the per-pair regression in the former approach were magnified when assembling, making the total error unacceptable. Assuming the coordinate on the horizontal axis after linear transformation is x' = θ_1·x + θ_2·y + θ_3, and the parameter error is Δθ_1 = θ̂_1 − θ_1, the estimation error of x' is Δθ_1·x, which means that, for the same parameter error Δθ, the coordinate error increases with the image size and may not be counterbalanced. Therefore, we adopt the latter approach, which is superior in theory and learnable. The set of sub-images from the source image and the target image is defined as I = {(I_s, I_t) | I_s, I_t ∈ R^(m×H×W)}. The mapping E of ETP is expressed as follows:

A_θ = E(I), E : I → R^DoF,

where m is the number of sub-images, H × W is the size of the sub-images, and DoF represents the degrees of freedom of the transformation matrix. The challenge is that m is an uncertain number, meaning uncertain inputs, while cutting it down to a fixed number may lead to higher errors and a loss of information.
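The magnification argument above can be checked numerically (a toy illustration, not the paper's experiment): the same parameter error Δθ_1 produces a coordinate error Δθ_1·x that grows with the coordinate, i.e., with image size.

```python
def coord_error(dtheta1: float, x: float) -> float:
    """Coordinate error induced on x' = theta1*x + theta2*y + theta3
    by a parameter error dtheta1 in theta1 alone."""
    return dtheta1 * x

# The same 1% slope error yields about 1 px of error at x = 100 but
# about 20 px at x = 2000, so per-pair regression errors are magnified
# when assembled over large remote sensing images.
small = coord_error(0.01, 100)
large = coord_error(0.01, 2000)
```

This is why the per-pair regress-then-assemble approach accumulates unacceptable error on large images, motivating the single fused regression.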
To tackle this challenge and minimize the impact of outliers, we propose a CNN with weight structure and position embedding. The architecture, outputs, and loss function of ETP are introduced in detail in the following sections.

Architecture of ETP
The general architecture of the parameter estimation network is shown in Figure 6; it is mainly composed of feature extraction, position embedding, and the regression head. Analogous to MSI, feature extraction and the correlation extract similarity information from the sub-image pairs. Position embedding learns to encode the coordinates of the sub-images in the original images. The transformation matrix, A_θ, is obtained by the regression head after the combination of the correlation maps and the encodings.
Figure 6. Network architecture of ETP. I_s and I_t represent the input source sub-images and the target sub-images, respectively. C_1 represents the correlation map from I_s,1 and I_t,1. (x_s, y_s) and (x_t, y_t) represent the coordinates of the images to which they belong. v_A and v_B denote the 2-D vectors after encoding. v_A, v_B, and C are concatenated into the feature block V. A_θ denotes the final transformation matrix.
We adopt SE-ResNeXt101 [49] as the backbone of ETP's feature extraction to reach the best performance. SE-ResNeXt, which applies the Squeeze-and-Excitation (SE) module to ResNeXt, focuses on the more important features. We embed the first three layers of the backbone and apply L2-normalization [45]. Given the coordinates (x_s, y_s) and (x_t, y_t) corresponding to the sub-image pair (I_s, I_t) at the upper left corner or center of the original images, we can obtain the 2-D vectors v_A and v_B by position embedding, as follows:

v_A = P(x_n^s, y_n^s), v_B = P(x_n^t, y_n^t),

where v ∈ R^(h×w), P denotes a learnable fully connected layer, and (x_n, y_n) denotes the coordinates normalized to [−1, 1]. The coordinates are needed because the information about the translation component implied by the features of separated image pairs makes no contribution to the global translation. It is necessary to provide the respective position encoding information to the network to connect their features. In this way, the encodings also provide information for coarse alignment, making it easier for the network to learn the fine transformation. The proposal of position embedding is inspired by natural language processing (NLP). One of the approaches for absolute encoding is learnable position embedding [50], which defines fully connected layers as embedding layers. Analogously, two fully connected layers with shared weights are applied to encode the normalized coordinates of the sub-image pair. We concatenate the encodings and correlation maps into new feature blocks V instead of adding them directly, as in [50], because of the differences in the network structure, which avoids interference.
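The position embedding can be sketched as a single shared fully connected layer acting on normalized coordinates (a NumPy sketch; the random initialization stands in for trained parameters, and the normalization formula is the one assumed throughout this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x: float, y: float, H: float, W: float) -> np.ndarray:
    """Map absolute pixel coordinates to [-1, 1]."""
    return np.array([2.0 * x / H - 1.0, 2.0 * y / W - 1.0])

class PositionEmbedding:
    """Learnable embedding P (sketch): one fully connected layer, shared
    between source and target coordinates, mapping a normalized 2-D
    coordinate to an (h x w) map that is concatenated with the
    correlation map."""
    def __init__(self, h: int, w: int):
        self.h, self.w = h, w
        self.weight = 0.01 * rng.standard_normal((h * w, 2))  # trainable
        self.bias = np.zeros(h * w)                           # trainable

    def __call__(self, xy_norm: np.ndarray) -> np.ndarray:
        return (self.weight @ xy_norm + self.bias).reshape(self.h, self.w)
```

Because the layer is shared, v_A and v_B are produced by the same P applied to the source and target coordinates, matching the shared-weights description above.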
An uncertain number of feature blocks V leads to an uncertain number of channels if they are concatenated on the channel axis, which is not allowed by a general CNN. As mentioned earlier, we should take in as much valid information as possible. Hence, we designed a specific regression-head architecture that takes in all the matched pairs and merges the variable dimension into a fixed size by the operations of summing and averaging. Additionally, although most of the outliers are eliminated in the previous process, there may still be low-quality sub-images. The weight structure is accordingly proposed to self-recognize and suppress bad inputs at the same time.

The architectures of the regression head are shown in Figure 7, wherein Figure 7a is the basic architecture without the weight structure, and Figure 7b,c shows two forms of the weight structure, namely structures A and B. The trunk of the basic architecture is the same as those with the weight structure. We hypothesize that the correct high-quality matches containing implicit similar features outnumber the bad ones, so that the weight structure can learn to recognize outliers and assign them smaller weights to reduce their contribution to the regression. In the weight structures, each feature block, V_i, is multiplied by a template map based on itself. The feature map, Z, after the fusion by weighted sum can be expressed as:

Z = Σ_{i=1}^{m} α_i · D_i,

where D_i is the ith feature map after redistributing the attention to the channels of the input V_i with the channel attention mechanism [48]. α_i ∈ α represents the weight of the ith feature map, and Σ_{i=1}^{m} α_i = 1, where m represents the number of inputs. The weights are obtained as follows:

α = softmax(W(D)),

where W : R^(m×(h×w+2)×h×w) → R^m denotes the mapping from D_i to the weights before the logistic function. The difference between structures A and B lies in the order of averaging the feature maps and extracting the implicit features.
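The weighted fusion can be sketched as follows (NumPy; softmax is used here as the normalization that makes the weights sum to 1, an assumption consistent with the constraint Σα_i = 1 above):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def weighted_fusion(D: np.ndarray, raw_scores: np.ndarray) -> np.ndarray:
    """Fuse m per-pair feature maps into one fixed-size map.

    D: (m, c, h, w) feature maps after channel attention.
    raw_scores: (m,) outputs of the quality mapping W before normalization.
    Returns Z = sum_i alpha_i * D_i with the weights summing to 1, so a
    low-scoring (outlier) pair contributes little to the regression.
    """
    alpha = softmax(raw_scores)              # (m,), sums to 1
    return np.tensordot(alpha, D, axes=1)    # (c, h, w), fixed size
```

Note how the fusion removes the variable dimension m: however many pairs are fed in, the regression head always sees a map of fixed shape (c, h, w).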
The details of the blocks in Figure 7 are described in Table 2, where blocks with the same name share the same settings. We compare these structures in Section 3.2.1; their performance increases in the order shown in Figure 7, which means structure B is the best.

Transformation Matrix
Given that remote sensing images are distant shots, we adopt the normalized affine transformation, with a size of 2 × 3 and 6-DoF, as it is more suitable than the 8-DoF projection for close shots or the 18-DoF thin plate spline for complex non-linear distortion. The normalized transformation matrix, A_θ, is the output of ETP instead of the original affine matrix M. The two forms of the transformation of pixels between the source image and the target image are denoted as:

[x_n^t, y_n^t]^T = A_θ [x_n^s, y_n^s, 1]^T, [x_i^t, y_i^t]^T = M [x_i^s, y_i^s, 1]^T,

where (x_n^s, y_n^s) and (x_n^t, y_n^t) represent the normalized coordinates of the points in the source image and the target image, respectively, while (x_i^s, y_i^s) and (x_i^t, y_i^t) represent the absolute pixel coordinates. The transfer between absolute and normalized coordinates is defined as:

x_n = 2x_i/H − 1, y_n = 2y_i/W − 1,

where W and H are the width and height of the images, respectively, and the coordinates satisfy −1 ≤ x_n^s, y_n^s, x_n^t, y_n^t ≤ 1, 0 ≤ x_i^s, x_i^t ≤ H, and 0 ≤ y_i^s, y_i^t ≤ W. With this, we can obtain the transfer between A_θ and M.
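A small sketch of the coordinate transfer and the A_θ-to-M conversion, assuming the normalization convention stated above (x scaled by H, y by W); the conversion follows from composing the normalization matrix T with A_θ as M = T⁻¹ A_θ T, which is one standard derivation, not necessarily the paper's exact formulation.

```python
import numpy as np

def to_normalized(pts, H, W):
    """Pixel coords -> [-1, 1], under the stated bounds 0 <= x <= H, 0 <= y <= W."""
    pts = np.asarray(pts, dtype=float)
    return np.stack([2 * pts[:, 0] / H - 1, 2 * pts[:, 1] / W - 1], axis=1)

def to_absolute(pts, H, W):
    """Inverse of to_normalized."""
    pts = np.asarray(pts, dtype=float)
    return np.stack([(pts[:, 0] + 1) * H / 2, (pts[:, 1] + 1) * W / 2], axis=1)

def denormalize_affine(A_theta, H, W):
    """Convert the normalized 2x3 affine A_theta to the pixel-space matrix M
    via M = T^-1 @ A_theta @ T, with T the coordinate-normalization matrix."""
    T = np.array([[2.0 / H, 0.0, -1.0],
                  [0.0, 2.0 / W, -1.0],
                  [0.0, 0.0, 1.0]])
    A3 = np.vstack([A_theta, [0.0, 0.0, 1.0]])  # lift to 3x3 homogeneous form
    return (np.linalg.inv(T) @ A3 @ T)[:2]

identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
M = denormalize_affine(identity, H=1080, W=1080)  # identity stays identity
```

A sanity check of the convention: the normalized identity maps back to the pixel-space identity, and the coordinate transfer round-trips.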

Loss Function
We adopt the grid loss [45] for training. Given the estimation result, Â, and the ground truth, A_GT, of the global transformation, the loss function is defined as follows:

L(Â, A_GT) = (1/N) Σ_{i,j} d(T_Â(x_i, y_j), T_{A_GT}(x_i, y_j)),

where (x_i, y_j) are the normalized image coordinates of uniformly distributed grid points, N is the number of grid points, d(·) measures the distance between transformed points, and T(·) is the operator of the transformation, which outputs the transformed coordinates.
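The grid loss can be sketched directly from the definition. This assumes d(·) is the Euclidean distance and the grid spans the normalized range [−1, 1]; both are plausible readings of [45], not confirmed details.

```python
import numpy as np

def grid_loss(A_pred, A_gt, n=10):
    """Average Euclidean distance between an n x n grid of normalized points
    transformed by the predicted and ground-truth 2x3 affine matrices."""
    xs = np.linspace(-1.0, 1.0, n)
    gx, gy = np.meshgrid(xs, xs)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)            # N = n*n points
    warp = lambda A: grid @ np.asarray(A)[:, :2].T + np.asarray(A)[:, 2]
    return np.linalg.norm(warp(A_pred) - warp(A_gt), axis=1).mean()

I = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shift = np.array([[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]])
```

For example, a pure translation of 0.1 in normalized x yields a loss of exactly 0.1, since every grid point moves by the same offset.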

Training and Augmentation
The steps of the training workflow are as follows: (1) cropping sub-image samples from the source image and the target image, (2) internal augmentation, (3) forward propagation, (4) calculating the loss function, (5) backward propagation. Although the networks in MSI and ETP are different, the generation of training samples and augmentation share characteristics, which simplifies the procedure of training. The details are described in the next section.

Training with Shared Samples
The generation of the training samples occurs online and entails cropping and processing the corresponding sub-images dynamically before they are input to the network. The parameters of the affine transformation for the synthetic transformed images are simulated based on singular value decomposition (see [43]), where the rotation angle is α ∼ U(−π, π), the anisotropic scaling factors are s_x, s_y ∼ U(0.35, 1.65), the shear angle is h ∼ U(−0.75, 0.75), and the translations are t_x, t_y ∼ U(−256, 256).
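A sketch of sampling such a synthetic affine matrix within the stated ranges. The paper composes the matrix via an SVD-based scheme from [43]; here a rotation–shear–scale composition is used instead as one plausible variant covering the same parameter ranges.

```python
import numpy as np

def sample_affine(rng):
    """Sample a synthetic 2x3 affine within the stated ranges. Note: the
    exact SVD-based composition of [43] may differ; this uses a plausible
    rotation @ shear @ scale composition plus translation."""
    a = rng.uniform(-np.pi, np.pi)                   # rotation angle
    sx, sy = rng.uniform(0.35, 1.65, size=2)         # anisotropic scales
    h = rng.uniform(-0.75, 0.75)                     # shear
    tx, ty = rng.uniform(-256, 256, size=2)          # translations
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    Sh = np.array([[1.0, h], [0.0, 1.0]])
    S = np.diag([sx, sy])
    return np.hstack([R @ Sh @ S, [[tx], [ty]]])     # 2 x 3 affine

M = sample_affine(np.random.default_rng(0))
```

Since rotation and shear have unit determinant, the determinant of the linear part equals s_x · s_y, bounded by the sampled scale ranges.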
A number n_s of sub-images are cropped from the source image; their areas are required to cover almost the whole image, for example, by cropping at equal intervals. In the transformed target image, the target sub-images and the corresponding source sub-images form pairs as positive samples, while the negative samples are composed of any two sub-images at non-corresponding positions. The positive and negative samples are sent to ScoreCNN, and the cross-entropy loss employed for the backpropagation is calculated with the outputs and the ground-truth matching labels.
The online generation of training samples allows flexible tuning and smaller storage occupation; however, a few sub-images fall out of bounds because of the violent deformation, resulting in false positive samples, as shown in Figure 8. We set the labels of such pairs, and of weak-texture pairs satisfying (1), to '0', namely, unmatched pairs. To balance the ratio of positive and negative samples to 1:1, some of the false positive samples are replaced dynamically by other matched pairs. The label y_k^ij of the sub-image pair {I_s^k, I_t^ij} is defined by:

y_k^ij = 1 if Loc_k corresponds to Loc_ij and both windows lie within the image, and 0 otherwise,

where Loc_k and Loc_ij denote the locations of I_s^k and I_t^ij in the target image, respectively.
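One plausible implementation of this labeling rule is sketched below; the tolerance-based criterion is an assumption for illustration (the paper's exact correspondence test may differ).

```python
import numpy as np

def match_label(loc_src, loc_tgt, M, tol):
    """Hypothetical sketch of the labeling rule: a pair is positive ('1')
    when the source sub-image location, mapped into the target image by the
    2x3 affine M, lands within `tol` pixels of the target sub-image
    location; otherwise it is labeled '0' (unmatched)."""
    M = np.asarray(M, dtype=float)
    mapped = M[:, :2] @ np.asarray(loc_src, dtype=float) + M[:, 2]
    return int(np.linalg.norm(mapped - np.asarray(loc_tgt, dtype=float)) <= tol)

identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pos = match_label((100, 100), (100, 100), identity, tol=32)  # matched pair
neg = match_label((100, 100), (400, 400), identity, tol=32)  # non-corresponding
```

Out-of-bounds and weak-texture pairs would additionally be forced to '0' before balancing the positive/negative ratio.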
The generation of training samples for ETP is similar to that of ScoreCNN as described above. The positive samples that are distributed evenly in the image are reutilized instead of the results of MSI to speed up the training and reduce the dependency on the quality of the former steps. Choosing the m pairs with higher matching scores among them, or the pairs with scores over the threshold T_s, is optional to reinforce the association between MSI and ETP. These positive samples and the corresponding coordinates are input to the network together. Then the grid loss is calculated with the outputs and ground truths for the back propagation. To improve generalization, the sequence of inputs is shuffled.

Augmentation Based on the Bounding Rectangle
The conventional feature extraction defines corresponding points precisely, while corresponding sub-images are based on areas instead of points, where areas within a specific nearby scope can be regarded as correct correspondences. When generating training samples, the areas transformed from the rectangular source sub-images are quadrilateral, and their bounding rectangles are often not the same size as the target sub-image windows. When cropping positive samples from the target images, the windows can lie inside the bounding rectangles or contain them completely, as shown in Figure 9. Hence, we augment the target sub-images by cropping randomly within this scope instead of at the exact centers transformed from the source sub-images. The augmentation is applied to training in both MSI and ETP. It avoids the overfitting caused by fixed coordinates ignoring the image features, and it also simulates the offsets of sub-images when matching them, as shown in Figure 10.
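The random-crop augmentation can be sketched as follows, under the assumption that the crop center is jittered within the offset allowed by the difference between the quadrilateral's bounding rectangle and the window size; the function and parameter names are illustrative.

```python
import numpy as np

def augmented_center(corners_t, win, rng):
    """Sample a random crop center for the target sub-image. `corners_t` are
    the four corners of the source window after the affine transform; the
    center is drawn within the offset allowed by the difference between the
    bounding rectangle of that quadrilateral and the window size `win`."""
    corners_t = np.asarray(corners_t, dtype=float)       # 4 x 2 corners
    (x0, y0), (x1, y1) = corners_t.min(axis=0), corners_t.max(axis=0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0            # rectangle center
    dx = abs((x1 - x0) - win) / 2.0                      # allowed x offset
    dy = abs((y1 - y0) - win) / 2.0                      # allowed y offset
    return cx + rng.uniform(-dx, dx), cy + rng.uniform(-dy, dy)

# degenerate check: a square exactly the window size allows no jitter
square = [[150, 150], [250, 150], [250, 250], [150, 250]]
center = augmented_center(square, win=100, rng=np.random.default_rng(0))
```

When the bounding rectangle equals the window size the center is fixed; otherwise each epoch sees a slightly different crop, simulating the matching offsets.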

Dataset and Experimental Settings
The dataset we mainly used included multi-temporal remote sensing image pairs with correct geographical correspondence collected by [45], called the Google Earth (GE) dataset in this paper, as well as some of the images released by the International Society for Photogrammetry and Remote Sensing (ISPRS) [51] and the Wuhan University (WHU) Building Dataset [52] for further testing. The images in the GE dataset were from Google Earth and were taken at different times (2019, 2017, and 2015) and with different sensors (Landsat-7, Landsat-8, Worldview, and Quickbird). Each pair of images was divided into pre-temporal and post-temporal images, with sizes of 1080 × 1080. The numbers of image pairs in the training set, validation set, and testing set were 9000, 620, and 500 pairs, respectively. Various objects are included, and the proportions of the main categories are shown in Table 3. The post-temporal images were regarded as the source images, and the pre-temporal images were synthesized by affine transformation and regarded as the target images. The implementation of this model was based on PyTorch and trained on a GeForce RTX 3090, optimized by Adam with an initial learning rate of 5 × 10^−4. The experimental settings of the hyper-parameters mentioned previously are listed in Table 4. Except for t, l, and TH, there were no strict requirements for the settings of the other parameters.

Ablation Study
To prove the effectiveness of the components and the training methods, several experiments were conducted and analyzed. To ensure the objectivity of the testing set, the validation set was utilized to evaluate the models in this section.
We applied accuracy, area under the ROC curve (AUC), F1 score, the false positive rate at the true positive rate of 95% (FPR95), and the number of parameters as evaluation indices for ScoreCNN. As for ETP, we adopted different indices for the validation and testing sets, because there were specified keypoints for assessment in the testing set but not in the validation set. The average probability of correct keypoints (PCK) [53], which was also used in [45], and the average mean absolute error (MAE) were adopted for the evaluation on the testing set. Given the keypoints P_t^i, PCK can be calculated by:

PCK = |{P_t^i : d(T(P_t^i), P_GT^i) < α · max(H, W)}| / n,

where T(P_t^i) represents the locations of the keypoints after transformation, T(·) denotes the transformation function, P_GT^i represents the ground-truth locations, d(·) calculates the distances between the predicted and ground-truth points, |·| measures the number of elements of the correct point set, and n is the total number of keypoints. α · max(H, W) represents the distance threshold of correct points in an image of size H × W, where α is the normalized distance threshold. α was set to 0.01, 0.03, and 0.05 in line with the image size, where different values reflect different tolerances of pixel error for correct points. A smaller α implies a smaller pixel offset of correct points and a stricter precision requirement. PCK assesses the precision and robustness of registration globally at the same time. A higher PCK indicates higher rates of correct keypoints and more well-aligned image pairs. To evaluate the absolute pixel errors on the validation set, we set up MAE_grid based on MAE, calculated as follows:

MAE_grid = (1/n) Σ_{j=1}^{n} d(T(P_t^j), P_GT^j),

where T(P_t^j) and P_GT^j denote the predicted and ground-truth keypoints transformed from the grid points distributed evenly in the source image, and n represents the number of grid points, which can be, for example, 10 × 10 = 100. The difference between MAE and MAE_grid is the points used for error calculation.
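Both metrics can be sketched in a few lines, assuming d(·) is the Euclidean distance (one plausible reading; the paper does not spell out the distance explicitly here).

```python
import numpy as np

def pck(pred, gt, H, W, alpha):
    """Probability of correct keypoints: fraction of predicted keypoints
    within alpha * max(H, W) pixels of the ground-truth locations."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return float(np.mean(d < alpha * max(H, W)))

def mae_grid(pred, gt):
    """Mean error over evenly distributed grid keypoints, assuming the
    Euclidean distance for d(.)."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return float(d.mean())

gt_pts = [[0.0, 0.0], [10.0, 10.0]]
pred_pts = [[0.0, 3.0], [10.0, 10.0]]    # first point off by 3 px
```

On a 1080 × 1080 image, α = 0.01 gives a 10.8-pixel threshold, so both toy points above count as correct; a much stricter α would reject the 3-pixel miss.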

Effects of ETP's Architectures
We trained and compared the three different regression architectures of ETP described in Section 2.2.1 to find the best one. Figure 11 and Table 5 show the performance of the training process and the training results, respectively, on the validation set. Although the loss and error of the basic architecture decreased faster than the others at the beginning, the architectures with weight structures performed better later, with a smaller MAE_grid. This preliminarily proved that the proposed weight structures are effective and that structure B is better than A.
The weight structure plays a role in discriminating and screening the image pairs, meaning the weights on the pairs that are not conducive to the regression are lowered. As shown in Figure 12, higher weight values were assigned to the high-quality pairs (a) and (b) with abundant matched features. However, lower weight values were assigned to the sub-image pairs (c) and (d), where the corresponding features were not enough or had poor distribution and contributed less to the regression, though they were true positive pairs.

Table 5. Comparative results of different architectures and inference procedures on the validation set. "S→T" and "T→S" indicate the predicted transformations in two inference directions of the same trained model. "Bidirection" refers to the two-stream model trained bidirectionally and as an ensemble.

Aiming at the problem of asymmetry in registration mentioned in [43], we conducted experiments to explore the symmetry of our networks' results. The transformation matrix and its inverse were predicted by exchanging the positions of the source and target images, and the ensemble method [43] was utilized, where the models were trained with a bidirectional loss, namely the two-stream model. We let the transformation matrix from the source image to the target image be Â_{S→T}, and Â_{T→S} in the opposite direction; Â_{S→T}^{-1} and Â_{T→S}^{-1} are the inverse matrices. The ensemble refers to averaging Â_{S→T} and Â_{T→S}^{-1} as the final result with the two-stream model. Table 5 shows that the performances of the predictions of Â_{S→T} and Â_{T→S} were alike on the validation set even if the model was trained in only one direction. Although the bidirectional training brought in more constraints, it did not improve the performance of our model. This reflects the generalization ability of our method and the full use we make of the information in the dataset through more generated samples, which avoids asymmetry.

Effects of the Augmentation
In the training of ScoreCNN and ETP, we applied the augmentation of random cropping described in Section 2.3.2, and compared the models with and without. Tables 6 and 7 show that augmentation had a positive impact on the task of both matching and regression, in which the accuracy of ScoreCNN increased by nearly 1%, and the average regression error decreased by about 6%.

Final Results
ResNet-18 and SE-ResNeXt-101 were adopted as the backbones of ScoreCNN and ETP in the best model of our method, and weight structure B was adopted in ETP. A number n_s of sub-images were cropped from each source image, and the matched target sub-images were searched with sliding interval s_t in the corresponding target image. According to the principle that the selected sub-images should cover at least most areas of the whole image, n_s should preferably be no less than 25. We adopted the average matching rate (AMR) and average matching precision (AMP) to evaluate the number and quality of the generated sub-image pairs. AMR was calculated by:

AMR = (1/N) Σ_{i=1}^{N} N_M^i / n_s,

where N_M^i denotes the number of matched sub-image pairs in each image pair, n_s denotes the number of initially cropped source sub-images, and N is the total number of image pairs. AMP was calculated by:

AMP = (1/N) Σ_{i=1}^{N} NCM^i / N_M^i,

where NCM^i is the number of correct matches. Figure 13 shows the AMR and AMP at different n_s, where the matching performance is good when n_s ≥ 25.
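The two statistics above reduce to simple averages over the image pairs; a minimal sketch (the list-based inputs are an illustrative interface):

```python
def amr(matched_counts, n_s):
    """Average matching rate: mean over image pairs of
    (# matched sub-image pairs) / (# cropped source sub-images n_s)."""
    return sum(m / n_s for m in matched_counts) / len(matched_counts)

def amp(correct_counts, matched_counts):
    """Average matching precision: mean over image pairs of
    (# correct matches) / (# matched sub-image pairs)."""
    return sum(c / m for c, m in zip(correct_counts, matched_counts)) \
        / len(matched_counts)

# toy example: two image pairs, n_s = 36 cropped source sub-images each
rate = amr([18, 36], n_s=36)          # 18/36 and 36/36 matched
precision = amp([9, 18], [18, 36])    # half of the matches are correct
```

AMR measures how many sub-image pairs survive the matching stage, while AMP measures how many of the survivors are actually correct.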

The trained models were tested on the testing set provided in [45] according to the registration process in Figure 1. The images of the testing set contained all kinds of terrain, such as buildings, river banks, bridges, fields, wasteland, and forests, and 20 keypoints were chosen for each image to evaluate the registration.
We compared our method with SIFT, the representative of conventional methods, and an advanced deep learning method, DAM [45]. The best model provided in DAM was
We compared our method with SIFT, the representative of conventional methods, and an advanced deep learning method DAM [45]. The best model provided in DAM was used for comparison, which is the two-stream network with bidirectional ensemble and the backbone of SE-ResNeXt101. Table 8 shows the PCK comparison results of all images registered in the testing set. It is evident that all our proposed models with different architectures achieved the best results in all cases of α, and the improvement of models with the weight structure was more significant. A PCK of over 99% at α = 0.05 showed that nearly all images were correctly matched within the tolerance. The significant improvement of PCK at α = 0.01, which was about 16% higher than DAM provided by our method with 'SE-resnext101' + 'structure B' + 'n s = 36', reflected the increase in registration precision. The PCK at α = 0.01 of weight structure B increased by 9.4% over the basic structure and 7.7% over structure A with the same n s , which showed the superiority of the weight structure, especially structure B. Overall, we proved that our method provided improvement in registration precision and retained or even optimized the robustness of the deep learning method. Table 8. Comparison of probability of correct keypoints (PCK) for registration with different methods on testing set. The whole registration stream for SIFT is SIFT + Random Sampling and Consensus (RANSAC) [48]. 'Int. Aug.' + 'Bi-En.' represents the best one in [45]. The scores marked with " †" were brought from [45] and evaluated on the same dataset. The models in the two-stage approach we proposed are in single-stream architecture, and n s denotes the number of cropped source sub-images. 
Conventional methods may achieve higher precision on images with small changes that are easy to register, but they fail easily when there are complex differences or distortions and not enough correct matches. In contrast, registration methods based on deep learning were more robust and applicable to images in many circumstances; moreover, their performance had a low correlation with feature types and the extent of geometric distortion. Our method is devoted to improving registration precision while keeping the strong robustness of deep learning, so that the error stays within an acceptable range. We selected several representative image pairs with drastic changes in vegetation, topographic fluctuation, deformation, and occlusion to evaluate the qualitative and quantitative results of the different methods, as shown in Figure 14. In the first pair of images, the vegetation and river changed greatly, and the features were sparse; the second pair had high-rise buildings with greatly different appearances that shaded other areas owing to aerial filming; the third had abundant features, but the fields and constructions changed greatly; most areas in the fourth pair were vegetation, with variation in constructions, but the few correspondences occupied only a small part of the images, making the registration difficult. The results of the alignment are shown in Figure 14, and the MAE of the control points is shown in Table 9. It is evident that SIFT failed to register when the corresponding features were few or ambiguous, and the error was still large on the third pair, though it was barely aligned. Both our method and DAM could achieve coarse registration of these pairs, while our method achieved better alignment, as clearly presented in the marked boxes in Figure 14, where key areas such as roads and lines are linked well.
Thus, our method had the best robustness for different types of images with great differences and deformation, followed by DAM, and also had higher precision.
Table 9. Comparison of quantitative results. Columns 1~4 represent the registration errors for pairs 1~4 in Figure 14, respectively. "\" represents the failure of registration. MAE: mean absolute error.

Robustness

To further test the robustness of the proposed method for multi-temporal optical remote sensing images, images from different times and other regions outside the testing set of the Google Earth dataset above were collected for more experiments, as shown in Figure 15. The images were acquired in 2015, 2017, 2018, 2019, and 2020, and were greatly affected by human activities, seasonal changes, clouds, and illumination, making them harder to register. Similar to the previous results, Figure 15 shows that the deep learning methods could handle the registration of multi-temporal images with extremely complex variance. However, the proposed method was more precise than the others, as shown by the alignment of the unchanged roads in Figure 15, while SIFT still failed in these cases. This indicates that our method is not only effective for images from the two specific times and regions collected by [45], but is also robust for images of multiple times and other regions.
Experiments were also carried out on images with huge differences in vegetation, buildings, occlusion, and land coverage, as shown in Figure 16. Unlike the images in Figure 14, the corresponding landmarks were more ambiguous, and the registrations were more likely to fail even with deep learning methods. However, our method was able to capture and match the only few available features and obtained satisfactory results.
In addition to the registration among high-resolution remote sensing images from Google Earth, we also conducted experiments on other high-resolution images in the ISPRS Dortmund [51] and WHU [52] datasets. The images in ISPRS were taken obliquely by a multi-head camera without alignment in advance, while the images in the WHU dataset were finely registered, and the deformation was synthetic, similar to the Google Earth dataset. The registration results are displayed in Figure 17. Our method can register the images of ISPRS with perspective changes and uneven illumination better than DAM, which is also based on deep learning. As the images of WHU were cropped from a wide range of ultra-high-resolution images after splicing and other processing, the data characteristics differed from real images, resulting in only approximate alignment.

Limitations
Although our method showed good robustness, the MAE of our method did not achieve sub-pixel errors even in cases of successful registration, as discussed below. The images had a high resolution with massive details that changed with the seasons, and some were shot at a low altitude and greatly affected by topographic fluctuation. Moreover, our method, which is based on regression with deep learning, comes with uncertainty due to noise in the data or the mismatch between the training and testing data. Further improvement of precision requires more training data with little noise.
We experimented on images with slight changes in features and small deformation, shown in Figure 18. The registration errors in Table 10 show that for these easily registered images, especially those in Figure 18c with salient corners, SIFT succeeded in registration and achieved relatively good precision, while the results of DAM had large errors. Although the precision of registration with our method was slightly worse than that of SIFT, the errors were stable and restricted to an acceptable range. Additionally, the networks did not learn from cases of images with large-scale variance owing to the limitation of the training dataset, leading to degeneration of the performance, as shown in Figure 19. Additional data are required to address this limitation.
Figure 18. Rows are the source images, the target images, and the warped results by our method, from top to bottom. "o" and "x" denote the keypoints in the source images and target images, respectively.
Table 10. Comparison of registration results on the images displayed in Figure 18.
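For reference, the MAE and PCK values reported in this section can be computed directly from corresponding keypoints. The sketch below is our illustration (the array shapes and the use of the image side length to normalize distances for α are assumptions, not details taken from the paper):

```python
import numpy as np

def registration_errors(pred_pts, gt_pts, img_size, alpha=0.05):
    """Compute MAE (pixels) and PCK for warped vs. ground-truth keypoints.

    pred_pts, gt_pts: (N, 2) arrays of keypoint coordinates (x, y).
    img_size: reference image side length used to normalize distances.
    alpha: normalized distance threshold for PCK.
    """
    dists = np.linalg.norm(pred_pts - gt_pts, axis=1)  # Euclidean error per keypoint
    mae = dists.mean()                                  # mean error in pixels
    pck = np.mean(dists / img_size <= alpha)            # fraction within alpha * img_size
    return mae, pck

# Hypothetical example: 3 keypoints on a 256 x 256 image
pred = np.array([[10.0, 10.0], [100.0, 50.0], [200.0, 220.0]])
gt = np.array([[11.0, 10.0], [101.0, 52.0], [200.0, 219.0]])
mae, pck = registration_errors(pred, gt, img_size=256, alpha=0.05)
```

In this toy case all errors fall below the threshold of 0.05 × 256 = 12.8 pixels, so PCK is 1.0; a sub-pixel MAE would require every keypoint error to be below one pixel on average.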

Conclusions
In this paper, a two-stage deep learning registration method based on sub-image matching was proposed for remote sensing images with dramatic changes across different time periods. The method mainly consists of two closely connected modules, namely the matching of sub-images (MSI) and the estimation of transformation parameters (ETP). We adopt a matching CNN with an inner product structure of features to predict the similarities between sub-images and a fast algorithm to screen out high-quality matched pairs. Then, a network with a weight structure and position embedding, which can handle an uncertain number of sub-image inputs and assign weights according to input quality to suppress low-quality matches, is proposed to directly estimate the global affine transformation model. We also introduce a training strategy of sample sharing and an augmentation based on the bounding rectangle to achieve better performance. The proposed method was evaluated qualitatively and quantitatively on the Google Earth [45], ISPRS [51], and WHU [52] datasets. The results showed that the maximum improvement in PCK provided by our method was 16.8% at α = 0.01 compared with DAM, and our PCK at α = 0.05 exceeded 99%. Thus, we proved that the proposed method improves the precision of registration while maintaining the robustness of deep learning methods. As there is still room for improvement in precision, adding more constraints to the networks may be a possible future research direction. The proposed method can also be extended to other types of parameterized registration models, such as projective and polynomial models.
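As a classical point of comparison for the ETP stage, a global affine model can also be fitted from matched sub-image centers by weighted least squares, where per-pair weights play a role analogous to the learned weight structure in suppressing outliers. The sketch below is our own illustration of that classical analogue, not the paper's network:

```python
import numpy as np

def weighted_affine(src, dst, w):
    """Fit a 2x3 affine A minimizing sum_i w_i * ||A [x_i, y_i, 1]^T - dst_i||^2.

    src, dst: (N, 2) matched point coordinates; w: (N,) non-negative weights.
    """
    ones = np.ones((len(src), 1))
    X = np.hstack([src, ones])          # (N, 3) homogeneous source points
    sw = np.sqrt(w)[:, None]
    # Weighted least squares: scale both sides by sqrt(w), then solve
    sol, *_ = np.linalg.lstsq(sw * X, sw * dst, rcond=None)
    return sol.T                        # (2, 3) affine matrix

# Hypothetical example: points related by a pure translation (+5, -3),
# with one corrupted pair whose near-zero weight suppresses its influence
src = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.], [5., 5.]])
dst = src + np.array([5., -3.])
dst[4] = [100., 100.]                   # outlier match
w = np.array([1., 1., 1., 1., 1e-6])    # down-weighted outlier
A = weighted_affine(src, dst, w)
```

With the outlier weight near zero, the recovered matrix is close to the true translation [[1, 0, 5], [0, 1, −3]]; with uniform weights the outlier would pull the fit far from it, which mirrors why weighting matched pairs by quality matters.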
Author Contributions: Y.C. conceived the paper, designed and performed the experiments, and wrote the paper. J.J. supervised the study and modified the paper. Both authors have read and agreed to the published version of the manuscript.

Data Availability Statement: The Google Earth [45], ISPRS [51], and WHU building [52] datasets presented in this work are openly available.