UAV Remote Sensing Image Automatic Registration Based on Deep Residual Features

Abstract: With the rapid development of unmanned aerial vehicle (UAV) technology, the volume of UAV remote sensing images is increasing sharply. However, because the field of view of UAV remote sensing is limited, UAV images of the same scene obtained from different viewpoints need to be stitched together for further applications. Therefore, an automatic registration method for UAV remote sensing images based on deep residual features is proposed in this work. It needs no additional training and does not depend on image features, such as points, lines and shapes, or on specific image contents. The registration framework is built as follows. To address the problem that most traditional registration methods use only low-level features, we adopt deep residual features extracted by a deep neural network, ResNet-50. Then, a tensor product is employed to construct feature description vectors from the extracted high-level abstract features. Finally, the progressive sample consensus (PROSAC) algorithm is exploited to remove false matches and fit a geometric transform model so as to enhance registration accuracy. The experimental results for typical scene images with different resolutions acquired by different UAV image sensors indicate that the proposed algorithm achieves higher registration accuracy than a state-of-the-art deep learning registration algorithm and other popular registration algorithms.


Introduction
Nowadays, unmanned aerial vehicles (UAVs) are often used to collect airborne remote sensing images. However, the fields of view of drone images are limited by flight heights and camera focal lengths. As a result, it is usually impossible to display an entire study area in a single image. In this case, image registration technology can assemble several single images with overlapping areas according to their feature information to yield a large-scope scene image for subsequent scientific research or applications [1][2][3]. Hence, UAV image registration is widely used in scene mosaicking and panorama production, so as to integrate the information of acquired images and make up for the shortcomings of UAV photography.
In the field of image registration, algorithms based on feature points are the most popular. In 1999, the scale-invariant feature transform (SIFT) operator was proposed by D.G. Lowe [4]. Features that are invariant to image scale and rotation can be obtained by using this method, so it has been widely used [5][6][7]. Since it takes a long time for the SIFT algorithm to yield 128-dimensional feature descriptors, some scholars have proposed improved versions. The most famous is the speeded-up robust features (SURF) algorithm [...]
[...] aggregation by the Kronecker product is adopted in feature vector construction. The progressive sample consensus (PROSAC) algorithm is utilized to remove false matches and fit registration parameters. In experiments, the proposed method is compared with another prominent registration method based on deep learning and with current representative registration methods based on point features. The rest of this article is organized as follows. The second section introduces our UAV image registration strategy in detail. The experimental results on images of different scenes and from different sources are given and analyzed in the third section. The fourth section concludes this article.
Figure 1 illustrates the registration pipeline of our proposed method based on deep residual features. First of all, the deep residual neural network ResNet-50 is exploited to extract features for image registration. Here, the center of each 8 × 8 pixel region of an image is taken as a feature point. A multi-scale feature description vector of each feature point is constructed from convolution-layer outputs of the feature-extraction network in the ResNet-50 architecture. The outputs of residual blocks are merged by the Kronecker product to construct the multi-scale feature description vectors. After feature matching, the PROSAC algorithm is utilized to remove false matches and fit a geometric transform model.
Figure 1. The registration pipeline of our proposed method based on deep residual features. First, three feature maps, $F_1$, $F_2$ and $F_3$, corresponding to an 8 × 8 pixel region in UAV images are extracted from ResNet-50. Then, they are integrated by the Kronecker product to construct feature vectors. Finally, the PROSAC algorithm is performed to remove false matches and yield a homography matrix, $H$, for the registration transform.

Materials and Methods
In this work, inspired by a registration method based on the deep convolutional neural network VGG-16 [23], feature vectors of a deep residual network are built according to the characteristics of UAV remote sensing images, so that high-level abstract features of images play the main role in the registration process. In the VGG-16 feature construction, the lower-level image features output by pooling layer 1 (pool1) and pooling layer 2 (pool2) are discarded. The feature description vectors are assembled from the output features of three high-level layers, i.e., pooling layer 3 (pool3), pooling layer 4 (pool4) and a self-defined pooling layer (pool5-1). Similarly, we construct the registration feature vectors of ResNet-50 from a hidden-layer output of ResBlock-2 and the outputs of ResBlock-2 and ResBlock-3. Our strategy can overcome the degradation problem that appears as neural networks are deepened, so that the extracted image features are more representative.

Deep Residual Feature Extraction Network
ResNet emerged in 2015. By virtue of its depth and simple structure, this network has become more and more popular. It has shown stronger classification and detection capacity than the VGG series of networks. Theoretically, as a network becomes deeper, the extracted image features become more representative. However, blindly deepening a network reduces its learning efficiency, and task accuracy ceases to increase or even decreases. ResNet was developed primarily to cope with this degradation problem as networks are deepened [22]. In the residual unit depicted in Figure 2, there is an extra curve, i.e., a skip connection, from the input to the output. By this means, ResNet networks learn the differences between inputs and outputs. That is, they learn residual changes instead of fitting complete functions. Even as the network depth increases continuously, ResNet networks remain sensitive to residuals, which avoids the problem of vanishing or exploding gradients. This design makes the training of networks simple and fast. Hence, the invention of residual structures can greatly increase the depth of deep learning networks without over-fitting [24].

A residual unit can be expressed by the following:

$$y_l = h(x_l) + F(x_l, W_l)$$

and

$$x_{l+1} = f(y_l)$$

where $x_l$ and $x_{l+1}$ respectively represent the input and output eigenmatrix of the $l$-th residual unit; $F$ is a residual function, representing the residual learned by the unit, with $W_l$ denoting the weights of the unit; $h(x_l) = x_l$ represents an identity mapping; and $f$ is a ReLU activation function. Based on the above formulas, and treating $f$ as an identity mapping as well, the feature learned from a shallower layer ($l$) to a deeper layer ($L$) is given by the following:

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

According to the chain rule, the backpropagation gradient can be written as follows:

$$\frac{\partial loss}{\partial x_l} = \frac{\partial loss}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l} = \frac{\partial loss}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)$$

where $\partial loss / \partial x_l$ represents the gradient of the loss function with respect to the input of the $l$-th layer, and the 1 in the bracket represents the lossless transfer of the gradient by the shortcut mechanism of residual networks. The other gradient term has to pass through the weight layers and cannot be transferred directly. As a result, the residual gradients will not all be −1. Most importantly, even when their values become small as layers accumulate, the presence of the 1 restrains vanishing gradients. Therefore, compared with an ordinary deep learning architecture, it is easier to learn from residuals. In summary, introducing shortcuts enables identity mappings to be realized in ResNet, and, in this way, gradients can be transferred smoothly among different layers.
The architectures of the various ResNet networks are basically similar. Data pass successively through a 7 × 7 × 64 convolution layer, a 3 × 3 max-pooling layer for down-sampling, several residual layers and an average pooling layer for down-sampling. In the end, a Softmax model converts the previous outputs into a probability distribution to yield the final output. In general, the depths of ResNet networks are 18, 34, 50, 101 and 152. Among them, ResNet-50 and ResNet-101 are the most common. The computational complexity and time cost of the ResNet-101 network are relatively high. Hence, ResNet-50, with its moderate number of layers, is chosen for feature extraction in image registration. As illustrated in Figure 3, the feature-extraction network of ResNet-50 consists of five stages. Each stage is composed of a different number of residual blocks, and each residual block is realized via three convolution layers to eliminate depth effects. The number of learnable parameters in the ResNet-50 network model is up to 23 million.
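The role of the shortcut can be illustrated with a small numerical sketch. The code below is not the ResNet-50 implementation; it assumes a toy residual branch made of two dense layers (the names `residual_unit`, `linear_residual`, `w1`, `w2` and `w` are hypothetical) and checks two properties described in this section: a zero residual branch leaves the input unchanged, and the Jacobian of a unit with a shortcut is the identity plus the contribution of the residual branch.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, w1, w2):
    """One residual unit: x_next = f(h(x) + F(x)), with h the identity and f a ReLU."""
    residual = relu(x @ w1) @ w2      # F(x): toy two-layer residual branch (hypothetical)
    return relu(x + residual)         # the skip connection adds the input back

def linear_residual(x, w):
    """Shortcut plus a purely linear residual branch, used to inspect the Jacobian."""
    return x + w @ x

rng = np.random.default_rng(0)
x = rng.random(4)                     # non-negative toy input
zeros = np.zeros((4, 4))

# With a zero residual branch the unit passes a non-negative input through unchanged.
identity_out = residual_unit(x, zeros, zeros)

# Finite-difference Jacobian of the linear unit: identity (the "1") plus the branch term.
w = 1e-3 * rng.standard_normal((4, 4))
eps = 1e-6
jacobian = np.stack(
    [(linear_residual(x + eps * e, w) - linear_residual(x, w)) / eps for e in np.eye(4)],
    axis=1,
)
```

Even when the entries of `w` are tiny, the diagonal of `jacobian` stays close to 1, which is the numerical counterpart of the lossless gradient path provided by the shortcut.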

Feature Description Vector Construction
First, since UAV images are usually of high resolution, down-sampling them at the input side of a deep learning network would discard much of their feature information and, consequently, increase registration errors. Therefore, an arbitrary-size image to be registered and its reference image of the same size are input into the ResNet-50 feature-extraction network at their original resolutions. Each region of 8 × 8 pixels in an input image is defined as a feature point. Accordingly, the 8 × 8, 16 × 16 and 32 × 32 pixel regions around a feature point are employed to extract feature vectors at different scales, which correspond to the outputs of ResBlock-1 (residual block 1), ResBlock-2 (residual block 2) and ResBlock-3 (residual block 3) in ResNet-50, respectively. However, the features extracted by ResBlock-1 belong to the low level. Thus, a middle-layer output of ResBlock-2 was adopted in this work to construct a feature description vector, together with the outputs of ResBlock-2 and ResBlock-3, for registration.
Comparative experiments reveal that using the output of the second convolutional layer in ResBlock-2 as the first feature map, $F_1$, to construct a feature description vector for a feature point, corresponding to an 8 × 8 pixel region in input images, achieves an ideal registration performance. Given that the size of an input image is N × N, the size of $F_1$ is (N/8) × (N/8) × 512. Every 8 × 8 pixel region in the input image corresponds to a 512-dimensional vector in $F_1$. On the other hand, each 16 × 16 pixel region in the input image corresponds to a 512-dimensional vector in the output of ResBlock-2, denoted by $O_{\mathrm{ResBlock\text{-}2}}$. Hence, the size of $O_{\mathrm{ResBlock\text{-}2}}$ is (N/16) × (N/16) × 512. Since one feature vector in $O_{\mathrm{ResBlock\text{-}2}}$ is shared by four defined feature points, the Kronecker product, denoted by the ⊗ symbol, is performed on $O_{\mathrm{ResBlock\text{-}2}}$ to obtain the second feature map, $F_2$, for an input image:

$$F_2 = O_{\mathrm{ResBlock\text{-}2}} \otimes I_{2 \times 2 \times 1}$$

where $I$ represents a tensor of the subscripted shape filled with ones. Given that $A = (a_{ij}) \in \mathbb{C}^{m \times n}$ and $B = (b_{ij}) \in \mathbb{C}^{p \times q}$, the Kronecker product of $A$ and $B$ is a block matrix, which is defined as follows:

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix} \in \mathbb{C}^{mp \times nq}$$

Moreover, each 32 × 32 pixel region in the input image corresponds to a 1024-dimensional vector in the output of ResBlock-3, i.e., $O_{\mathrm{ResBlock\text{-}3}}$. Accordingly, the size of $O_{\mathrm{ResBlock\text{-}3}}$ is (N/32) × (N/32) × 1024. Because each feature vector in $O_{\mathrm{ResBlock\text{-}3}}$ is shared by sixteen defined feature points, the Kronecker product is performed on $O_{\mathrm{ResBlock\text{-}3}}$ to obtain the third feature map, $F_3$:

$$F_3 = O_{\mathrm{ResBlock\text{-}3}} \otimes I_{4 \times 4 \times 1}$$

Then, the three feature maps, namely $F_1$, $F_2$ and $F_3$, produced by the ResNet-50 feature-extraction network are concatenated into one feature description map, $F$. It contains the information of multiple layers, and its size is (N/8) × (N/8) × 2048. Every 2048-dimensional component in $F$ corresponds to an 8 × 8 pixel region of the input image.
As can be seen, the three ResNet-50 outputs used in feature description vector construction have sizes of (N/8) × (N/8) × 512, (N/16) × (N/16) × 512 and (N/32) × (N/32) × 1024, respectively. Therefore, it is necessary to up-sample $O_{\mathrm{ResBlock\text{-}2}}$ and $O_{\mathrm{ResBlock\text{-}3}}$. Two types of up-sampling are adopted in this work for the purpose of comparison. One is to combine the three feature components of a feature point into one description vector by the Kronecker product, as mentioned before. The other is up-sampling through bilinear interpolation [25], which extends linear interpolation to two-dimensional rectangular grids. It interpolates bivariate functions, and its essence is one-dimensional linear interpolation performed in each of the two directions in turn.
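The Kronecker-product up-sampling and the concatenation into a 2048-dimensional description map can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: random arrays stand in for the real ResNet-50 activations, and the function names `kron_upsample` and `build_descriptor_map` are hypothetical.

```python
import numpy as np

def kron_upsample(feature_map, factor):
    """Repeat each spatial cell factor x factor times via a Kronecker product with ones."""
    ones = np.ones((factor, factor, 1), dtype=feature_map.dtype)
    return np.kron(feature_map, ones)

def build_descriptor_map(f1, o2, o3):
    """Concatenate F1 with the up-sampled ResBlock-2 and ResBlock-3 outputs."""
    f2 = kron_upsample(o2, 2)   # (N/16, N/16, 512)  -> (N/8, N/8, 512)
    f3 = kron_upsample(o3, 4)   # (N/32, N/32, 1024) -> (N/8, N/8, 1024)
    return np.concatenate([f1, f2, f3], axis=-1)

# Random stand-ins for the network outputs of a 64 x 64 input (assumption).
n = 64
rng = np.random.default_rng(0)
f1 = rng.random((n // 8, n // 8, 512))
o2 = rng.random((n // 16, n // 16, 512))
o3 = rng.random((n // 32, n // 32, 1024))

descriptor_map = build_descriptor_map(f1, o2, o3)
print(descriptor_map.shape)  # (8, 8, 2048)
```

Because the Kronecker product with a tensor of ones only copies values, the four feature points that share one ResBlock-2 vector receive it unchanged, whereas bilinear interpolation would blend neighbouring vectors.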

Feature Matching
Feature description vectors should be normalized before they are used in feature matching. In this work, the defined feature points are matched by using the Euclidean distance as the similarity measure, i.e., by comparing the geometric distance between the feature descriptors of feature points:

$$d(x_i, y_j) = \lVert F(x_i) - F(y_j) \rVert_2 = \sqrt{\sum_{k} \left( f_k(x_i) - f_k(y_j) \right)^2}$$

where $d(x_i, y_j)$, for $i$ and $j$ = 1, 2, 3, . . . , is the distance between the $i$-th point $x_i$ in an image to be registered and the $j$-th point $y_j$ in its reference image. $F(x_i)$ and $F(y_j)$ represent the feature description vectors of the two points, respectively. Every component of a feature description vector is denoted by $f(\cdot)$, indexed here by $k$. Given a point $y_j$ in the reference image, the point $x_i$ in the image to be registered that minimizes the similarity measure $d(x_i, y_j)$ is regarded as the feature point associated with $y_j$. After obtaining the feature point mappings between the input images, the point coordinates should be restored to the original images for screening false matches and fitting a transform model.
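The matching step can be sketched as a nearest-neighbour search over normalized descriptors. The code below is an illustration only: the function names are hypothetical, descriptors are assumed to be non-zero, and placing each feature point at the centre of its 8 × 8 region when restoring coordinates is an assumption of this sketch.

```python
import numpy as np

def match_descriptors(desc_moving, desc_ref):
    """For each reference descriptor, find the nearest descriptor of the image to be registered."""
    a = desc_moving / np.linalg.norm(desc_moving, axis=1, keepdims=True)
    b = desc_ref / np.linalg.norm(desc_ref, axis=1, keepdims=True)
    # For unit vectors, ||a - b||^2 = 2 - 2 a.b; clip guards against tiny negative round-off.
    d2 = np.clip(2.0 - 2.0 * (b @ a.T), 0.0, None)
    nearest = d2.argmin(axis=1)
    return nearest, np.sqrt(d2[np.arange(len(b)), nearest])

def grid_to_pixels(flat_index, grid_width, cell=8):
    """Map flat indices on the feature grid to (x, y) pixel centres of the cell x cell regions."""
    rows, cols = np.divmod(np.asarray(flat_index), grid_width)
    return np.stack([cols * cell + cell / 2.0, rows * cell + cell / 2.0], axis=1)

# Toy data: the moving descriptors are a shuffled, rescaled copy of the reference ones.
rng = np.random.default_rng(0)
desc_ref = rng.random((6, 16))
perm = rng.permutation(6)
desc_moving = np.empty_like(desc_ref)
desc_moving[perm] = 3.0 * desc_ref

nearest, dist = match_descriptors(desc_moving, desc_ref)
pixels = grid_to_pixels([0, 9], grid_width=8)
```

Normalization makes the rescaled copies coincide with their originals, so `nearest` recovers the shuffle and the distances are close to zero.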

False Match Elimination and Transform Model Fitting
In this work, the PROSAC algorithm [26] is utilized to sift out false matches. In contrast to RANSAC [9], which samples uniformly from the set of matched point pairs, the PROSAC algorithm sorts all the matched point pairs according to a similarity metric and then samples from a progressively growing set of the best-ranked pairs. This method can not only save calculation costs, but also improve operation speed. The points in a sample set are reordered in advance. The inliers, which are effective for estimating the model, rank higher, while the outliers, which have a negative influence on model estimation, rank lower. Therefore, models are fitted by sampling from the higher-ranked point pairs, which reduces the randomness of the algorithm and raises the success rate of obtaining a proper model. The accuracy of image registration is thereby further improved.
Remote Sens. 2021, 13, 3605
In addition, the registration method proposed in this work is primarily aimed at scene mosaicking and panorama generation. Thus, only UAV images with small differences in shooting angles are taken into consideration. In this case, it can be assumed that the requirements of a homography transform are approximately satisfied. Therefore, a homography matrix is employed in the final registration transform for UAV images, since it is more suitable for two-dimensional content matching and scene expansion. If the shooting angles of the images to be registered are quite different, the registration problem should be solved through stereo matching approaches.
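The idea of rank-ordered sampling followed by homography fitting can be sketched as below. This is a deliberately simplified, PROSAC-like procedure and not the full algorithm of [26]: the sampling pool simply grows linearly with the iteration count, the homography is fitted with a plain direct linear transform, and all names, thresholds and synthetic data are assumptions of this sketch. The `quality` score stands for the similarity metric, e.g., the negated descriptor distance.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform for H such that dst ~ H @ src in homogeneous coordinates."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    h = vt[-1].reshape(3, 3)
    return None if abs(h[2, 2]) < 1e-12 else h / h[2, 2]

def project(h, pts):
    """Apply a homography to an (n, 2) array of points."""
    p = np.column_stack([pts, np.ones(len(pts))]) @ h.T
    return p[:, :2] / p[:, 2:3]

def prosac_like(src, dst, quality, iters=200, tol=2.0, seed=0):
    """Sample 4-point sets from a top-ranked pool that grows as iterations proceed."""
    rng = np.random.default_rng(seed)
    order = np.argsort(-quality)                 # best-ranked matches first
    src_s, dst_s = src[order], dst[order]
    n = len(src_s)
    best_h, best_inliers = None, np.zeros(n, dtype=bool)
    for t in range(iters):
        pool = min(n, 4 + (t * n) // iters)      # simplified linear growth (assumption)
        idx = rng.choice(pool, size=4, replace=False)
        h = fit_homography(src_s[idx], dst_s[idx])
        if h is None:
            continue
        inliers = np.linalg.norm(project(h, src_s) - dst_s, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_h, best_inliers = h, inliers
    if best_inliers.sum() >= 4:                  # refit on all inliers of the best model
        refit = fit_homography(src_s[best_inliers], dst_s[best_inliers])
        best_h = refit if refit is not None else best_h
    mask = np.zeros(n, dtype=bool)
    mask[order] = best_inliers                   # back to the original match order
    return best_h, mask

# Synthetic check: 40 exact matches with high quality, 20 gross outliers with low quality.
rng = np.random.default_rng(1)
h_true = np.array([[1.0, 0.02, 5.0], [-0.01, 1.0, -3.0], [1e-5, 2e-5, 1.0]])
src = rng.uniform(0.0, 500.0, size=(60, 2))
dst = project(h_true, src)
dst[40:] += rng.uniform(30.0, 80.0, size=(20, 2))
quality = np.concatenate([rng.uniform(0.6, 1.0, 40), rng.uniform(0.0, 0.5, 20)])
h_est, inlier_mask = prosac_like(src, dst, quality)
```

Because the earliest samples are drawn only from the best-ranked pairs, a good model tends to be found within the first few iterations when the ranking is informative, which is the advantage over uniform sampling described above.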

Results
In the experimental part of this work, the deep feature extraction methods were first investigated in detail for image registration. Then, our proposed registration method was compared with existing traditional registration algorithms and a state-of-the-art registration algorithm based on deep learning. The experiments were performed on UAV visible-light images from different sensors and scenes. The experimental results indicate that our proposed algorithm can provide higher registration accuracy than the other algorithms. The hardware platform for the experiments is configured with an Intel Core i5-4590k processor at a 3.5 GHz main frequency and 16 GB of RAM. The software environment consists of the 64-bit Windows 10 operating system and the TensorFlow 1.13.1 deep learning framework. The Python version is 3.7.0.

Experimental UAV Images
In order to verify the stability and applicability of the registration algorithms, UAV visible-light image pairs were separately collected from five different typical scenes, namely urban areas, buildings, roads (by a river), farmlands and forests. Among them, the urban and building scene images are taken from the drone image dataset downloaded from the ISPRS official website [10]. This dataset was built in 2014 and contains a total of 26 high-altitude drone images, 1000 images taken close to the ground and 1000 ground shots. In this work, two high-altitude images and two near-ground images were selected from the dataset, covering the urban and building scenes. The sizes of these images are 2044 × 1533 and 3000 × 2000 pixels, respectively. The road scene image pair is obtained from the UAV sample images provided by the Pix4Dmapper software. This testing dataset contains 13 drone images taken near the ground in 2013. The size of these images is 2000 × 1500 pixels. The UAV image pair of the farmland scene was taken by a Parrot Sequoia camera with a 5 mm focal length in September 2017 at Dayi County, Sichuan Province, China. The sensor resolution is 2404 × 1728 pixels, and the shooting height is 80 m. The near-ground image pair of the forest scene was acquired in June 2019 by a Zenmuse Z30 camera, whose minimum focal length is 10 mm. The image size is 1920 × 1080 pixels and the camera height is 152 m. The observation location is in Wusufo Mountain National Forest Park, Xinjiang Uygur Autonomous Region, China. All the UAV images used in our experiments are presented in Figure 4.

Visual Evaluation of Registration Results
Taking the urban scene as an example, two feature extraction methods based on deep neural networks, VGG-16 [23] and ResNet-50, are compared. In addition, distance weighting, bilinear interpolation and the Kronecker product are examined for building the deep residual feature vectors of the ResNet-50 network. The matched feature point pairs obtained by these methods are presented in Figure 5. DResNet-50 represents a deep residual registration method that uses feature distance weighting, similar to the method of VGG-16. BResNet-50 represents the deep residual registration method that uses bilinear interpolation to realize up-sampling. KResNet-50 represents our deep residual registration method, which uses the Kronecker product to integrate feature vectors.
From the above five pairs of feature point matching images, it can be clearly seen that the three methods based on the deep residual neural network obtain more evenly distributed matched feature points in image overlap areas than the method based on VGG-16. These results will lead to improvements in the final registration accuracy.

Furthermore, checkerboard mosaicked images for the urban scene obtained by different registration methods are displayed in Figure 6. Generally, a checkerboard mosaicked image is generated by alternately piecing together blocks from a registered image and its reference image. In this manner, alignment details between the registered image and its reference image can be revealed. Some details of these checkerboard mosaicked images are given in Figure 7. Because the visual differences are slight, no detailed comparison of the mosaicked images for the road scene is presented.
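A checkerboard mosaic of the kind described above can be generated with a few lines of NumPy. The sketch assumes that the two images are already aligned and have identical shapes; the function name and the block size are hypothetical choices.

```python
import numpy as np

def checkerboard_mosaic(registered, reference, block=64):
    """Alternate block x block tiles taken from two aligned images of identical shape."""
    if registered.shape != reference.shape:
        raise ValueError("images must share the same shape")
    rows, cols = np.indices(registered.shape[:2])
    take_reference = ((rows // block) + (cols // block)) % 2 == 1
    mosaic = registered.copy()
    mosaic[take_reference] = reference[take_reference]
    return mosaic

# Tiny example: zeros stand for the registered image, ones for the reference image.
mosaic = checkerboard_mosaic(np.zeros((4, 4)), np.ones((4, 4)), block=2)
```

Edges that continue smoothly across tile borders indicate good alignment, while broken roads or roof lines at the borders expose registration errors.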
As shown in Figure 6, the matching results of our method based on the ResNet-50 features are better than those of the image registration method based on the VGG-16 features. The reason is that ResNet-50 has a deeper network structure and can generate feature vectors with higher dimensions, which is beneficial for distinguishing false matches from correct ones and makes the mosaicking results more accurate. Moreover, in terms of details, it can be seen from Figure 7 that the Kronecker product integration method outperforms the distance weighting method and the bilinear interpolation method. The bilinear interpolation fusion method has the lowest accuracy. One reason is that, in the distance weighting method, the similarities between the feature vectors extracted at three different scales are calculated separately, and the obtained results are weighted to get the final feature similarity for matching feature points. This processing may enlarge feature distances improperly and cannot represent the real similarity between feature points well. The other reason is that the bilinear interpolation of feature vectors assigns inappropriate estimated values to the low-scale feature vectors, which increases the registration error. The Kronecker product method combines the residual feature vectors according to the relationship between them and the feature points. It preserves the information of the feature vectors to the greatest extent and improves image registration quality.

Quantitative Comparison of Registration Results
In terms of registration accuracy, Table 1 lists the root mean square error (RMSE) results for the five scenes obtained by the methods based on deep neural networks and by other common image registration algorithms, including ORB [27], SIFT [28], SURF [29], KAZE [30], AKAZE [31], CFOG [32] and KNN + TAR [33]. The running time of the ORB algorithm is proportional to the number of feature points: the more feature points are required, the longer the algorithm runs. Therefore, as a compromise between running time and accuracy, the number of feature points for ORB is preset to 1000. Moreover, the CFOG and KNN + TAR algorithms are implemented on the MATLAB software platform, which has lower code efficiency. In the registration community, especially for point registration, RMSE can be expressed in the following form:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left\| y_i - T(x_i, \theta) \right\|^{2}}$$

where $x_i$ and $y_i$, for $i = 1, 2, 3, \ldots, n$, represent the matching point pairs from the image to be registered and the reference image, respectively, and $n$ is the total number of pairs. $T$ is the transform model, $\theta$ is the model parameter vector and $\|\cdot\|$ denotes the Euclidean distance between two points. Generally, the smaller the RMSE, the higher the registration accuracy.

As given in Table 1, the first five methods, which are based on point features, perform well on the UAV test images of different scenes. The low accuracy of the CFOG algorithm may be because it is more suitable for heterogeneous optical image registration. The KNN + TAR algorithm is unstable, and it may be more suitable for satellite-borne optical image registration. The registration methods based on deep networks all provide higher registration accuracy for the different image scenes than the other current algorithms. KResNet-50 can even offer subpixel accuracy for the five scenes.
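
The RMSE measure described above can be sketched in a few lines. This is a minimal example, assuming the transform model T is a 2-D affine transform with parameter vector θ = (a, b, tx, c, d, ty); the function names and the point coordinates are hypothetical and are not taken from the experiments in this work.

```python
import math


def apply_affine(theta, point):
    """Map a point (x, y) with the affine parameters theta = (a, b, tx, c, d, ty)."""
    a, b, tx, c, d, ty = theta
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def rmse(sensed, reference, theta):
    """Root mean square Euclidean distance between reference points y_i
    and transformed sensed points T(x_i, theta)."""
    total = 0.0
    for x_i, y_i in zip(sensed, reference):
        u, v = apply_affine(theta, x_i)
        total += (y_i[0] - u) ** 2 + (y_i[1] - v) ** 2
    return math.sqrt(total / len(sensed))


# Hypothetical matches: a pure translation by (2, -1); the last pair is off by (3, 4).
theta = (1.0, 0.0, 2.0, 0.0, 1.0, -1.0)
sensed = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
reference = [(2.0, -1.0), (12.0, -1.0), (2.0, 9.0), (15.0, 13.0)]
error = rmse(sensed, reference, theta)
print(error)  # 2.5
```

In this toy case three pairs fit the model exactly and one pair has a residual of 5 pixels, so the RMSE is sqrt(25 / 4) = 2.5 pixels; a value below 1 would correspond to subpixel accuracy.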
Unlike the registration methods based on point features, the methods based on deep learning do not depend on complex contents and detail information in image scenes. When dealing with UAV images of simple scenes with little detail, such as farmlands and forests, they still exhibit good registration performance. The primary reason is that, in deep-learning-based methods, the number of feature points is determined not directly by the visible features of the input images but by their sizes, and many of the features used in registration are deep features of the images. Moreover, compared with the existing method based on traditional convolutional neural networks, the proposed deep residual registration method can extract more effective information for registration. The reason is that, compared with ordinary CNNs, the residual structures of ResNet-50 can effectively alleviate the gradient vanishing, gradient explosion and network degradation problems caused by increasing the number of network layers. They allow the gradient information corresponding to the defined feature points to propagate smoothly in forward and backward passes. Thus, higher-level information can be effectively extracted for the construction of feature-point description vectors. Therefore, the methods based on ResNet-50 provide better accuracy in Table 1.
Additionally, the running times of the different registration algorithms for the five scenes are compared in Table 2. Table 2 shows that the ORB algorithm has an obvious advantage in running time. The reason is that, owing to the high resolutions of UAV images, the other algorithms try to find as many feature points as possible, so more time is consumed in feature-point detection and matching. On the other hand, almost all the other algorithms, except CFOG and KNN + TAR, achieve higher registration accuracy than ORB. This result is also due to the number of feature points.
Compared with the VGG-16 registration algorithm, the registration algorithm based on the deep residual network takes about 10 to 20 s longer, which is acceptable given the improvement in registration performance.

Discussion and Conclusions
In this work, an automatic registration method for drone images based on deep residual network features was proposed. The method needs no additional training and does not depend on the specific contents of images. It takes the center point of each 8 × 8 pixel region of an input image as a feature point and constructs multi-scale feature description vectors for the feature point from the output vectors of three residual network layers by the Kronecker product. After the feature points are matched, the PROSAC algorithm is utilized to sift out outliers and fit a geometric transform model. The experimental results for UAV images from different scenes indicate that combining deep residual features and PROSAC can achieve highly accurate, even subpixel, registration. Compared with existing state-of-the-art registration algorithms, the proposed image registration method based on deep residual features exhibits remarkable performance enhancements.
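
To illustrate why the number of feature points depends only on the image size, the following minimal sketch places one feature point at the center of every full 8 × 8 pixel region. The function name `grid_feature_points`, the pixel-index coordinate convention (the center of pixels 0 to 7 is 3.5) and the 224 × 224 input size are assumptions made for illustration, not details confirmed by this work.

```python
def grid_feature_points(height, width, cell=8):
    """Return (row, col) centers of every full cell x cell block of an image."""
    half = (cell - 1) / 2.0  # center offset in pixel-index coordinates (assumed convention)
    return [
        (r + half, c + half)
        for r in range(0, height - cell + 1, cell)
        for c in range(0, width - cell + 1, cell)
    ]


# Hypothetical 224 x 224 input: 28 x 28 regions, regardless of image content.
points = grid_feature_points(224, 224)
print(len(points))  # 784
print(points[0], points[-1])  # (3.5, 3.5) (219.5, 219.5)
```

Because the grid is fixed by the image dimensions, a featureless farmland image yields as many candidate points as a detailed urban one of the same size.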
Although deep residual network features can describe images in depth, these features have high dimensions that are proportional to the size of the input images. Therefore, our method has no advantage in running time, and subsequent research can focus on reducing its computational complexity. In addition, the deep feature extraction network used in this work is a model pre-trained on ImageNet, which is more suitable for natural images. In the future, more appropriate UAV image datasets can be adopted to fine-tune the network weights, so that the deep residual network can extract more distinctive features from UAV images and improve registration results. Meanwhile, other false-match removal methods can be adopted, or deep learning can be used directly to fit the transformation parameters, so as to realize an end-to-end registration framework. The effects of changes in the acquisition environment, such as illumination and shadows, on drone images also need to be studied in further detail. Moreover, the registration of multi-source heterogeneous images, such as infrared, near-infrared and multispectral images, is also a problem that needs to be solved.
Author Contributions: Conceptualization and methodology, X.L. and X.W.; software and validation, G.L. and X.W.; formal analysis, X.L. and Y.J.; resources, X.H.; data curation and writing-original draft preparation, G.L. and X.W.; writing-review and editing, X.L. and Y.J.; visualization and project administration, W.H.; funding acquisition, W.X. and W.H. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Science and Technology Program of Sichuan, grant numbers 2020YFG0240 and 2020YFG0055, and the Science and Technology Program of Hebei, grant numbers 20355901D and 19255901D.