Different Viewpoints Image Registration for Remote Sensing Based on Multiple Image Features

Remote sensing image registration with different viewpoints plays an important role in the field of geographic information system. However, when there exists ground relief variations and imaging viewpoint changes, non-rigid distortion occurs thus the registration becomes increasingly challenging. The current methods will suffer from missing true correspondences when non-rigid geometric distortion occurs. To address the problem, we propose a robust remote sensing image registration method based on SIFT feature distance and geometric structure features. At first, the scale-invariant feature transform (SIFT), a partial intensity invariant feature descriptor is used to extract reliable feature point set from sensed and reference image respectively. Secondly, a novel algorithm based on multiple image features which constrains the geometric structure during transformation is used to estimate exact correspondences between point sets. Finally, an accurate alignment is achieved by mapping the sensed image to reference image using thin-plate spline. We evaluated the performances of the proposed method by three sets of remote sensing images obtained from the unmanned aerial vehicle (UAV) and the Google earth, and compared with five state-of-the-art methods where our algorithm solved the non-rigid registration problem of remote sensing image with different viewpoints and showed the best alignments in most cases.


Introduction
In recent years, image registration has become a extremely important technique in a wide range of applications such as computer vision, pattern recognition, environment monitoring, medical image analysis and remote sensing.Image registration refers to the fundamental task in image processing to align two or more images of the same scene (i.e., the reference images and the sensed images), which can be multitemporal (taken at different times), multisource (derived from different sensors) and multiview (obtained from different viewpoints).In this paper, we mainly focus on registering the remote sensing images taken from different viewpoints.As a basic process it becomes a vital yet challenging problem to find more accurate registration algorithms.With an exact registration for remote sensing, it will be well used in many important applications, both at social and scientific levels.These applications include, for example, military automatic target recognition, compiling and analyzing images and data from satellites, assignment of climate changes, environment monitoring, and the management of nature disasters [1].
Existing remote sensing image registration methods can be approximately classified into two categories: area-based methods and feature-based methods.In [2][3][4][5][6], various reviews on image registration methods can be found.We briefly review them here, especially in the application of remote sensing image registration.
The area-based methods can be broadly classified into three types: correlation-like methods, Fourier methods, and mutual information (MI) methods [7].In correlation-like methods area, an automatic image registration method was proposed by Goncalves et al. [8], which is based on the identification of a thin line through the Hough transform.Based on the singular value decomposition (SVD) and the unified random sample consensus (RANSAC) algorithm, Tong et al. [9] presented a novel subpixel phase correlation method which demonstrated the promising performance and feasibility.However, The correlation-like methods suffer from the high computational complexity and the flatness of the similarity measure in textureless regions.Fourier based method is a popular approach for coarse registration, it is a frequency estimation problem in the frequency domain of the image data essentially.By employing the multiple signal classifier algorithm, Xu et al. [10] studied the application of a subspace-based frequency estimation approach for the Fourier-based image registration problem and achieved a more robust and accurate registration result.The Fourier based methods have less computational complexity, however, considerable spectral contents differences existed in image pairs can cause a drop in performance.MI methods has recently been used as a similarity measure for remote sensing and medical image registration, which possesses the characters of high accuracy and better generality.In order to register multitemporal remote sensing images, Chen et al. [11] demonstrated a new joint histogram estimation algorithm called generalized partial volume estimation (GPVE) for computing mutual information, the method produced a better registration consistency.In addition, an approach based on the implementation of particle swarm optimization (PSO) and mutual information (MI) is proposed to determine more accurate pairs of corresponding points between the images.Nevertheless, the accuracy of the MI-based methods decreases since these methods do not provide a global maximum of the entire search space for the transformation.
Area-based methods usually compare intensity patterns in images via correlation metrics and register entire images or sub-images, while feature-based methods find correspondence between image features such as points, lines, and contours, and establish a correspondence between a number of especially distinct points in images [12].The feature-based methods are based on pixel intensities instead of local shapes and structures, they perform robustly against noise contamination, rotation and illumination variations and multisensor analysis.In addition, when there is complex distortion between the images to be aligned, the computation complexity or the search space of the area-based methods will increase nonlinearly with the transformation complexity and feature-based methods are preferable.In this work, the remote sensing images captured exist the local no-rigid geometric distortions caused by ground relief variations and imaging viewpoint changes, thus we mainly focus on feature-based methods (i.e., local invariant features) for registration.The method generally consists of four steps [13]: (1)  There are a variety of popular local invariant features which have been proposed in remote sensing image registration, such as scale-invariant feature transform (SIFT) [14][15][16][17], speeded-up robust features (SURF) [18] and Harris [19], etc.Recently, the computational efficiency and registration accuracy have been improved by optimizing the feature extraction and adding extra constraint for feature matching based on local invariant features [12][18] [20][21][22][23].Generally, a good descriptor should have two main properties which are distinctiveness (different features should have different descriptors) and robustness (a descriptor's stability to a variety of image geometric and photometric transformations).Improving distinctiveness while maintaining robustness is the main concern in the design of local image descriptors [24].Having a comparison of the performance between different descriptors for affine transformation, scale change, rotation, image blur, jpeg compression, and illumination change, Mikolajczyk et al. [25] demonstrated that scale-invariant feature transform (SIFT) [14] performs the best for most of the tests.
SIFT as a capable feature descriptor of extracting distinctive invariant features from images, it can be applied to perform reliable registration across a substantial range of affine distortion, change in 3-D viewpoint, addition of noise, and change in illumination [15].Therefore, there are various researches for remote sensing image registration based on the variants of SIFT algorithm and the combination of SIFT and some other algorithms.Li et al. [26] proposed a new criterion named scale-orientation joint restriction criteria in order to overcome the intensity difference between remote sensing image pairs.Sedaghat et al. [27] introduced an automatic registration algorithm by extracting high-quality SIFT features in the uniform distribution of both the scale and image spaces.Furthermore, some variants of SIFT such as PCA-SIFT [28], SAR-SIFT [29], AB-SIFT [30] and SIFT-DRS [31] are proposed in order to make an improvement on SIFT.Furthermore, Goncalves et al. [32] developed a new AIR (Automatic image registration) method based on the combination of image segmentation and SIFT, and the method is complemented by a robust procedure of outlier removal.In [33], a novel coarse-to-fine scheme for automatic image registration based on SIFT and MI is proposed, and their method achieved the outlier removal and also can generally reject most incorrect matches.
However, if the image pairs are obtained under a large difference of camera viewpoints and suffer from local distortion caused by ground relief variations and imaging viewpoint changes ,the correspondences obtained by estimating only SIFT feature can be unreliable.
Feature-based methods are typically formulated as a point set registration problem since point representations are general and easy to extract.In order to achieve a robust point registration, it is crucial to construct putative correspondences based on local invariant feature similarity at first and then estimate the spatial transformation based on geometric structure feature constraint [1] [34].Here, we briefly review some point set registration methods since our method is based on feature point.Fischler et al. proposed the classical Random Sample Consensus (RANSAC) algorithm [35].Myronenko and Song [36] introduced a probabilistic method, called the Coherent Point Drift (CPD) algorithm, for both rigid and non-rigid point set registration.Moreover, Liu and Yan [37] investigated how to discover common visual pattern discovery via spatially coherent correspondences and recover the correct correspondences.The traditional LDA models which are used to solve the problem of scene classification lack the spatial relationship and linkages between the global and local information, so Zhang et al. [38] defined the Spatially Consistent Topic Model by making full use of the correlation between image classification and annotation.Recently, Ma et al. proposed a robust L2-minimizing estimate (L2E) [39] for non-rigid point set registration, they later proposed a flexible and general algorithm called locally linear transforming (LLT) [40] for both rigid and non-rigid registration on remote sensing images.More recently, a new method named GLMDTPS was proposed by Yang [41].However, these algorithms consider only the geometric features without combining with meaningful local invariant features, which makes the features of points be less distinctive.
Although many approaches above-mentioned have been proposed for different applications, there still exists the following problems for remote sensing image registration at present.First, most of the methods mainly focus on solving the rigid and affine geometric distortion problems, which means that the non-rigid distortion problems exacerbated by the complexity of space distribution and terrestrial objects are in urgent need to be settled.Second, the methods mainly focused on single feature and have a limited performance when directly applied to remote sensing images.For example, [14][15][16][17] achieved image registration through SIFT only.Although some variants of SIFT algorithm in [26][27][28][29][30][31][32] are proposed, these algorithms are still relatively limited as they do not consider the geometric structure features.Severe non-rigid distortion can significantly deteriorate the correspondence estimation on the remote sensing image pair.Furthermore, literatures [34][35][36][37][38][39][40][41] only use the geometric structure features between the correspondence and the transformation estimation.
Compared with the current methods, the major differences and advantages of this work include: (i) The Euclidean distance and the shape context [42] descriptor are respectively used as the global and local geometric features, both of which play complementary roles to exploit the geometric structure of the feature point sets.(ii) We combine the SIFT feature, a local invariant feature with geometric structure features to form the multiple image feature, this feature is the base upon which our robust remote sensing image registration method estimates the correspondence.(iii) To constrain the non-rigid transformation which is too arbitrary, the mapping is regularized by a global structure preservation term which further improves the accuracy of the registration.
The rest of this paper is organized as follows: section 2 introduces our method in detail.Section 3 demonstrates the registration performance of our method on various types of remote sensing images with different viewpoints against other approaches, followed by some concluding remarks in section 4. Results show that the proposed method is highly distinctive and robust under non-rigid geometric distortions.

Feature Descriptors Extraction
At the first step, given a remote sensing image I r (i.e., the reference image) and another one with the different viewpoint I s (i.e., the sensed image), we employ the SIFT detector [14] to extract feature points from the reference and the sensed images respectively.The SIFT algorithm has been introduced by Lowe in 1999 [14] and then improved in 2004 [15] in order to match local features in natural images.We now introduce the SIFT algorithm briefly in this section, it usually contain the following steps: • Keypoints Detection: At first, keypoints are selected and characterized by their position (x,y), scale σ, and orientation θ, i.e., P(x, y, σ, θ).
By using difference-of-gaussians, the pixel value extrema in both scale and space are found.• Keypoint localization: in this step, the most stable extrema are converted to keypoints.
• Orientation Assignment: Lowe [15] proposed to compute a local histogram of gradient orientations, weighted by the gradient magnitudes and a Gaussian window.The most dominant direction of the histogram is detected and used as the orientation θ of the keypoint.If the peak is above 80% of the maximum then it is selected as an orientation.We can obtain several keypoints with the same position (x,y) and scale σ but with different orientations θ. • Descriptors Extraction: the local image gradients are stored to represent each keypoint P(x, y, σ, θ).In order to get translation and scale invariance, a square neighborhood is defined around each point with a size depending on σ and θ, where θ can ensure its rotation invariance.The aforementioned normalized neighborhood consists of 4 × 4 histograms, in which all of the orientations are summarized.Each histogram exists of 8 bins which represent the gradients in that directions.Then for each keypoint, the 4×4×8 histogram bins make a feature vector with 128 values.Before computing these histograms, a Gaussian weighting function is applied to reduce the effect of pixels further away from the keypoint.Then the SIFT descriptor is obtained via concatenating and normalizing these histograms for each keypoint.
In this paper, we choose to use a SIFT detector to extract feature points from remote sensing images with different viewpoints, since the distinctiveness is promising, which make it scale and rotation invariant and performance better to a large range of affine distortion and illumination changes.What more, with this outstanding property, we believed that each feature vector with 128 values greatly help feature point sets registration, so a new SIFT feature distance is presented to improve the robustness and accuracy of transformation.More details will be elaborated later.After obtained feature point sets Y M×D = {y 1 , ..., y M } T and X N×D = {x 1 , ..., x N } T from the sensed image and the reference image where D denotes the dimension of the point set, and the corresponding point set be denoted by placing a hat above the letter.we define a warping template Y w (the initial Y w = Y) to estimate corresponding points X (the points in X).And our new non-rigid point set registration method is based on the iteration of the following two main steps: (i) estimating correspondence between Y and X; (ii) Mapping Y w to the correspondent X until resulting in the maximum point-wise overlap between two point sets.Here we consider point set registration as a probability density estimation problem.

Correspondence Estimation
• Gaussian Mixture Model(GMM): The Gaussian distribution is one of the most famous probability statistic models.The GMM probability density function is given by where P(m) denotes the mixing weights.In the meantime, all GMM components share the equal isotropic covariances and equal membership probabilities.Furthermore, in order to account for noise and outliers, we added an additional uniform distribution p(x|M + 1) = 1 N to the mixture model .Where ω, 0 ≤ ω ≤ 1, denotes the weight of the uniform distribution and the GMM density function takes the form Then we rewrite the posterior probability function of GMM in the form However, consider only the Euclidean distance between X and Y w in (3) may lead to insufficient robustness in some registration scenarios.For example, although the local structure around x a and x b are probably totally different, if , we obtain p ma = p mb .Shape context is another geometric feature which is presented for describing the local difference.And to settle this problem, we consider SIFT feature as a local invariant feature to combine with geometric features.
• Mixture-feature Gaussian Mixture Model: Firstly, let us define the Gaussian function in the form By changing either the constant w or the exponent term x − y w 2 , we can obtain a different p(x|θ).Due to the exponent term x − y w 2 is much easier to insert extra terms.So we aim to implement the adjustment on the exponent term of (3), and define the mixture feature descriptor of point set as where Shape context [42] is used as a local geometric structure descriptor.In considering two points, x n is in one set and y m is in another set; their shape contents are h n (k) and h m (k), respectively, where h(k) is the value of k-bin for the log-polar histogram.Let c mn represent the matching cost measure of these two points.Then the local geometric structure feature is defined as: The more similar the shape contexts for the two points of x n and y m are, the more likely for these two point to be corresponding.Thus the combination of the aforementioned global and local geometric structure features difference formed a mixed geometric structure features which is defined as G = G Global mn + γG Local mn .• SIFT Feature Distance: SIFT can transform an image into a large collection of local feature vectors and each of which is invariant to image rotation, scaling, translation and partially invariant to affine, 3D projection or illumination changes.In our work, SIFT feature distance is used to enhance the distinctiveness of each point and defined as: Where the matrix S xy describes the SIFT feature distance between point set x and point set y.It is considered as a feature vector with 128 values for each point set in X and Y.Each descriptor vector can be considered as the following form: y j = {y 1 , y 2 , ..., y 128 } and x i = {x 1 , x 2 , ..., x 128 }.
Substituting ( 5) into (4), it takes the form Where σ 2 is an equal isotropic covariances, γ and ξ are two constants, the default value of γ is 10.
Next, we can therefore rewrite (3) as As shown in the function above, (10) , where β is a constant to control the spatial smoothness.According to the representation theorem, the optimal transformation function takes the form y w = ∑ M m=1 G(y, y m )c m , where c m is a D × 1 coefficient vector.Hence, the minimization over H boils down to finding a finite M coefficients vector c m .We formulate the energy function as Where ρ 2 controls the convergence of the energy function, it is initialized to 0.05.λ 2 trace(C T GC) is endowed according to the Tikhonov regularization framework, named global structure preserving term.
Next, we take partial derivative of (12) with respect to coefficient matrix C and obtain the gradient function in the form where denotes the diagonal of a matrix, 1 is a 1 × D row vector of all ones.The Hadamard product which can be shown as is denoted by •, and ⊗ denotes the Kronecker product.Moreover, non-rigid point set registration has high degrees of flexibility, which meant to say that the optimization process may be trapped by local minima.Therefore, we improve the convergence by adopting deterministic annealing with a manner of coarse-to-fine on the parameter ρ 2 .This initializes with a large value for ρ 2 which is progressively decreased by ρ 2 = φρ 2 , where ρ is a constant.Note that a relatively large ρ makes the annealing scheme slow enough by which the robust result is acquired.And the energy function (12) will converge to a optimal solution eventually.

Image Transformation Model Estimation
In order to avoid holes and/or overlaps in the output image, a backward approach [43][44] is proposed for building the thin-plate spline (TPS) [45] transformation model.Note that the estimated TPS model from the reference image to the sensed image is the key idea of the backward approach, and can be defined as where I t and I s represent the transformed image and the sensed image, respectively.The size of I t is same with the reference image I r , and T(x, y) is the estimated TPS model from the reference image to the sensed image and can be defined as where Y w x and Y w y indicate the coordinates of feature point set y w .The kernel function U(y w i , y w j ) is defined as y w i − y w j 2 log y w i − y w j .Thus, the TPS transformation model (w 1 ..., w m , c 1 , c x , c y ) can be estimated by where T is the matrix transpose operator and O is a 3 × 3 matrix of zeros.Y is the m × 2 matrix and indicates the coordinates of feature point set Y .P is the m × 3 matrix where i th row of P is (1,y w i x ,y w i y ).K is the m × m kernel matrix where K ij = U(y w i , y w j ).

Image resampling and transformation
After the TPS transformation model (w 1 ..., w m , c 1 , c x , c y ) is estimated, the transformed image I t is calculated by the mapping functions constructed during the previous step.The bicubic interpolation algorithm is used in the sensed image I s on the regular grid.Each pixel from the sensed image can be directly transformed using the estimated mapping functions and the backward approach.The registered image data from the sensed image are determined using the coordinates of the sensed pixel (the same coordinate system as of the reference image) and the inverse of the estimated mapping function.The image interpolation takes place in the sensed image on the regular grid.Neither holes nor overlaps will occur in the output image (i.e., the transformed image I t ) by using the backward approach.

Data and Evaluation Criterion
We implemented the proposed method in Matlab, and three series of remote sensing data sets are used to evaluate the performances of the proposed method: (i) UAV image pairs with horizontal viewpoints transformation which contains 100 pairs.(ii) Containing 150 UAV image pairs with vertical viewpoints transformation.(iii) Satellite image registration with horizontal and vertical viewpoints transformation which contains 50 pairs.All of those image pairs were taken at the same time and the images range in overlap is 60-80% .The images have a resolution in the range from 640×450 to 1100×850 pixels.Moreover, compared against five state-of-the-art methods: SIFT [14], SURF [18], RSOC [23], CPD [36], GLMDTPS [41].All experiments are tested on a PC with 2.20 GHz Intel CPU and 8 GB memory.
A reliable and fair criteria is needed to evaluate the performance of registration approaches because of lacking public different viewpoints remote sensing image registration data sets with desirable ground truth.In this paper, the ground truth is established for matching correctness by checking manually and tried three evaluation methods to evaluate registration approaches.The first one is based on the root of mean square error (RMSE), the second one we measure the median error (MEE), and the third one is based on maximum error (MAE).
More specifically, the ground truth is constructed by the following four steps manually: (1) select at least ten control point pairs in each retinal image pair, (2) compute the transformation via the manual correspondences, (3) transform the matched points in the sensed image to the reference image by the forward spatial transformation, (4) calculate Euclidean distances between the transformed points and the corresponding points in the reference image.Note that we selected the matched points manually using Matlab R2016a and generate the manual ground truth.

Registration Results on UAV Image with Horizontal Viewpoints Transformation
In the first series of experiments, we test the the performance of our method on the UAV images captured over different province of China by different horizontal viewpoints.The test data sets consist of 100 image pairs and each image has the resolution of 586×452 to 1000×750.We evaluated the performance of our method by all corresponding points which were well-distributed and selected on the easily identified places in I t and I r .The registration error was defined as the mean of coordinate differences between the determined corresponding points.Table 1 provides the results of the means and standard deviations of MEE, MAE and RMSE on the whole data sets and compare our method to other five state-of-the-art registration methods, such as SIFT [14], SURF [18], RSOC [23], CPD [36], GLMDTPS [41], our proposed method showed the best performance for all the three criteria.To demonstrate the accurate alignment of our method, all registration examples are shown in Fig. 1.  [14], SURF [18], CPD [36], GLMDTPS [41], RSOC [23]

Registration Results on UAV Image with Vertical Viewpoints Transformation
In the second series of experiments, we also test the the performance of our method like section 3.2 on the UAV images captured over different province of China but by different vertical viewpoints.The test data sets consist of 150 image pairs and each image has the resolution of 606×422 to 1150×650.We evaluated the performance of our method by the all corresponding points were well-distributed and selected on the easily identified places in I t and I r .The registration error was defined as the mean of coordinate differences between the determined corresponding points.Table 2 provides the results of the means and standard deviations of MEE, MAE and RMSE on the whole data sets and compare our method to other five state-of-the-art registration methods aforementioned where our proposed method also showed the best performance for all the three criteria.Some typical registration examples are shown in Fig. 2.  [14], SURF [18], CPD [36], GLMDTPS [41], RSOC [23]    In the third series of experiments, we use satellite images obtained from Google earth for evaluation.The test data sets consists of 50 image pairs captured from different places.Each image has the resolution of 586×452, and each image pair (I s or I r ) involves ground relief variations and imaging viewpoint changes.The ground truth and the registration errors were determined as the same way in section 3.2 and section 3.3.The quantitative registration errors for the proposed method are shown in Table 3, where our method maintains accurate alignments in all experiments.Some typical satellite image data sets (i.e., Berlin, Las Vegas, Mekong River, Paris, WuHan, Washington and Hawaii) and the registration examples are given in Fig. 3 and Fig. 4 respectively.

Conclusion
In this paper, we proposed an accurate method based on local invariant features and geometric structure descriptors.This approach contains SIFT feature extraction, feature point sets registration based on improved SIFT algorithm via geometric structure constraint and image transformation by estimating non-rigid image transformation model.The main contributions of this work are considered as the proposed four-step approach which solved the remote sensing image registration problems in the moderate ground relief variations and imaging viewpoint changes.Experiments on both UAV images and satellite images from different viewpoints demonstrate that our method shows best registration performances against five state-of-the-art methods .

Figure 1 .Figure 2 .Figure 3 .
Figure 1.Registration examples on the six typical UAV image pairs.The columns show the registration process for the five UAV image pairs.(A) the Great Wall; (B) ChongQing; (C) GuangXi; (D) XiAn; (E) Jinsha River; (F) the Yangtse River.The first two rows indicate two UAV images ( Is and Ir ) with different viewpoints.The first row in each method shows the registration results of SIFT feature points, and the second row in each method gives a 5×5 checkerboard for alternately displaying the transformed image It from Is and the image Ir with the different viewpoint as a montage.

Figure 4 .
Figure 4. Registration examples on the seven typical satellite image pairs.The rows show the registration process for the seven satellite image pairs.The first column shows the transformed image I t from I s , the second row gives a 5×5 checkerboard for alternately displaying I t and I r as a montage.The last column shows the registration results of SIFT feature points.

preprints.org) | NOT PEER-REVIEWED | Posted: 3 May 2017 doi:10.20944/preprints201705.0027.v1 essential
and taken as global geometric structure features.The shape context is considered as local geometric structure features because it can supplement the above-mentioned deficiencies, distinguishing the situation of insufficient robustness in some registration scenarios.Local invariant features can enhance the ability to distinguish, SIFT feature is selected in our paper.We will explain each of them in the following.•Geometric Structure Features: In this work, global geometric structure features is used to describe squared Euclidean distances from one point to another, and defined as Globalis global geometric structure features and G Local is local geometric structure features, S denotes local invariant feature distance and it is a M × N matrix.Though the European distance feature is not robust, it is the basis for making the model effective, is Preprints (www.

preprints.org) | NOT PEER-REVIEWED | Posted: 3 May 2017 doi:10.20944/preprints201705.0027.v1 still
enhance the distinction of the point sets , and provide us a reliable correspondent target point set X .After requiring a row-by-row normalization process p mn ← p mn / ∑ M j=1 p jn on P, the corresponding target point set X takes the form We then define a reproducing kernel Hilbert space (RKHS) H by choosing a positive definite kernel, here we adopt the Gaussian radial basis function (GRBF), which can be written in the form G(y i , y j differentiate X and Y w by evaluating both global and local geometry structure feature and SIFT feature distance, while (3) concerns the Euclidean distance only.In the case when the geometry structure feature of X and Y w are similar, SIFT feature can Preprints (www.

Table 1 .
Experimental statistics on UAV images with horizontal viewpoints transformation.Means and standard deviations of MEE, MAE and RMSE for SIFT and our method on the whole test data involving 100 UAV image pairs with horizontal viewpoints transformation.Bold indicates the best performance.

Table 2 .
Experimental statistics on UAV images with vertical viewpoints transformation.Means and standard deviations of MEE, MAE and RMSE for SIFT and our method on the whole test data involving 150 UAV image pairs with vertical viewpoints transformation.Bold indicates the best performance.

Table 3 .
Experimental statistics on satellite image with horizonal and vertical viewpoints transformation, where (A)Berlin; (B)Las Vegas; (C)Mekong River; (D)Paris; (E)WuHan; (F)Washington; (G)Hawaii.Means and standard deviations of MEE, MAE and RMSE for seven typical image pairs with horizonal and vertical viewpoints transformation are shown.