Fast Aerial Image Geolocalization Using the Projective-Invariant Contour Feature

: We address the problem of aerial image geolocalization over an area as large as a whole city through road network matching, which is modeled as a 2D point set registration problem under the 2D projective transformation and solved in a two-stage manner. In the ﬁrst stage, all the potential transformations aligning the query road point set to the reference road point set are found by local point feature matching. A local geometric feature, called the Projective-Invariant Contour Feature (PICF), which consists of a road intersection and the closest points to it in each direction, is speciﬁcally designed. We prove that the proposed PICF is equivariant under the 2D projective transformation group. We then encode the PICF with a projective-invariant descriptor to enable the fast search of potential correspondences. The bad correspondences are then removed by a geometric consistency check with the graph-cut algorithm effectively. In the second stage, a ﬂexible strategy is developed to recover the homography transformation with all the PICF correspondences with the Random Sample Consensus (RANSAC) method or to recover the transformation with only one correspondence and then reﬁne it with the local-to-global Iterative Closest Point (ICP) algorithm when only a few correspondences exist. The strategy makes our method efﬁcient to deal with both scenes where roads are sparse and scenes where roads are dense. The reﬁned transformations are then veriﬁed with alignment accuracy to determine whether they are accepted as correct. Experimental results show that our method runs faster and greatly improves the recall compared with the state-of-the-art methods.


Introduction
Aligning an aerial image to a known Geographical Coordinate System (GCS), which is called aerial image geolocalization, is a fundamental task in the field of remote sensing. Roads are a kind of salient feature on the ground and are labeled on the existing Geographic Information System (GIS), so we consider it feasible to align aerial images to a certain GCS with roads. As is shown in Figure 1, we develop an efficient geolocalization method imposed directly on the aerial images by registering the road networks in the aerial images to the reference road networks in the GIS when the approximate positions of the aerial images, i.e., in which city the aerial images are captured, are known. The proposed method can geolocalize an aerial image over a search area larger than 1000 km 2 within seconds. Compared with traditional geolocalization methods, the proposed method is much less expensive and efficient, because the only sensor needed is the camera and the reference road vector map, which is highly compact and is readily available from many kinds of GISs (such as road vector maps from OpenStreetMap (https://www.openstreetmap.org)). Though appealing, the task is quite demanding. The main challenges lie in the great appearance and geometric differences between the reference road vector map and roads extracted from the aerial images. On the one hand, despite the good performance of the state-of-the-art road extraction methods [1][2][3][4], much noise still exists, causing the extracted Figure 1. The motivation of this study: performing efficient geolocalization directly on query aerial images by matching the roads in the aerial images with those in the reference road map (in blue) over an area as large as a whole city. Geolocalization is the task of estimating a transformation that aligns the aerial image to a certain Geographical Coordinate System (GCS). To make it clearly seen, only parts of the reference road map are shown here.
The geometric differences are mainly caused by the change of camera pose. Since the aerial images may be taken under any camera pose, it is necessary to use the 2D projective transformation to represent the mapping function between road points in the aerial images and the reference road vector map. Under the 2D projective transformation, most of the geometric features, including the lengths of line segments and the angles between two lines, are not invariant, leading to difficulties in establishing the correspondences between the roads in the aerial images and the reference road maps [5]. Because of this, many existing works on road based geolocalization [6][7][8][9] treat the transformation between the aerial image and the reference road network map as a similarity transformation, and extra attitude and altitude measurements from other sensors on the Unmanned Aerial Vehicles (UAVs) must be used to project the aerial images to the reference coordination.
Even though there are some shape matching methods [10][11][12] that can deal with the geometric deformation to some degree, they are not suitable for the task. The main drawback of these methods is that they can only deal with shape matching between two geometric primitive sets where the geometric primitives mostly correspond. In the geolocalization, the roads in the aerial images are only a small subset of the reference road maps, and it is hard to segment the corresponding part of the reference road vector map when the transformation is unknown. It is also inefficient to divide the reference road map into overlapping tiles and compare each tile with the roads in the aerial image with such shape matching methods because such tiles have difficulty capturing the correct corresponding area of the aerial image in the reference road map when the possible variation in translation, scale, and rotation is highly uncertain.
To address all the aforementioned challenges, we take a step further from our previous work [5] and develop a novel robust geolocalization algorithm. The core of the new method is a projective-invariant geometric feature, which consists of a road intersection together with all the road points surrounding it, called the Projective-Invariant Contour Feature (PICF) in this paper. The main motivation to design the PICF is that such a feature can split the same subsets of the point set under the 2D projective transformation, which can be proven mathematically. With such a property, we can encode the subsets of the point set in a projective-invariant manner. Compared with the "2-road-intersections-tuple" feature in [5], the PICF encodes much more geometric information and is more discriminative. This makes our algorithm more flexible. With much fewer initial feature correspondences and a much higher matching inlier ratio, each initial transformation recovered from the feature correspondences can be further refined with methods such as ICP [13][14][15]. Such a strategy eliminates the influence of the noisy estimation of the initial transformation, makes the geolocalization more robust, and improves the recall to a large extent.
Moreover, benefiting from the good invariance of the proposed PICF, a robust matching method, such as the methods proposed in [16][17][18], can be used to improve the matching quality further. Along with the PICF, we develop an effective matching algorithm considering the geometric coherence between matching pairs to improve the matching quality further. This is accomplished by minimizing the cost function, which includes the PICF descriptor loss and geometric inconsistency loss. The optimization problem can be solved sufficiently by employing the known graph cut method [18][19][20]. Finally, the correct transformation is determined by checking all the candidate matches. In this stage, a flexible way to recover the initial transformation from those correspondences is adopted. The homography transformation is estimated with all the correspondences when enough correspondences are found. Otherwise, we can recover the coarse transformation from just one matching pair and then refine it with a local-to-global ICP method. Such a strategy enables our method to deal with all cases efficiently no matter whether the roads in the scenes are dense or sparse.

Projective-Invariant Contour Feature
In this section, we first introduce the definition of the PICF and prove that the PICF is an equivariant representation under the 2D projective transformation group. Then, we introduce the details of the method to compute the PICF descriptor. Finally, we present a sufficient matching method, where the feature matching is modeled as an optimization problem to minimize both the matching distance between PICF descriptors and the geometric inconsistency between every two matching pairs. 2.1.1. Definition of the PICF Given a set of 2D points P = {(x 1 , y 1 ), (x 2 , y 2 ), . . . , (x N , y N )} and a 2D point p(x i , y i ) ∈ P, we then give the following definitions. All notations and definitions can be easily understood by referring to Figure 2. Definition 1. The neighbor point of p in the direction of θ is the point q x j , y j , which meets the following conditions: is the angle of vector q − p related to the X axis;

Equivariance of the PICF
Equivariant representations are highly sought after as they encode both class and deformation information in a predictable way. Similar to the equivariance defined for the image in [21], the equivariance for the point set can be defined as follows: Equivariance for the point set: Let G be a transformation group and L g P be the group action applied to a point set P. A mapping Φ : E → F is said to be equivariant to the group action L g , g ∈ G, if: where L g and L g correspond to the application of g to E and F, respectively, and satisfy L gh = L g L h .
In the case of the PICF, the mapping Φ aims to extract the contour of a given point p, and we will show that this mapping is equivariant to the 2D projective transformation group.
The equivariance of contour C(p, P) under the 2D projective transformation can be defined equally as follows: Theorem 1. Let P be the point set after applying the 2D projective transformation H to P, p be the corresponding point of p, and C(p, P), C (p , P ) be the contour of p and p , respectively, for any point x ∈ C(p, P); there exists a point x ∈ C (p , P ), which satisfies x = H(x), and equally, for any point x ∈ C (p , P ), there exists a point x ∈ C(p, P), which satisfies Before the proof of Theorem 1, we introduce another theorem. . ( Theorem 2 shows us the fact that the relative order of points on a line stays the same under the homography transformation. The proof of Theorem 2 is provided in Appendix A.
Proof of Theorem 1. Let x 1 , x 2 , . . . , x N θ be the ordered points on the line l θ = p + λ[cos θ, sin θ] T , λ ∈ R + in the point set P, where x i − p < x j − p , if i < j, and x 1 , x 2 , . . . , x N θ be the corresponding points of x 1 , x 2 , . . . , x N θ under the homography transformation H. As is illustrated by Proposition 2, x 1 , x 2 , . . . , x N θ are collinear, and which means x 1 is the point closest to p in the direction of angle θ . Therefore, x 1 is a point on the contour of C (p , P ); that is to say, there exists a corresponding point x ∈ C (p , P ) for any point x ∈ C(p, P), which satisfies x = H(x). Since the homography transformation is reversible, for any point x ∈ C (p , P ), there exists a point x ∈ C(p, P), which satisfies

The PICF Descriptor
At this point, we have introduced the definition of the PICF and proven its equivariance under the 2D projective transformation. Even though we know that the points on the contour C, C correspond one-to-one, we still do not know the definite corresponding point of a certain point. Therefore, it is necessary to align the contour point sets to the same direction and scale in order to encode the corresponding PICFs with the same descriptors.
Let the local coordinate relative to point p of a point x i on the contour of p be x p i = x i − p and the local coordinate of its corresponding point x i under the 2D projective Since the homography transformation H projects point x i to x i , we can get: Since the projective transformation H can be approximated to the affinity transformation in most applications in remote sensing, which means s T is close to the zero vector, the transformation between x p i and x p i can be approximated as: Equation (4) emphasizes an important fact that the transformation between the local coordinate x p i and x p i can be approximated as a linear transformation, which can be presented with a 2 × 2 matrix. Since an undetermined 2 × 2 matrix has 4 parameters and each point-to-point correspondence accounts for two constraints, it is necessary to specify 2 point correspondences in order to constrain the transformation fully.
Even though we cannot obtain the exact correspondences between two contour point sets, we know that they correspond one-to-one; that is to say, for a certain point x p i , there exist a certain point x p j that is corresponding to it, which satisfies x p j = Ax p i . Adding all the points up, we get: Let: be the centroid of contour C and C , respectively; we get: The line l passing through the center point p and the centroidx p of contour C divides the contour into two parts P 1 = Since the points correspond one-to-one and the line l and l also correspond, the points in P 1 should also correspond one-to-one to those in P 1 or P 2 . Similar to the definition of the centroid of the whole contour, we define the centroid of the half-contour as: With the correspondences of the centroids of the contours and centroids of the halfcontours, we get two groups of possible constraints, with which the 2 × 2 transformation matrix can be determined. The contours can be aligned to the same coordinate with the transformations so that we can get a projective-invariant description of the contour. As is shown in Figure 3, the detailed method to compute the PICF descriptor is: (1) Sample contour points: Divide the area centered at point p into N areas with angle  (6) and (8). Solve Equation (10) to get the two possible transformation matrices A 1 and A 2 : where x 1 , x 2 are two constant vectors and are taken as (3) Align contour points: Project all contour points with A 1 and A 2 , respectively, and get two transformed point sets (4) Compute the PICF descriptor: Convert the coordinates of the points in point sets C i , i = 1, 2 to the polar coordinates, and sort the points with their angles in ascending order, then we get , compute the average distance of contour points in the area to the center point as: where N is the dimension of the descriptor, and θ s = 2π N .
Finally, the two possible PICF descriptors of point p in the point set P are obtained as: (c) the contour is aligned with two affinity transformations, respectively, which is estimated by aligning the centroid of the whole contour and the centroids of the half-contours to given points; the two aligned contours are shown in pink and yellow, respectively; (d) the two contours reconstructed from the PICF descriptors are shown in pink and yellow, respectively, and the original contour is shown in light blue.

Discussion
As proven in Section 2.1.2, the equivariance of the PICF under the 2D projective transformation holds strictly true when the direction angle θ is continuous. To represent the contour point set with the vector, as introduced in Section 2.1.3, the contour point set is sampled with discrete θ. The strict equivariance does not hold in such a case, but the equivariance is approximately satisfied when the sample step θ s is small enough. For each point on the contour, there exist points that are close enough to its corresponding point on the transformed contour as long as the sample step θ s is small enough, which means the approximate corresponding point always exists. Therefore, in our implementation, the number of sample contour points is set to 4× the dimension of the PICF descriptor to make the sample contour a better approximation of the full contour. Besides, we choose the closest points to the center in each small direction interval as the sample points of the contour, which makes the estimated PICF descriptors less sensitive to small breaks of the detected roads. These measures make approximate equivariance hold well in practical applications, which is also proven by our experiments.

Matching PICFs with Geometric Coherence
At this point, given a point set P and a center point p, we can compute a projectiveinvariant descriptor, which encodes the local point cloud information. With such a descriptor, many fast retrieval algorithms, such as the FLANN [22], are sufficient to build the correspondences of two PICF sets. However, large amounts of outliers exist in the correspondences estimated from such methods, which leads to more unnecessary tries in the following transformation refining and validation. To solve the problem, we model the PICFs matching as an optimization problem to minimize the cost function, which penalizes the differences between the PICF descriptors and also the geometric inconsistency between two matching pairs.

Geometric Coherence between PICF Correspondences
Multiply both sides of Equation (11)  Considering the geometric consistency between matching pairs, the matching be- . . , m} can be modeled as a problem to minimize the cost function: where i(j), i(k) are the indexes of the corresponding PICF in S r of jth and kth PICF in S q , respectively. The first term of Equation (12) is the data term, which models the matching distance between PICF descriptors, and the second term is the smooth term, which penalizes the geometric inconsistency between two matching pairs.

Refining Initial Matches with Geometric Coherence
It is difficult to solve Equation (12) directly; instead, we seek a two-stage solution. We first find the best matching for each PICF and then verify the correctness of the correspondences with both the matching distances and geometric consistency, as described in Section 2.2.1. Let be the corresponding transformation matrix of m i . We define a label function L : m i → {0, 1}, which maps m i to 1 if the matching is correct, and 0 otherwise. Then, the PICF matching problem is equivalent to finding the labels of all initial correspondences that minimize both the matching distances and geometric inconsistency. where s are the kernel functions. We use the kernel function to normalize both the data term (matching distances) and the smooth term (geometric inconsistency) to improve the robustness. A i − A j is the distance between A i and A j and is calculated with the method proposed in [23].
The data term e(L(m i ), K d (φ(m i ))) is defined as: This energy term penalizes matching pairs whose labels are 1, but with large matching distances, and those whose labels are 0, but with small matching distances.
In e L(m i ), K s A i − A j , if both correspondences are labeled as outliers, the penalty is K s A i − A j , thus "rewarding" label 0 if the geometric inconsistency is big. The penalty of considering a pair of correspondences as the inlier is 1 − K s A i − A j , which rewards the labels if the transformation matrices of the two correspondences are similar.
Equation (13) can be solved efficiently and globally via the graph-cut algorithm [18][19][20]. The complete matching algorithm considering geometric consistency is illustrated in Algorithm 1.

Algorithm 1 PICF matching algorithm.
Input: S r , the reference PICF set, and S q , the query PICF set.
12: end for 13: for all m i , m j ∈ M initial do 14: c 01 , c 10 ← 1, 1. 15: G ← AddTerm2(G, c 00 , c 01 , c 10 , c 11 ) 18: end for 19: L ← Apply graph-cut to G. 20: M m i c r i ↔ c q i ← Get all the correspondences with label 1 from L.

PICF Based Fast Geolocalization
At this point, we propose the PICF, which is equivariant to the homography transformation group, and a way to describe the PICF by a vector invariant to homography transformation, with which the potential correspondences can be found effectively. Furthermore, to improve the matching, we model the PICFs matching as an optimization problem to minimize the cost function, which penalizes the differences between PICF descriptors and also the geometric inconsistency between two matching pairs, and solve it with the graph-cut algorithm sufficiently. In this section, we introduce our fast geolocalization method based on the PICF in detail. The overview of our method is shown in Figure 4.
Our fast geolocalization method contains the offline stage and the online stage. The offline stage is for processing the reference road vector map whose output is used in the online correspondence search scheme, allowing for a very fast operation. As is illustrated in Figure 4, in the offline stage, all the road intersections are detected and used as the centers of the PICFs; and then, the two possible PICF descriptors are computed for each PICF; finally, a KD-tree is built using all these PICF descriptors.  Figure 4. Overview of the proposed PICF based geolocalization method. Compared with the method in [5], the newly proposed algorithm is more sufficient, which is mainly a benefit of the good performance of the newly-designed PICF. The performance improvements come from two aspects: on the one hand, the effective encoding of the road geometric structure by the PICF, together with the effective matching method considering geometric coherence between correspondences, leads to an initial correspondences set with a high inlier ratio; on the other hand, benefiting from the good matching results (only at most one matching candidate for each feature), each recovered homography transformation can be refined, making it more robust and less sensitive to initial transformation estimation to verify whether the geolocalization is correct. Here, N is the number of obtained matching pairs.
The proposed online search scheme starts by extracting the roads for a given aerial image using our method in [1]. The same as the offline stage, all the PICFs are detected, and their corresponding descriptors are calculated. The correspondences between the query PICF set and the reference PICF set are found with the geometric-consistent matching algorithm introduced in Section 2.2. These matching PICF pairs are then used to recover the homography transformations between the aerial images and reference road map with two different strategies: (1) If the number of matching pairs is larger than N c (20 in this paper), an initial homography transformation is estimated with all the matching pairs using the RANSAC method and then refined with the plain ICP. The refined transformation is then verified with the alignment accuracy between the query road map and the reference road map. (2) If the obtained feature correspondences are less than N c or the first strategy fails to recover the right transformation, the obtained pairs are afterward sorted with their matching scores. Then, a homography transformation is estimated with a single correspondence and refined using all the road points in the query road map with a local-to-global iterative closest point algorithm. Furthermore, the refined transformation is then verified with the alignment accuracy. Each correspondence in the sorted pair set will be tested until a homography transformation that aligns the query road map and the reference one accurately enough is found. If our method fails to recover the correct transformation after all the correspondences obtained with the geometric-consistent matching algorithm, the left correspondences in the initial matching will be checked in the same manner.

Initial Homography Transformation Estimation from One Matching Pair
Give a PICF correspondence

Transformation Refining
When the approximate homography transformation H is estimated from a given correct PICF correspondence, the estimated homography transformation has poor accuracy and should be refined further. This is because H is approximated to an affinity transformation, and the upper-left 2 × 2 matrix A is estimated using only points on the contour.
The initial transformation estimated from one correspondence is refined so that it aligns the query road point set to the reference road point set accurately. The ICP algorithm is appropriate to solve such a problem. However, in our experiments, we find that the convergence rate is quite low when using the plain ICP algorithm, which may arise from the poor accuracy of the initial transformation. Since the initial transformation is estimated with the points near the center point of the PICF, it should be accurate enough for points near the center point. Inspired by [15], we use a local-to-global ICP algorithm, which first aligns the points near the center point and then aligns the points far from the center point gradually.
In the local-to-global ICP algorithm, the points whose distances to the center point are less than d in the query road point set are chosen to refine the initial transformation H 0 with the standard ICP algorithm firstly. If the ICP algorithm is convergent, the refined transformation H 1 is then used to align the points whose distances to the center point are less than σd(σ > 1), resulting in a further refined transformation H 2 ; otherwise, the correspondence used to estimate the initial transformation is regarded as an incorrect correspondence and is discarded. The optimization continues until all query road points are used.

Transformation Validation
After the transformation refining, the homography transformation estimated from a correct correspondence aligns the query road point set with the reference point set accurately, while the homography transformation estimated from an incorrect correspondence cannot. We use the inlier rate of road points in the query road map to measure the alignment accuracy. A road point is regarded as an inlier when the distance between its projection point under estimated transformation and the closest road point in the reference road map to its projection point is less than a certain threshold δ. Therefore, the inlier rate can be computed as: where H(H, x q ) is the projection point of x q under H. One may argue that the transformation validation should be done with the initial transformations estimated from the correspondences before the transformation refining to remove incorrect correspondences further and to reduce unnecessary tries to find the correct transformation. However, in our experiments, we find that the inlier ratios under the coarse transformations are not sufficient to distinguish the right transformations from those that are wrong. As is shown in Figure 5, before the transformation refining, the inlier ratios are low under both the transformations estimated from the correct and incorrect correspondences, and there is not an obvious threshold of inlier ratio that can distinguish the correct transformations from these wrong ones. After the transformation refining, the inlier ratios of correct transformations are much higher than those of incorrect transformations in most of the cases. Alignment accuracy after transformation refining Correct match Incorrect match Figure 5. Alignment accuracy before and after transformation refining. Before the transformation refining, neither the transformation estimated from a correct match, nor that from an incorrect match can align the query road point set to the reference road point set accurately. After the transformation refining, the alignment accuracy under most of the transformations estimated from the correct matches improves greatly, while that of the incorrect match does not.

Results
In this section, we firstly perform experiments on large-scale synthetic aerial image data sets to evaluate both the effectiveness of the PICF and the geolocalization method. Then, experiments on real aerial image data sets are conducted to test the performance of our algorithm on real applications. Implementation details: Our implementation was written in C++, single-threaded. All the experiments ran on a computer with an Intel Xeon E5-2620v2 CPU and 24 GB RAM. In all the experiments, the road maps for aerial images were extracted with our road segmentation method proposed in [1] in advance. The parameters of our algorithm were set based on experiments on a validation synthetic aerial image data set and kept fixed in all experiments. The main parameters settings were: the dimension of the PICF descriptor dim = 48; the weight for smooth energy term λ = 1/N, where N is the number of initial correspondences; the parameters of the kernel function for the data term and smooth term σ data = 48 and σ smooth = 2, respectively; the maximum iterations for plain ICP N icp = 120; the distance to divide the query road points into different parts in local-toglobal ICP d = max{w, h}/4, where w, h are the width and height of the query road image, respectively; the ratio of the distance to expand the points to align in local-to-global ICP σ = 2; the distance error threshold for a road point to be regarded as an inlier δ = 20; the minimum required inlier ratio to verify the geolocalization as correct α = 0.7.
Baseline and metrics: We compared our method with the state-of-the-art algorithm, the "2-road-intersections-tuple" (2RIT) based geolocalization method [5], in three aspects: precision, recall, and run time. Here, precision = TP/(TP + FP), where TP is the number of successfully and correctly geolocalized aerial images and FP is the number of successfully, but wrongly geolocalized aerial images; recall = TP/N, where N is the number of all aerial images used. We used the error between the estimated homography and the ground truth one to verify whether the geolocalization was correct, which is computed with the same method as in [5]. In all experiments, the parameters for "2RIT" were set the same as reported in [5].

Experiments on Synthetic Aerial Image Data Sets
In this section, experiments on synthetic aerial image data sets are conducted to test the performance of our geolocalization method under different viewing angles, resolutions, and scenes. Since it is quite difficult to collect enough aerial images with the needed viewing angles, resolutions, and context, we followed the same manner as in [5] to validate our algorithm using synthetic aerial image data sets. The three reported synthetic aerial image data sets in [5], named "multi-poses data set", "multi-GSDs data set", and "multiareas data set", were used in our experiments to test the performance of our method under different viewing angles, resolutions, and scenes, respectively. Each data set contains synthetic aerial images from two different areas in Massachusetts. The roads in one area (the 13th area) are sparse, while roads in the other area (the 17th area) are much denser. Therefore, these data sets are representative.
As is shown in Figure 6, the precision of the two methods in both areas is higher than 90% for all aerial images taken under different viewing angles. Meanwhile, our novel method improves the recall in both areas when the viewing angle is less than 30 • , which is usually the case in real applications. The improvement of recall is especially significant in the 17th area where the roads are dense. Figure 6 also shows that the recall of both methods decreases with the increase of the viewing angle. However, the performance degradation of the method presented in this paper is much slower.   Figure 7 shows the matching inlier rates under different viewing angles in both areas. As can be seen, the inlier rate of the initial PICF correspondence set obtained with finding the nearest neighbor in the PICF descriptor space is acceptable. In both areas and under all view angles, the inlier rate of the initial correspondence set is no less than 10%, which indicates that the proposed PICF can encode the geometric structure effectively and is discriminative. It also proves that the PICF is projective-invariant. Furthermore, the geometric coherence based matching method is efficient to improve the initial matching quality, especially when the viewing angle is relatively small. As is shown in Figure 8, the improvement mainly comes from the ability of the geometric coherence based matching method to reject most of the wrong correspondences while preserving most of the right correspondences. The good performance is more obvious when the roads are dense in the search area (17th area). In such areas, the similar local structures of the road network are ubiquitous, leading to a low inlier ratio of initial matching. It is thus more meaningful to use the geometric coherence constraint. The improvement of the matching inlier rate is significant to reduce the unnecessary tries in the following geometric consistency validation and to improve the efficiency. Even though the geometric coherence based matching method is less effective when the viewing angle is large, since it may reject all the right matching pairs for some aerial images in such a case, leading to an inlier rate of zero, we try to recover the homography transformation from the rest candidates to ensure our algorithm can work in such a case.
We also conducted experiments on the "multi-GSDs data set" to evaluate the influence of Ground Sample Distance (GSD) on the geolocalization. The result is shown in Figure 9. It can be seen that the recall of the "2RIT" method decreases sharply when the aerial images are captured with small GSDs, which indicates that the "2RIT" is sensitive to the GSD. Our PICF based geolocalization method is less sensitive to the change of GSD since the recall varies much more slightly with the GSD compared with the "2RIT" method.   To study the performance of our method on different scenes, we conducted experiments on the "multi-areas data set", which contains as many as 2000 aerial images from 20 areas. The precision and recall are shown in Figure 10. As can be seen, in all 20 areas, the precision of both methods is higher than 95%. The recall of our newly proposed method is improved significantly in 19 out of 20 areas. In 12 areas, the recall is higher than 90%. We highlight the improvement of recall in the first and seventh area. Several big cities lie in these two areas, leading to very dense roads. The similar structures of the road network in these two areas are ubiquitous. In such a case, the "2RIT" method may fail, since there exist too many candidates for each "2-road-intersections-tuple", while our new algorithm performs well in such case. When a large amount of matching PICF pairs is found, it is robust and accurate enough to estimate the initial homography transformation using all these pairs. Such a strategy makes our algorithm effective at dealing with such cases. The only area where the recall of the new method is less than the "2RIT" method is the 13th area, and this is mainly caused by the too sparse roads in some aerial images, leading to few PICFs found in those aerial images.  The run time of geolocalization for both methods is shown in Figure 11. In 17 out 20 areas, the average geolocalization time method proposed in this paper is less than 1 s and is less than 2 s in all 20 areas, while the average run time for "2RIT" is usually several seconds or even tens of seconds in areas like the seventh area and the 12th area. In all areas, the average run time of our new algorithm is less than that of "2RIT". Besides, the run time of the PICF based geolocalization method varies more slightly between different aerial images in the same area compared with "2RIT".

Experiments on Real Aerial Image Data Sets
We also conducted experiments on real aerial images captured with cameras on UAVs to test the performance of our algorithm in real applications. We collected aerial images from three areas over China, including Suizhou, Xian, and Fengyang. The aerial images were captured in the form of videos. The details of the videos are shown in Table 1. These videos were then downsampled with one frame every 48 frames. All the selected aerial images in Suizhou were then resized to the resolution of 1024 × 540, and aerial images in the other two areas were resized to 960 × 540, while the final GSDs of those aerial images ranged from 0.3 m/pixel to 1.0 m/pixel. The camera internal parameters were calibrated, and the distortion of the images was corrected offline.
Since the Field Of Vision (FOV) of the camera is less than 70 • , the observed ground area in one single aerial image cannot cover enough roads. The roads in individual aerial images are not enough to determine their positions uniquely. Therefore, we stitched the aerial image sequences into bigger aerial mosaics to ensure that each mosaic contained enough roads. The final mosaics covered an area of 0.5 km 2 to 3.0 km 2 . Finally, we obtained 21, 24 and 7 aerial image mosaics in Suizhou, Xian, and Fengyang, respectively. We then performed road segmentation to generate the corresponding road images for each mosaic, and those road images were then used to perform geolocalization.
All the reference road maps used in the geolocalization were downloaded from the Gaode map website (https://wprd01.is.autonavi.com/appmaptile?x={x}&y={y}&z={z}& lang=zh_cn&size=1&scl=2&style=8&ltype=11) with QGIS, which were saved in the raster map format with the resolution of 1.0 m/pixel. Each reference road map was large enough to cover the area where the aerial images were captured. The detailed information for the reference maps is shown in Table 2.
We conducted geolocalization on the generated road images with both the "2RIT" method and our new method. Since the ground truth of homography transformation between the aerial image mosaics and the reference road map cannot be obtained, we verified whether the geolocalization is correct by projecting the aerial images onto the reference road map and checking visually whether they overlap with the reference road map well. The result is shown in Table 3.  Table 3. Geolocalization result on the real flight aerial image data set.

2RIT Ours
Suizhou, Hubei 21 14 16 Xian, Shanxi 24 14 20 Fengyang, Anhui 7 5 7 In all three areas, the proposed method outperformed the "2RIT". The improvement is especially significant in Xian and Fengyang, where the recall increased by 42.86% and 28.57%, respectively. Compared with those in Suizhou, the GSDs of the mosaics in the two areas were higher. The width of roads may be more than 40 pixels in the road images for the main streets, making it difficult to estimate the tangents accurately. In such cases, the "2RIT" method failed. Our newly designed PICF is insensitive to road width, as is shown in the experiments on the synthetic data sets, which makes our new method more robust. Some cases where the "2RIT" failed while the proposed method succeeded are shown in Figure 12. The proposed method failed when no complete PICF was found in the scene. The PICF becomes less valuable when there exist too many directions in which no contour point is found, leading to the failure of geolocalization. A camera with a larger FOV may overcome this deficiency, which is left for further validation in our future work. Figure 12. Successful geolocalization results. The query aerial image mosaics are projected onto the reference road maps. The PICFs from which the homography transformations are estimated are shown in purple. The top mosaic is from Fengyang, while the other two mosaics are from Xian. Even though the roads in the aerial images cannot be aligned to those in the reference road map exactly in some cases, the geolocalization results can be used as a good initial estimation to calculate more accurate transformation. However, this is not the target of this paper.

Discussion
The results of the experiments on both synthetic and real aerial image datasets show that our newly proposed PICF based geolocalization method improves the recall and run time to a large extent compared with the state-of-the-art "2RIT" method. Furthermore, the results indicate that the PICF based geolocalization method is more robust to differences of viewing angle and GSD between the aerial images and the reference road map. In this section, we discuss the main factors that contribute to these improvements in detail.
The improvement of recall mainly comes from two aspects. First, the proposed PICF is projective-invariant and encoded with a projective-invariant descriptor, which means the representation of the PICF is robust to viewing angle change. Moreover, the PICF can encode much more geometric information of the road network compared with the "2-road-intersections-tuple", making it more efficient to establish initial correspondences, which can be clearly seen from Figure 7. Since few outliers exist in initial correspondences, it is possible to use time-consuming optimization methods, such as the ICP, to refine each estimated initial homography transformation in an acceptable run time. With more accurate homography transformation, the geometric consistency check is much more efficient, as is discussed in Figure 5. In the "2RIT", there usually exist hundreds or even thousands of matching candidates, making it unreasonable to refine each hypothesis generated by the feature correspondences. Therefore, the "2RIT" depends heavily on the accuracy of the initial homography transformation estimation. A correct, but less precise hypothesis may be rejected wrongly when the estimation of the tangents of road branches is noisy. The other factor contributing to the improvement lies in the effectiveness of the proposed homography transformation refining strategy, especially the idea of recovering the initial homography transformation from just one matching pair and refining it with the local-toglobal ICP method. Such a strategy makes our method feasible in more cases, even when the roads are sparse or the cover areas of the aerial images are relatively small.
The greater robustness of our new algorithm to changes of viewing angle and GSD also benefit from the good performance of the proposed PICF. On the one hand, the invariance of the PICF is proven mathematically. Therefore, the encoding of the PICF is robust to changes in viewing angle. On the other hand, no accurate geometric information, such as the tangent, is needed in the computing of the descriptor of the PICF, which makes the PICF descriptor less sensitive to road segmentation noise. In the "2RIT" method, the tangents of road branches are required to estimate precisely, which is hard when the road segmentation is not smooth or the GSD of the aerial image is small. Since when the GSD is small, the width of the road is usually dozens of pixels in aerial images, making it difficult to extract the centerlines of roads accurately. In such a case, the initial transformation is noise and may be rejected wrongly even when the correspondence between the "2-roadintersections-tuple" is correct.
Three factors contribute to the improvement of the efficiency. First, the PICF encodes more geometric information of roads and is more discriminative, leading to much fewer matching candidates. Second, the geometric coherence based matching method rejects most of the incorrect correspondences and improves the matching quality further. Last, the strategy of using all the matching pairs or only one matching pair to recover the initial transformation is elastic to deal with all cases despite the density of roads, making our method quite efficient to find the right location.

Conclusions
This paper presents an aerial image geolocalization method, which uses only the road vector map as the reference. The new method is capable of efficiently and effectively estimating the transformation aligning the query roads to the reference roads over a whole city within several seconds on a single CPU, making it possibly available in real scenarios. Experiments on both the synthetic and real aerial images show that our method can significantly improve the recall of geolocalization while maintaining high precision and efficiency compared with the state-of-the-art aerial image geolocalization methods based on road network matching. The experiments also highlight areas of potential improvement. In future work, we will further enhance the capability and useful ranges of our method by reducing its dependence on initial transformation estimations, especially on aerial images with few roads. (A2) Since H projects the points x 1 , x 2 to x 1 , x 2 , we get: x 1 = (Ax 1 + t) s T x 1 + 1 , x 2 = (Ax 2 + t) s T x 2 + 1 .