A Grid Feature-Point Selection Method for Large-Scale Street View Image Retrieval Based on Deep Local Features

Remote Sens. 2020, 12, 3978


Introduction
Street view images can be used to analyze and solve many problems, such as street-level vegetation estimates [1], leaf area indexes [2], and urban land use mapping [3]. The above research presupposes that the locations of the street view images are known. However, some valuable or interesting street view images have no, or only imprecise, location information. Some photos are taken by devices without GPS or in urban environments subject to multipath error [4]. Moreover, some pictures lose their location information during propagation, such as upload and download. Thus, it is necessary to recover the locations of street view images, and the problem is attracting increasing attention [5][6][7][8][9].
The content of street view images provides effective clues to the shooting location, as it carries contextual information about the scene. The location of a street view image can be identified and matched by retrieving a large-scale street view dataset. Street view image retrieval extracts visual information as image features, and the features are then encoded to improve retrieval efficiency and speed on the large-scale dataset. The encoding method clusters the image features and numbers the clusters so that query features are searched within a small cluster during retrieval. Finally, the similarity is computed between the query and dataset features. At present, local features offer better retrieval precision than global features. However, the quantity of local feature data is too large, and it is difficult to apply to large-scale street view image retrieval. Therefore, this paper proposes a grid feature-point selection method (GFS) based on deep local features. The method deletes feature points with low attention scores to reduce the feature file size while minimizing the precision loss. The CNNs are constructed first to extract attentive and multiscale deep local features (DELF) that learn the weights of objects in the image through attention mechanisms during training and require only weakly-supervised classification labels. Then, distinctive features are selected by GFS to reduce memory usage so that local features can be applied to large-scale street view image retrieval. In this step, the image is divided into multiple grid regions, and features are mapped to the corresponding regions. According to the attention score, the features in each region are selected to remove redundant features. In addition, product quantization and the inverted index are used to improve retrieval efficiency. An ANN search is performed to retrieve query images.
GFS compresses the number of feature points by selecting features with higher weights in the local region and shows higher accuracy compared with other methods.

Methodology
The GFS method includes the following: (1) data preprocessing, (2) street view image feature extraction based on DELF, (3) a grid feature-point selection method, (4) large-scale street view image retrieval based on product quantization, and (5) retrieval method comparison and evaluation. The workflow of the paper is shown in Figure 1.


Data Preprocessing
The geotagged perspective view images extracted from panoramas are used as the retrieval and training dataset, and images taken by a mobile phone are used as the query images. As shown in Figure 2, the original images of the dataset are fragments of the equirectangular panoramic image collected by the camera equipped on the vehicle. Multiple fragments are therefore merged into a complete panorama. The deformation at the center of the image is small, while that at the upper and lower edges is large. Considering that the deformation of the panoramic image is inconsistent with the query image and the matching results are poor [7], the panoramic images are projected into the same perspective view as the query image so that the similarity between their features is improved. In more detail, the street view image preprocessing is as follows: (1) Merge image fragments into a complete street view panorama. The equirectangular panorama is a single image with an aspect ratio of 2:1, so crop the black part of the panorama and keep the proper aspect ratio. (2) Set projection parameters. Parameters include FOV (field of view), pitch, and heading. The FOV determines the field of the projection, while the pitch and heading determine the location of the projection. The larger the FOV, the larger the region covered by the image. However, if the FOV is too large, the edges of the perspective image will be deformed, so there is a trade-off between image coverage and image distortion. The recommended FOV range is 40°–80°. (3) Project the panorama into a perspective view image. Specifically, the projection process is divided into two steps. First, the equirectangular panorama is mapped to a sphere. Then, according to the projection parameters, a certain range of the sphere is mapped to a plane to obtain a perspective view image. (4) Generate training data.
Every three neighboring panoramas are grouped into a class by querying the nearest neighbor images. Images with different orientations within a class are removed so that only images of the same scene remain; that is, some images facing a certain scene are retained in each class.
The photos taken with a mobile phone have a high resolution that slows down feature extraction. In addition, the upper part of a photo usually has no corresponding street view in the dataset or shows a strongly deformed image, and the lower part shows roads and pedestrians. Therefore, the photos are center cropped and resized to 640 × 480, and a Gaussian filter is finally applied to remove noise.
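The sphere-to-plane projection in step (3) can be sketched with NumPy. This is a minimal nearest-neighbour version under assumed conventions (equirectangular input with a 2:1 aspect ratio, z-forward camera, pitch about the x-axis, heading about the y-axis); function and parameter names are ours, and a production pipeline would use bilinear interpolation rather than nearest-neighbour sampling.

```python
import numpy as np

def equirect_to_perspective(pano, fov_deg, pitch_deg, heading_deg, out_w, out_h):
    """Project an equirectangular panorama (H x 2H) to a perspective view.
    fov/pitch/heading are in degrees; nearest-neighbour sampling for brevity."""
    ph, pw = pano.shape[:2]
    fov, pitch, heading = np.radians([fov_deg, pitch_deg, heading_deg])
    f = (out_w / 2) / np.tan(fov / 2)              # focal length in pixels
    # Camera-frame ray for every output pixel (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(out_w) - out_w / 2,
                       np.arange(out_h) - out_h / 2)
    rays = np.stack([u, v, np.full_like(u, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate each ray by pitch (about x), then heading (about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    ch, sh = np.cos(heading), np.sin(heading)
    rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    ry = np.array([[ch, 0, sh], [0, 1, 0], [-sh, 0, ch]])
    rays = rays @ (ry @ rx).T
    # Ray direction -> spherical longitude/latitude -> panorama pixel.
    lon = np.arctan2(rays[..., 0], rays[..., 2])    # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1, 1))   # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * pw).astype(int) % pw
    py = np.clip(((lat / np.pi + 0.5) * ph).astype(int), 0, ph - 1)
    return pano[py, px]
```

With pitch = 0 and heading = 0, the center of the output view samples the center of the panorama, which is a quick sanity check for the mapping.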


Street View Image Feature Extraction Based on DELF
Attention-based multiscale deep local features are used to represent street view images. The attentive module can learn the targets with the same semantic information in the samples of each class, such as buildings and signs, and increase their feature weights. DELF [21] is used to extract attention-based features. Image pyramids are constructed by resizing images and fed into CNNs to generate multiscale features that can deal with scale changes. A CNN convolutional layer is used to extract dense local features, which are sorted according to the scores provided by the attention layer, and the features with lower scores are removed. The street view feature extraction network has two modules, which are trained separately, as follows: (1) Dense feature extraction module. A fully convolutional neural network is used to extract the dense features. ResNet50 [46] is employed for training, and the output of block 4 is used for the dense features. This module is trained in the first step. A total of 1500 classes of the Google Landmark dataset [21] and 500 classes of the San Francisco dataset [7] are used to train for 100 epochs, and then 2000 classes of the Hong Kong street view dataset are used to fine-tune for 100 epochs. The cross-entropy loss function is adopted. (2) Attention module. The attention score function α(f_n; θ) is constructed, where θ denotes the parameters of the score function, f_n ∈ R^d is the n-th feature vector (n = 1, …, N), and d is the size of f_n, which depends on the dimensionality of the outputs of the convolution layer. The weighted feature f'_n is given by

f'_n = α(f_n; θ) · f_n, (1)

As shown in Figure 3, two convolution layers with the softplus function are embedded into the dense feature extraction network in the second step. A subsequent pooling layer and fully connected layer are added to predict classes. This module is trained singly in this step, which means that the weights of the dense feature extraction module are frozen during the training process. The image pyramids generated from the Hong Kong street view dataset are used as the training dataset for 100 epochs. The cross-entropy loss function is adopted. The output of the fully connected layer y is the sum of the weighted features:

y = W · Σ_{n=1}^{N} α(f_n; θ) · f_n, (2)

where W ∈ R^{M×d} represents the weights of the final fully connected layer of the CNNs trained to predict M classes.
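The attention module described above can be sketched in PyTorch. This is a minimal illustration rather than the paper's exact implementation: the feature dimension d, hidden width, and class count are placeholder assumptions, and the two 1×1 convolutions with a softplus output follow the description of the score function α(f_n; θ) and the weighted-sum pooling fed to the classifier.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """DELF-style attention sketch: two 1x1 convolutions with a softplus
    output give a non-negative score per spatial feature; the classifier
    sees the score-weighted sum of the dense features."""
    def __init__(self, d=1024, hidden=512, num_classes=2000):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(d, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1), nn.Softplus())  # alpha(f_n; theta) >= 0
        self.fc = nn.Linear(d, num_classes)          # W, predicting M classes

    def forward(self, feats):                        # feats: (B, d, H, W)
        alpha = self.score(feats)                    # (B, 1, H, W)
        pooled = (alpha * feats).sum(dim=(2, 3))     # sum_n alpha_n * f_n
        return self.fc(pooled), alpha
```

During the second training step only `score` and `fc` would receive gradients; the dense backbone producing `feats` stays frozen, matching the description above.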


A Grid Feature-Point Selection Method
Although the precision of local features is better than that of global features, an image requires a large number of local features for its description, which leads to a need for more storage space. Due to the large number of images in a street view image retrieval system, the feature files take up considerable hard disk storage space. Table 1 shows the relationship between the number of feature points and the size of the feature file. It is impossible to load them directly into memory for image retrieval. Thus, product quantization, presented in Section 2.4, is used to compress the image features for more efficient coding. However, when performing product quantization, the original feature file still needs to be loaded into GPU or main memory to train the codebook. Although the CPU can be used to index features, training is slow, while execution on the GPU is dozens or even hundreds of times faster than on the CPU. However, graphics memory is limited and can hardly hold all feature files. Therefore, it is necessary to reduce the image feature file to an appropriate size, use as few distinctive features as possible to represent the images, and minimize the loss of precision. An effective feature selection method can reduce the size of the feature file and improve the speed of index construction. To reduce memory consumption, the grid feature-point selection method (GFS) is proposed. As shown in Figure 4, unlike methods that simply select the first N features with the highest attention scores, the image is divided into multiple regions by a uniform grid, and the first N features are selected in each region. The method steps are as follows: (1) The street view image of size H × W is divided into I × J regions, and the size of each region is h × w. Each region contains the corresponding image features.
Features are assigned to a region of the image based on the receptive field of the CNNs, which is calculated from the configuration of the convolution and pooling layers. The center pixel coordinate of the receptive field is used as the position of the feature. In addition, the size of the receptive field is inversely proportional to the scale when performing multiscale feature extraction. The k-th feature of an image is T_k, and its position in the image is (x_k, y_k). The feature is located in region (i, j), and G_k = (i_k, j_k) is the region number corresponding to feature T_k, where i_k = ⌊y_k / h⌋ and j_k = ⌊x_k / w⌋. (2) The number of feature points M_{i,j} in each region is counted according to G_k.
(3) Finally, two strategies for selecting feature points are proposed.
a. GFS-N: The first strategy selects by number: the first N features with the highest attention scores in each region are selected, and all features in a region are selected if it contains fewer than N features, so the number of features S_{i,j} selected in region (i, j) is S_{i,j} = min(N, M_{i,j}). b. GFS-P: The second strategy selects by percentage: the top n% of features in each region are selected. GFS-P is effectively equivalent to selecting high-score feature points without a grid; it is therefore used as a baseline against which GFS-N is evaluated.
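The GFS-N strategy can be sketched as follows. This is a minimal NumPy illustration under assumed inputs (feature positions already mapped to image pixel coordinates, attention scores as a flat array); the function and parameter names are ours, not the paper's.

```python
import numpy as np

def gfs_n(positions, scores, img_h, img_w, grid_i=8, grid_j=10, n=8):
    """Grid feature-point selection (GFS-N) sketch: divide the image into
    grid_i x grid_j cells and keep the n highest-attention features per cell.
    positions: (K, 2) array of (x, y) feature centres; scores: (K,) attention
    scores. Returns the indices of the selected features."""
    cell_h, cell_w = img_h / grid_i, img_w / grid_j
    # G_k = (i_k, j_k): the grid cell containing feature k.
    i_k = np.minimum((positions[:, 1] // cell_h).astype(int), grid_i - 1)
    j_k = np.minimum((positions[:, 0] // cell_w).astype(int), grid_j - 1)
    keep = []
    for i in range(grid_i):
        for j in range(grid_j):
            idx = np.where((i_k == i) & (j_k == j))[0]
            if idx.size:                    # S_ij = min(n, M_ij) features
                keep.extend(idx[np.argsort(scores[idx])[::-1][:n]])
    return np.sort(np.array(keep, dtype=int))
```

With the paper's 8 × 10 grid and N = 8, at most 640 features per image survive, which bounds the feature file size regardless of how many points the extractor produced.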

Large-Scale Street View Image Retrieval Based on Product Quantization
The street view image retrieval method searches the feature dataset for the vectors closest to the query vectors, but traversing all features is not practical for a large-scale image retrieval system because computing distances among millions of vectors takes considerable time. Therefore, product quantization is deployed to construct an image index, which improves retrieval speed and reduces memory requirements during retrieval.
Product quantization first divides the N D-dimensional features into M parts, so each part holds N (D/M)-dimensional sub-vectors. For example, N 128-dimensional features are divided into 4 parts, and each part has N 32-dimensional sub-vectors. Next, a codebook of size K is trained for each part separately. Then, each part of a feature generates an index value, which is the number of the nearest cluster center for that part. Finally, each feature is represented by the concatenation of its M index values. The process is shown in Figure 5. To improve the precision of image retrieval, asymmetric distance computation is performed, where the original features of the query image do not undergo product quantization. The distance between two image features is approximated by the Euclidean distance computed from the indexes. The resulting image ID can be queried from the image features through the inverted index file. The task is completed using Faiss [47], a library for similarity search and clustering of vectors.
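In the paper this step is delegated to Faiss; the NumPy sketch below shows the mechanics at toy scale under assumed sizes (m sub-vectors, codebook size k, plain Lloyd k-means), including the asymmetric distance computation where the query stays uncompressed.

```python
import numpy as np

def train_pq(X, m=4, k=256, iters=10, seed=0):
    """Train one k-means codebook per sub-vector (toy Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    ds = X.shape[1] // m
    codebooks = []
    for p in range(m):
        sub = X[:, p * ds:(p + 1) * ds]
        C = sub[rng.choice(len(sub), k, replace=False)].copy()
        for _ in range(iters):
            assign = ((sub[:, None] - C[None]) ** 2).sum(-1).argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    C[c] = pts.mean(0)
        codebooks.append(C)
    return codebooks

def pq_encode(X, codebooks):
    """Each vector becomes m small integers (nearest centroid per part)."""
    ds = X.shape[1] // len(codebooks)
    return np.stack([((X[:, p * ds:(p + 1) * ds][:, None] - C[None]) ** 2)
                     .sum(-1).argmin(1)
                     for p, C in enumerate(codebooks)], axis=1)

def adc_search(q, codes, codebooks, topk=5):
    """Asymmetric distance: the raw query is compared against codes via
    per-part lookup tables instead of full Euclidean distances."""
    ds = q.shape[0] // len(codebooks)
    tables = [((C - q[p * ds:(p + 1) * ds]) ** 2).sum(1)
              for p, C in enumerate(codebooks)]
    dist = sum(t[codes[:, p]] for p, t in enumerate(tables))
    return np.argsort(dist)[:topk]
```

A database vector is thus stored as only m integers, and each query costs m table lookups per database vector rather than a D-dimensional distance; Faiss adds the coarse inverted lists on top of this so only a few lists are scanned per query.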

An ANN (approximate nearest neighbor) search is performed for each local feature of the query image. As shown in Figure 6, all retrieval results are summarized and correlated with the street view images in the dataset if there are k features that match the query image. The process is as follows: (1) A query image Q contains n features q_i (i ∈ 1…n), and an ANN search is performed on each feature q_i to obtain the most similar top m features f_i (i ∈ 1…m) from the index of the feature database. Performing n queries yields n × m features. (2) Since each feature f_i corresponds to a street view image in the inverted index file, the number of matched features in each street view image is counted, and the results are sorted accordingly. (3) Geometric verification is performed on the retrieval results using RANSAC to exclude distractor images that match the query image under ANN search but differ in visual content. The retrieval results are sorted by the number of inliers and output.
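Steps (1) and (2) amount to a voting scheme over the inverted index, sketched below; the data structures (a per-feature list of matched feature ids and a feature-to-image mapping) are stand-ins for the Faiss index and inverted file, not the paper's actual interfaces.

```python
from collections import Counter

def aggregate_matches(per_feature_results, feature_to_image):
    """Vote counting for steps (1)-(2): each query feature returns its top-m
    database feature ids; the inverted index maps feature id -> image id, and
    images are ranked by how many of their features matched the query."""
    votes = Counter()
    for matches in per_feature_results:   # one list of m feature ids per query feature
        for fid in matches:
            votes[feature_to_image[fid]] += 1
    return votes.most_common()            # [(image_id, match_count), ...]
```

RANSAC-based geometric verification (step 3) would then rerank only this shortlist, which keeps the expensive inlier counting off the full dataset.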

Retrieval Method Comparison and Evaluation
The results of the method in this paper are compared with GEM [22], CROW [28], RMCA [29], and the Hessian-Affine extractor + SIFT descriptor (Hesaff SIFT) [12] to evaluate street view image retrieval performance. GEM, CROW, and RMCA are global features based on CNNs; thus, the dense feature extraction network with the same configuration as in Section 2.2 is used, and the dense features are pooled to extract the global features. Hesaff SIFT is a handcrafted local feature that performs better than SIFT on large-scale retrieval tasks. Product quantization is used to generate indexes for the retrieval features. In addition, all methods are tested on the Hong Kong street view dataset.
In this paper, a result is considered correct if it shares visual information with the query image. The evaluation measure P_v is employed, given by P_v = (1/Q) Σ_{i=1}^{Q} q_i, where Q is the number of query images and q_i = 1 if at least one image with the same visual content is retrieved within the first N results for the i-th query image; otherwise, q_i = 0. In addition, the distance between the retrieval results and the query image is evaluated. P_r within different radii of the query image is counted, given by P_r = (1/Q) Σ_{i=1}^{Q} r_i, where d_j denotes the distance between the query image and the j-th result, and r_i = 1 if at least one image with d_j within the query image radius D is retrieved within the first N results for the i-th query image; otherwise, r_i = 0.
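Both measures reduce to "at least one hit in the top N", so a single helper covers P_v and P_r; only the definition of the relevant set changes (images with the same visual content for P_v, images within radius D for P_r). A sketch with assumed input types:

```python
def precision_at_n(retrieved, relevant, n=1):
    """P_v / P_r sketch: the fraction of queries with at least one relevant
    image among the top n results. retrieved: per-query ranked image ids;
    relevant: per-query set of ids counted as correct (same visual content
    for P_v, within radius D of the query location for P_r)."""
    hits = [1 if set(r[:n]) & rel else 0 for r, rel in zip(retrieved, relevant)]
    return sum(hits) / len(hits)
```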

Study Area
The Causeway Bay and Wan Chai areas in the northern part of Hong Kong Island are selected as the study area, which is famous for its many skyscrapers with diverse building façades. The area is dense with roads, population, and buildings and has a high rate of street view image collection. The buildings, especially residential buildings, have similar styles. Reflective glass curtain walls, numerous vehicles, and pedestrians usually represent a considerable challenge for street view image retrieval.
The Hong Kong street view dataset in the study area contains 239,400 images (640 × 480) from 6650 panoramic classes collected in 2017 and 38 query images taken by mobile phones in 2019. The experimental data lie within the range (114.16267, 22.283887) to (114.18704, 22.273796), and the regional area is 2.81 km². The dataset covers almost all major roads in the area. Each class is labeled with a GPS coordinate, and the distance between adjacent classes is approximately 10-12 m. The geographic distribution of these images is shown in Figure 7, and example images are presented in Figure 8. It is worth noting that the dataset images and the query images were not taken at the same time, which means that billboards on building façades may have been replaced or buildings may have been renovated in other styles. There are also obstacles in the images, such as vehicles and pedestrians. Both factors add difficulty to the retrieval task.


Experiments Design
The experiments are completed on a computer equipped with an RTX 2080TI GPU, an INTEL I7-9700K CPU, and 32 GB RAM. The detailed implementation of the experiment is as follows: (1) Data preprocessing. Projection parameters are set to crop the panoramic image and generate perspective images with approximately 20% overlap between two adjacent images.
The parameter settings and other information are shown in Table 2. Each combination of parameters generates an image, and a panoramic image produces a total of 36 images. (2) Street view image feature extraction based on DELF. As shown in Table 3, a configuration similar to [21] is adopted during the training process, and the CNNs are built using PyTorch. The first 1000 local features are extracted based on the attention score. (3) Grid feature-point selection method. After experimenting with different sizes and numbers of grids, it was found that changes in the number and size of grids had no significant impact on retrieval precision or the number of features. Therefore, as shown in Table 4, the grid is set to 8 × 10, and N (GFS-N) and n% (GFS-P) are set to different values to measure the impact on retrieval, where N = all refers to retaining all points in a grid region.

Results
As shown in Table 5, GFS-N is evaluated on the Hong Kong street view dataset. Feature points refers to the average number of features per image after performing the method, and index size refers to the size of the image database index file generated by product quantization. The precision of the top-1 retrieval results is presented. Retrieval precision increases with N until it reaches its highest value at N = 8, after which it changes only slightly. The size of the index file and the number of feature points grow proportionally with N.
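The GFS-N selection evaluated here can be sketched as follows. The 8 × 10 grid is assumed to mean 8 rows × 10 columns on the 640 × 480 images, and `gfs_n` is an illustrative NumPy implementation, not the authors' code:

```python
import numpy as np

def gfs_n(points, scores, width, height, rows=8, cols=10, n=8):
    """GFS-N: keep at most the n highest-attention-score features per grid cell.

    points: (M, 2) array of (x, y) pixel coordinates; scores: (M,) attention
    scores. Returns the sorted indices of the retained feature points.
    """
    points = np.asarray(points, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # map every feature point to its grid cell
    col = np.minimum((points[:, 0] / width * cols).astype(int), cols - 1)
    row = np.minimum((points[:, 1] / height * rows).astype(int), rows - 1)
    cell = row * cols + col
    keep = []
    for c in np.unique(cell):
        idx = np.where(cell == c)[0]
        # retain the n points with the highest scores in this cell
        keep.extend(idx[np.argsort(scores[idx])[::-1][:n]].tolist())
    return sorted(keep)

# demo: 20 synthetic points falling into a single cell -> only the 8 best survive
pts = [(i, 5.0) for i in range(20)]
kept = gfs_n(pts, scores=range(20), width=640, height=480, n=8)
```

A cell containing fewer than N points is left untouched, which matches the observation below that such regions are not affected by N.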
The top-1 retrieval results of GFS-P are shown in Table 6. Since the number of feature points is proportional to the size of the index file, only the number of feature points is counted. Retrieval precision gradually increases with the feature-selection proportion, but beyond 70% the growth rate drops: compared with retaining 100% of the points, selecting 70% loses 5.27% precision while reducing the number of feature points by 22.02%. The number of feature points is positively related to the percentage.
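GFS-P replaces the fixed per-cell cap with a percentage. A sketch under the same grid assumptions as GFS-N; the per-cell rounding rule (ceiling, so that no non-empty cell is emptied) is an added assumption the paper does not specify:

```python
import math
import numpy as np

def gfs_p(points, scores, width, height, rows=8, cols=10, pct=70):
    """GFS-P: keep the top pct% of features, by attention score, per grid cell."""
    points = np.asarray(points, dtype=float)
    scores = np.asarray(scores, dtype=float)
    col = np.minimum((points[:, 0] / width * cols).astype(int), cols - 1)
    row = np.minimum((points[:, 1] / height * rows).astype(int), rows - 1)
    cell = row * cols + col
    keep = []
    for c in np.unique(cell):
        idx = np.where(cell == c)[0]
        k = math.ceil(len(idx) * pct / 100.0)     # per-cell quota (assumed: ceil)
        keep.extend(idx[np.argsort(scores[idx])[::-1][:k]].tolist())
    return sorted(keep)

# demo: 10 points in one cell at pct=70 -> the 7 highest-score points remain
kept = gfs_p([(i, 5.0) for i in range(10)], range(10), 640, 480, pct=70)
```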
As shown in Table 7, GFS (N = 8) outperforms the other methods on the top-1 results. The P_v of GFS is 5.27-23.59% higher than those of the other methods, and its P_r=100, P_r=200, and P_r=500 increase by 13.16-26.32%, 10.89-26.32%, and 0-21.05%, respectively.

The Evaluation of Compression Capability and Precision of GFS
GFS effectively reduces feature volume and memory usage, which alleviates the problem that deep local features produce too many feature points to load into memory in large-scale street view image retrieval. Figure 9 shows the precision of GFS-N and GFS-P at different numbers of feature points. Overall, combined with Tables 5 and 6, GFS-N is better than GFS-P in the range of 115 to 464 feature points, which means that GFS outperforms simply selecting high-score feature points without a grid. In addition, depending on the application scenario, GFS can reduce the number of feature points by 32.27-77.09%. GFS-N can reduce the feature points to between 340 and 446 per image without losing precision. Compared to the results without any feature-point filtering (N = all), the number of feature points at the precision peak (N = 8) is reduced by 32.27%, and precision increases by 2.63%. The number of feature points is reduced by 50.80% (N = 6), and the precision decreases by 2.64%. Further, the number of feature points is reduced by 77.09% (N = 2), and the precision decreases by 15.79%. GFS filters out the features with the smallest attention scores; most of these features express interfering objects, which would otherwise reduce the precision of the ANN search. The grid selects the feature points with high attention scores in each local region and thus keeps richer feature points across the image for retrieval. In GFS-N, since N limits the upper bound on the number of feature points in each grid region, a region containing fewer than N feature points is not affected by N.
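The index-size figures above come from product quantization of the descriptors that survive GFS. A minimal PQ sketch is given below; the sub-vector count m, codebook size k, and the 40-dimensional random descriptors (a stand-in for PCA-reduced DELF) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def _kmeans(x, k, iters=20, seed=0):
    """Tiny k-means (Lloyd's algorithm) for training one sub-vector codebook."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers

class ProductQuantizer:
    """Each d-dim descriptor is split into m sub-vectors, and every sub-vector
    is replaced by the index of its nearest codeword in a k-word codebook,
    so one descriptor compresses from d floats to m small integers."""
    def __init__(self, m=4, k=16):
        self.m, self.k = m, k
    def fit(self, x):
        self.splits = np.array_split(np.arange(x.shape[1]), self.m)
        self.codebooks = [_kmeans(x[:, s], self.k) for s in self.splits]
        return self
    def encode(self, x):
        codes = np.empty((len(x), self.m), dtype=np.uint8)
        for i, (s, cb) in enumerate(zip(self.splits, self.codebooks)):
            codes[:, i] = ((x[:, s][:, None] - cb[None]) ** 2).sum(-1).argmin(1)
        return codes
    def decode(self, codes):
        return np.hstack([self.codebooks[i][codes[:, i]] for i in range(self.m)])

rng = np.random.default_rng(1)
descs = rng.normal(size=(500, 40)).astype(np.float32)  # stand-in descriptors
pq = ProductQuantizer(m=4, k=16).fit(descs)
codes = pq.encode(descs)   # 4 bytes per descriptor instead of 40 floats
approx = pq.decode(codes)
```

In a production system a library such as FAISS, which also provides the inverted index for the ANN search, would replace this hand-rolled quantizer.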

Comparison with Other Methods
Compared with the GEM, RMAC, CROW, and Hesaff SIFT methods, GFS shows better precision: the P_v of GFS (N = 8) increases by 13.16%, 5.27%, 23.69%, and 15.79%, respectively. Figure 10 shows a comparison of the results with other methods. In most cases, the retrieval system correctly retrieves the same scene as the query image. However, due to differences in shooting location, urban environment, and weather, a query image may differ in content from the dataset, so it is necessary to extract image features that are invariant to angle and scale. GFS is able to retrieve images taken at different angles but with the same semantic content, while other methods may return results that are similar in content but actually incorrect, such as for the sixth query image. There may be several reasons for GFS retrieval failures. One is that the feature samples selected during index training are not representative, which causes the query features to be searched in the wrong cluster, as for the fourth and fifth query images. As shown in Figure 11, although the recall of most methods on large-scale datasets is poor, GFS still achieves a high recall of correctly retrieved images. In addition to retrieving street scenes similar to the query image overall, the method can also extract features from small regions of the image, such as signs, billboards, and building textures. The correct extraction of deep local features has a decisive influence on the retrieval results, which indicates that GFS selects representative features.

Limitations and Future Enhancements
GFS is proposed to reduce storage requirements and shows excellent performance, but there are still some limitations. Although GFS can reduce the number of feature points and the index file size, a regular rectangular grid cannot filter the features of irregular objects. Furthermore, since street view images contain many uncertain interference factors, such as vehicles and pedestrians, it is difficult for a regular grid to effectively filter the features expressing these interferences. Interfering content could instead be filtered purposefully with other methods, such as semantic segmentation [48,49] or object detection [50,51], to reduce its contribution to the similarity calculation. Moreover, since research on feature compression is currently scarce, GFS will be compared with future methods as they emerge.
In addition, experiments are only performed on a partial area of one city, and the query data coverage is relatively small. Different urban environments, such as snowy or nighttime scenes, should also be considered, and further experiments will be conducted in more varied environments.
Finally, because the shooting location of a street view image usually differs from that of the query image, retrieval based only on visual information generally introduces some localization error. Combining retrieval with 3D reconstruction [52] could improve the precision of street view localization.

Conclusions
This paper proposes a grid feature-point selection method (GFS) suitable for large-scale street view image retrieval based on deep local features. Attention-based multiscale features are extracted to represent street view images. A grid divides each image into several rectangular regions, and a certain number of features is selected in each region to reduce the number of feature points. Product quantization is performed to construct an index of the features and speed up image retrieval. The Hong Kong street view dataset and mobile phone photos are used in the experiments. The results show that GFS can select representative local features and reduce the number of feature points by 32.27-77.09% compared with the raw feature points. In addition, GFS outperforms the other methods in retrieval precision. Future work will focus on selecting more representative features and improving the robustness of retrieval in a variety of urban environments.