Hierarchical Clustering-Based Image Retrieval for Indoor Visual Localization

Abstract: Visual localization is employed for indoor navigation and embedded in various applications, such as augmented reality and mixed reality. Image retrieval and geometric measurement are the primary steps in visual localization, and the key to improving localization efficiency is to reduce the time consumption of image retrieval. Therefore, a hierarchical clustering-based image-retrieval method is proposed to hierarchically organize an off-line image database, keeping the time consumption of image retrieval within a reasonable range. The image database is hierarchically organized in two stages: scene-level clustering and sub-scene-level clustering. In scene-level clustering, an improved cumulative sum algorithm is proposed to detect change points and then group images by global features. On the basis of scene-level clustering, a feature tracking-based method is introduced to further group images into sub-scene-level clusters. An image-retrieval algorithm with a backtracking mechanism is designed and applied for visual localization. In addition, a weighted KNN-based visual localization method is presented, and the estimated query position is solved by the Armijo–Goldstein algorithm. Experimental results indicate that the running time of image retrieval does not increase linearly with the size of the image database, which is beneficial to improving localization efficiency.


Introduction
With the development of communication technology, smart mobile terminals, such as smartphones and tablet computers, have become indispensable in modern society. Various applications on smart mobile terminals bring convenience to many aspects of people's lives, and one example is navigation applications. The Global Navigation Satellite System (GNSS) allows individuals to acquire position information at any moment in outdoor environments [1][2][3][4]. Navigation and positioning services play crucial roles in public traffic, maritime transportation, and aviation. As the interior environments of buildings become increasingly complex, demands for indoor position services continue to rise. However, due to the shielding effect of structures, GNSS signals are incapable of penetrating buildings, so users cannot obtain reliable position services from the GNSS indoors. Therefore, a stable and efficient indoor localization method independent of satellite signals has become a research hotspot in recent years. Numerous daily activities can benefit from indoor localization technologies, such as shopping in large malls, finding books in libraries, and planning routes in railway stations and airports.
Many signal-based localization technologies have been investigated for querying indoor position information, such as WiFi-based [5,6], Bluetooth-based [7,8], and UWB-based [9][10][11] methods. All these methods, however, require investments in localization infrastructure. For example, WiFi-based approaches demand that a mobile terminal receive signals transmitted by more than one access point [12,13].

A typical indoor visual localization system contains two stages: an on-line stage and an off-line stage, as shown in Figure 1. In the off-line stage, images are captured by the database camera mounted on the mapping equipment, and poses of the equipment are recorded simultaneously. In order to construct an indoor 3D map, Microsoft Kinects and laser scanners are also mounted on the mapping equipment [19]. An off-line database should be generated before the implementation of visual localization. It contains the essential elements for localization: database images, poses (including orientations and positions) of the equipment, and indoor 3D maps.
In the on-line stage, a query image is captured by the user and uploaded to the server by wireless networks. The most similar database images (i.e., matched database images) to the query image are retrieved based on the visual features extracted from the images. Then, the position of the matched database images can be employed to estimate the position of the query camera. Accuracy and efficiency of image retrieval are the key to ensuring the good performance of the system. A hierarchical clustering-based image retrieval is proposed in this paper and mainly discussed in the following.

Feature Extraction and Pre-Processing
Gist is a scene-centered global feature commonly used in scene classification and place recognition. In this paper, Gist features are employed for scene-level image clustering by change-point detection. In the implementation of creating the off-line database (i.e., an indoor 3D map), database images are successively acquired by the mapping equipment in the same indoor scene, and visual features of these images have high correlations. In contrast, when the mapping equipment moves from one indoor scene to another, the correlations of the captured database images are weakened. Based on this characteristic of database images, change points can be detected in the global features extracted from the images, so that the database images captured in the same indoor scene are grouped into one cluster, which achieves scene-level image clustering. The center of each cluster represents the main features of the scene, so the query image can retrieve each cluster in order by measuring the difference between the query image and the centers of the database image clusters.
To reduce computation complexity, visual feature vectors should keep a low dimension, so an image is regarded as a whole from which global features are extracted. For each query image and database image, a filter bank with three scales (S_G = 1, 2, 3) and six orientations (O_G = 0°, 60°, 120°, 180°, 240°, 300°) is used in Gist feature extraction, where S_G and O_G present the spatial scale levels and cardinal orientations, respectively. During feature extraction, images are processed via convolution with the multi-channel filters, and then the filtering results are concatenated to achieve 18-dimensional (3 × 6 = 18) feature vectors. G_Q and G_D denote the global features extracted from the query image and database images, respectively. Figure 2 shows examples of database images and the corresponding spectrograms of Gist features.
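To make the channel layout concrete, the sketch below builds an 18-dimensional global descriptor from a grayscale image. It is only a rough stand-in for Gist: where the real descriptor convolves the image with a Gabor filter bank, this version approximates each (scale, orientation) channel by the mean directional-gradient energy on a downsampled copy of the image. The function names and the energy measure are illustrative choices, not the paper's implementation.

```python
import math

SCALES = (1, 2, 4)                               # pyramid downsampling factors (3 scales)
ORIENTATIONS = tuple(k * 60 for k in range(6))   # 0..300 degrees (6 orientations)

def downsample(img, f):
    """Average-pool a 2D list of gray values by factor f."""
    h, w = len(img) // f, len(img[0]) // f
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = sum(img[i*f+di][j*f+dj] for di in range(f) for dj in range(f))
            out[i][j] = s / (f * f)
    return out

def directional_energy(img, theta_deg):
    """Mean squared intensity change along direction theta.
    The nearest-pixel step is a crude approximation of an oriented filter."""
    t = math.radians(theta_deg)
    dx, dy = round(math.cos(t)), round(-math.sin(t))
    h, w = len(img), len(img[0])
    acc, n = 0.0, 0
    for i in range(h):
        for j in range(w):
            i2, j2 = i + dy, j + dx
            if 0 <= i2 < h and 0 <= j2 < w:
                acc += (img[i2][j2] - img[i][j]) ** 2
                n += 1
    return acc / max(n, 1)

def gist_like(img):
    """18-dimensional global descriptor: 3 scales x 6 orientations."""
    return [directional_energy(downsample(img, f), o)
            for f in SCALES for o in ORIENTATIONS]
```

For a vertical-stripe image, the channel at scale 1 and orientation 0° (intensity change along X) responds strongly, while vertical-step channels stay near zero, mirroring how each Gist channel captures one scale and orientation.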
Figure 2. Examples of database images and the corresponding spectrograms of Gist features.
For visual localization, the query image is captured by a hand-held smartphone, and image retrieval is completed on the server side after the image is uploaded to the server. The Gist feature vectors of the query image and database images are separately presented by G_Q = [g_Q^1, ..., g_Q^p, ..., g_Q^18] and G_D^i = [g_D^(i,1), ..., g_D^(i,q), ..., g_D^(i,18)], where g_Q^p and g_D^(i,q) denote the p-th and q-th elements in the feature vectors G_Q and G_D^i (1 ≤ p ≤ 18 and 1 ≤ q ≤ 18), and i is the index of database images. The database images successively acquired in the same indoor scene have high visual correlations. However, once the mapping equipment switches to another scene, the correlations subsequently decrease. The change-point detection method is employed, aiming to detect the change in the visual correlations between database images, and further, the database images captured in the same scene are grouped into one cluster.
If there are in total n_D database images in the indoor 3D map (i.e., the off-line database), n_D features can be extracted from the database images. Therefore, a Gist feature matrix M_G can be obtained by organizing all feature vectors:

M_G = [G_D^1, G_D^2, ..., G_D^(n_D)]^T

which is an n_D × 18 matrix whose q-th column M^q = [g_D^(1,q), ..., g_D^(n_D,q)]^T collects the q-th feature element of every database image. To achieve scene-level clustering, change-point detection acts on each column in M_G. Specifically, change points should be separately detected in each column vector M^q. On account of noise existing in feature extractions, a pre-process should be applied on Gist features. Therefore, the Kalman filter and the Kalman smoother are used in this paper to recover the visual correlations of database images acquired in the same scene. The purpose of feature pre-processing is to avoid false detection caused by noise, namely, detecting a change point without a scene change.
If the state variable and the observed variable of features are separately set as x_i and y_i, the system equation is:

x_i = Φ_(i−1) x_(i−1) + w_(i−1)
y_i = H_i x_i + v_i

where Φ_(i−1) is the state-transition matrix and H_i is the observation matrix. The process noise w_i and measurement noise v_i satisfy:

w_i ~ N(0, Q_i),  v_i ~ N(0, R_i)

where N(µ, σ) denotes a normal distribution with the expectation µ and the variance σ. The initial value x_0 is set before filtering begins. The discrete Kalman filter estimates the process state by feedback control. The typical Kalman filter can be divided into two parts, namely, the time-update part and the measurement-update part. In the time-update part, to obtain the prior estimate of the next time state, the Kalman filter calculates the state variables of the current time and the estimated covariance of errors by the update equation. In the measurement-update part, new observations are combined with prior estimates to obtain more reasonable posterior estimates by feedback operations.
When using a discrete Kalman filter to process data, the error covariance P_i before system updating should be calculated by:

P_i = Φ_(i−1) V_(i−1) Φ_(i−1)^T + Q_(i−1)

where V_(i−1) is the error covariance after system updating. The Kalman gain K_i can be further calculated based on the error covariance P_i by:

K_i = P_i H_i^T (H_i P_i H_i^T + R_i)^(−1)

According to the un-updated error covariance and Kalman gain, the error covariance can be updated by:

V_i = (I − K_i H_i) P_i

Then, the posterior estimate x̂_i of the state variable (updated value) can be obtained by:

x̂_i = x̂_i^− + K_i (y_i − H_i x̂_i^−)    (7)

where the prior estimate x̂_i^− of the state variable can be obtained by the extrapolation formula:

x̂_i^− = Φ_(i−1) x̂_(i−1)    (8)

System iterative updates can be achieved by Equations (7) and (8), which achieves Kalman estimation for all measured values.
After Kalman filtering, Kalman smoothing needs to be performed on the filtering results. According to the Kalman backward smoothing equations, the smoothed estimate x̂'_i of the state variable and the smoothed error covariance V'_i can be obtained by:

x̂'_i = x̂_i + J_i (x̂'_(i+1) − x̂_(i+1)^−)    (9)
V'_i = V_i + J_i (V'_(i+1) − P_(i+1)) J_i^T    (10)

where J_i is defined as:

J_i = V_i Φ_i^T P_(i+1)^(−1)

The optimized vector M_K^q = [g_K^(1,q), ..., g_K^(n_D,q)]^T can be obtained based on the Kalman smoother by Equations (9) and (10), where g_K^(i,q) is the feature element after Kalman filtering and smoothing. n_D is the total number of database images, and q is the index of feature vectors. As 18 Gist features are extracted from a database image, the range of q satisfies 1 ≤ q ≤ 18.
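For the scalar feature sequences used here (one element g^(i,q) per database image), the filter and smoother above reduce to a few lines. The sketch below assumes a random-walk model (Φ = H = 1) and treats the noise variances q and r as tuning parameters; both assumptions are ours, not the paper's.

```python
def kalman_smooth(ys, q=1e-4, r=1e-2):
    """Scalar Kalman filter followed by an RTS (backward) smoother for one
    Gist feature sequence ys. Random-walk model: Phi = H = 1; q and r are
    assumed process and measurement noise variances (tuning parameters)."""
    n = len(ys)
    x_prior, p_prior = [0.0] * n, [0.0] * n   # prior estimate x^-_i and P_i
    x_post, v_post = [0.0] * n, [0.0] * n     # posterior estimate and V_i
    x, v = ys[0], r                           # initialize from first observation
    for i in range(n):
        xp = x                     # extrapolation (Eq. 8): x^-_i = hat-x_{i-1}
        pp = v + q                 # prior covariance: P_i = V_{i-1} + q
        k = pp / (pp + r)          # Kalman gain
        x = xp + k * (ys[i] - xp)  # posterior estimate (Eq. 7)
        v = (1 - k) * pp           # updated covariance
        x_prior[i], p_prior[i], x_post[i], v_post[i] = xp, pp, x, v
    # backward smoothing pass (Eqs. 9-10, scalar form)
    xs = x_post[:]
    for i in range(n - 2, -1, -1):
        j = v_post[i] / p_prior[i + 1]        # smoother gain J_i
        xs[i] = x_post[i] + j * (xs[i + 1] - x_prior[i + 1])
    return xs
```

Running this over a near-constant feature sequence pulls the small fluctuations toward the scene mean, which is exactly the false-detection suppression the pre-processing step is after.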

Image Clustering Based on CUSUM Change-Point Detection
Change-point detection on Gist features of database images is to group the successive database images between change points into one cluster, thereby realizing scene-level database image clustering. For a random process that occurs in chronological order, change-point detection determines whether the distribution or distribution parameters of the random elements in the process suddenly change at a certain moment. In this paper, change points are detected on the Gist features extracted from successive database images, thereby finding the database image at which the indoor scene changes. When the change-point detection on all Gist features is completed, the database images between the change points are grouped into one cluster, and these images are deemed to be acquired in the same scene (e.g., office, kitchen, corridor, etc.). The rationality of the algorithm is that when constructing the off-line 3D indoor map, the database camera successively captures the database images in the same scene. Therefore, the obtained database images are successive in the same scene. That is, the Gist features of the database images have a certain correlation. Once the indoor scene changes, it is possible to perceive the occurrence of such a change by detecting the change points of the Gist features.
The CUSUM (Cumulative Sum) algorithm used for change-point detection in this paper is an anomaly detection method commonly used in industrial fields. The CUSUM algorithm is generally applied to all data to detect change points. For data located at a certain position in the sequence, other data in front of and behind this position are used for change-point detection. However, such a detection method will undoubtedly increase the time overhead, especially as the amount of historical data becomes larger and larger with time, which will eventually cause excessive time overhead. Therefore, a sliding window is introduced to constrain the number of Gist features that need to be processed in change-point detection, as shown in Figure 3.
Figure 3. Diagram of sliding windows in CUSUM change-point detection.
As shown in Figure 3, an 18-dimensional Gist feature vector can be extracted with different scales and orientations for each database image. After Kalman filtering and smoothing, each Gist feature can be represented as g_K^(i, q), where i indicates the index of database images, and q indicates the element position in a feature vector. For the Gist features extracted from different database images, if they are located in the same column q, the scales and orientations of these features are the same.

In order to constrain the data size during change-point detection, a sliding window of size w is implemented in CUSUM change-point detection. Specifically, if the currently detected position is i, and the corresponding Gist feature is g_K^(i, q), only the features in the sliding window are considered. That is to say, only the features between g_K^(i−w/2+1, q) and g_K^(i+w/2, q) are detected. In addition, the change-point detection algorithm is only applied to the features that have the same scale and orientation.
For different Gist feature sequences, such as M_K^1, ..., M_K^17, and M_K^18, change points should be detected separately.
In the CUSUM change-point detection, considering that the sequence of successively acquired database images is a time series, the acquisition time corresponds to the index of the database image in the sequence. Therefore, the change-point detection in this paper detects the position at which the change point appears in the image sequence. The CUSUM change-point detection algorithm estimates the position of the change point in the sequence by calculating the parameter models of Gist feature sequences. The probability density function is employed to determine the positions of change points in a Gist feature sequence. For a Gist feature sequence M_K^q, the portion inside the sliding window, M_W^q, can be divided into two sub-sequences:

M_W^A = [g_K^(i−w/2+1, q), ..., g_K^(i, q)],  M_W^B = [g_K^(i+1, q), ..., g_K^(i+w/2, q)]

According to the Neyman–Pearson lemma, the core of CUSUM change-point detection can be considered a hypothesis-test problem. For this hypothesis test, the null hypothesis H_0 is the case that the feature element is not a change point. In this case, the indoor scene corresponding to the database image does not change. In contrast to the null hypothesis, the alternative hypothesis H_1 indicates the scene changes, in which case the feature element does not satisfy the previous parameter model. The purpose of the CUSUM algorithm is to monitor and determine at which point hypothesis H_0 switches to H_1 in the feature sequence. For the sub-sequences M_W^A and M_W^B in the sliding window, it is considered that the feature elements in the sequences are independent variables and subject to the normal distribution, so two parameter models, namely, parameter model A and parameter model B, can be obtained.
The probability density functions of the two parameter models are f_A and f_B, which satisfy:

f_A(g) = (1 / (√(2π) σ_A)) exp(−(g − µ_A)² / (2σ_A²))
f_B(g) = (1 / (√(2π) σ_B)) exp(−(g − µ_B)² / (2σ_B²))

where µ_A and σ_A are the expectation and standard deviation of parameter model A, and µ_B and σ_B are the expectation and standard deviation of parameter model B.
If the hypothesis test is applied to the feature elements in the sliding window, the probability density function under the null hypothesis H_0 is f_A, and the probability density function under the alternative hypothesis H_1 is f_B. Thus, a likelihood ratio function h_i relating to g_K^(i,q) can be obtained by:

h_i = ln( f_B(g_K^(i,q)) / f_A(g_K^(i,q)) )

Based on the likelihood ratio function, a cumulative sum function can be defined as:

S_i = Σ_(j≤i) h_j

For the CUSUM algorithm, the position t_ch of the change point can be calculated by a given threshold T_ch:

t_ch = min{ i : S_i ≥ T_ch }    (17)

According to the detection principle shown in (17), a threshold must be set in advance when employing the typical CUSUM algorithm for change-point detection. However, in many cases, it is difficult to determine the threshold for change-point detection due to the complexity and diversity of indoor scenes. Therefore, an improved cumulative sum change-point detection (ICSCD) algorithm is proposed to identify change points without a given threshold.
Since the probability density functions of the two sub-sequences in the sliding window are subject to a normal distribution, the likelihood ratio function can be expanded as:

h_i = ln(σ_A / σ_B) + (g_K^(i,q) − µ_A)² / (2σ_A²) − (g_K^(i,q) − µ_B)² / (2σ_B²)

If the variances of parameter model A and parameter model B are considered identical (i.e., σ_A = σ_B = σ_AB), then the likelihood ratio function h_i can be simplified as:

h_i = (µ_B − µ_A) (g_K^(i,q) − (µ_A + µ_B)/2) / σ_AB²

The cumulative sum function of g_K^(i,q) in the proposed ICSCD algorithm is defined as:

S(g_K^(i,q)) = (µ_B − µ_A) [g_K^(i,q) − (µ_A + µ_B)/2] / σ_AB²    (20)

where the expectations µ_A and µ_B corresponding to parameter models A and B can be calculated by:

µ_A = (2/w) Σ_(j=i−w/2+1)^(i) g_K^(j,q),  µ_B = (2/w) Σ_(j=i+1)^(i+w/2) g_K^(j,q)

According to Equation (20), the cumulative sum function depends on three variables: the Gist feature element g_K^(i,q) detected as the change point, the expectation µ_A of parameter model A, and the expectation µ_B of parameter model B. Therefore, the numerator of Equation (20) can be defined as a change-point detection function to monitor whether the indoor scene changes at position i:

F_C(i) = (µ_B − µ_A) [g_K^(i,q) − (µ_A + µ_B)/2]

Depending on the above analysis, Gist feature sequences can be processed by the change-point detection function. The specific process is as follows: first, for the feature elements between w/2 and (n_D − w/2) in the sliding window, the values of the change-point detection function need to be calculated, where w is the size of the window and n_D is the total number of database images. Second, the peaks of the discrete values of the function F_C are detected, and the peaks correspond to the change points in feature sequences. More than one Gist feature sequence is extracted from the database images (18 feature sequences extracted in this paper, i.e., M_K^q, 1 ≤ q ≤ 18), and each feature sequence needs to be detected separately. Therefore, it is necessary to propose a strategy to integrate the detected change points from each feature sequence to find the images corresponding to indoor scene changes.
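A minimal sketch of the threshold-free detection step, under the assumption that µ_A and µ_B are the half-window means ending at and following the candidate position, and that a change point is reported at the peak of |F_C|:

```python
def icscd_scores(seq, w=8):
    """Change-point detection function F_C for one filtered feature sequence.
    At each candidate position i, mu_a / mu_b are the half-window means up to
    and after i; a scene change makes |F_C| peak near the true boundary."""
    half = w // 2
    scores = [0.0] * len(seq)
    for i in range(half, len(seq) - half):
        mu_a = sum(seq[i - half + 1:i + 1]) / half   # model A: window ending at i
        mu_b = sum(seq[i + 1:i + half + 1]) / half   # model B: window after i
        scores[i] = (mu_b - mu_a) * (seq[i] - (mu_a + mu_b) / 2)
    return scores

def detect_change_point(seq, w=8):
    """Return the index where |F_C| peaks (no preset threshold needed)."""
    scores = icscd_scores(seq, w)
    return max(range(len(scores)), key=lambda i: abs(scores[i]))
```

On a sequence that steps from one level to another, the peak of |F_C| lands at the scene boundary, which is why peak detection can replace the fixed threshold T_ch of the classic CUSUM rule.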
Since 18 Gist feature elements are extracted from each database image, an 18 × n_D matrix containing change-point marks can be obtained for n_D database images. Each mark in the matrix represents whether the Gist feature element at this position is a change point. Specifically, if the feature element in one position is detected as a change point, the value of the mark in this position is defined as 1. Otherwise, the value is defined as 0. As shown in Figure 4, a window with the size l_w (l_w is an odd number, namely l_w = 2k_w + 1, and k_w is a positive integer) slides on the matrix, and the values of the marks in the window are added up. The meaning of the accumulated value of the marks is the total number of change points in the window. If the accumulated value of the marks in the window exceeds a given threshold, the center of the window is considered the position of the change point.
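The voting step can be sketched as follows; the window length l_w and the vote threshold are assumed example values, not the paper's settings:

```python
def fuse_change_points(marks, l_w=5, vote_threshold=6):
    """Fuse per-sequence change-point marks (an 18 x n_D 0/1 matrix) into
    final scene boundaries. A window of odd length l_w slides over image
    positions; if the marks inside it sum past vote_threshold, the window
    center is declared a change point."""
    n_seq, n_d = len(marks), len(marks[0])
    k_w = l_w // 2                     # l_w = 2 * k_w + 1
    boundaries = []
    for c in range(k_w, n_d - k_w):
        votes = sum(marks[r][c2]
                    for r in range(n_seq)
                    for c2 in range(c - k_w, c + k_w + 1))
        if votes > vote_threshold:
            boundaries.append(c)
    return boundaries
```

Because the 18 sequences rarely fire on exactly the same image index, the small window lets near-coincident detections reinforce each other while isolated false marks fall below the vote threshold.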

Feature Tracking-Based Image Clustering, Hierarchical Retrieval and Visual Localization
Based on the proposed ICSCD algorithm, scene-level database image clustering can be achieved by global visual features of database images. However, in the implementation of database construction, there are usually too many database images acquired in the same indoor scene, which leads to excessive time overheads of image retrieval in the on-line stage. Therefore, the database images in the same scene-level cluster are further grouped to achieve sub-scene-level clustering. For sub-scene-level clustering, local visual features of images are employed. In the process of local feature tracking, if the change rate of the number of tracked features is below the given threshold, the database image is defined as a breakpoint image. The database images between the two breakpoint images are acquired in the same sub-scene, so the images between the breakpoints are grouped into one sub-cluster.

Image Clustering Based on KLT Feature Tracking
A database image can be described by the gray-level function I(x_I, y_I, t), where (x_I, y_I) is the position of a pixel on the image, and t denotes the time stamp. Since database images are successively acquired, the time stamp t is equivalent to the image index. In addition, the interval of image acquisition is small, resulting in high visual correlations between database images, which is the precondition for using feature tracking for image breakpoint detection. When the database camera moves within a sub-scene, an overlap exists between adjacent database images, and a certain number of local features on the images can be tracked. According to this characteristic of database images, a KLT (Kanade-Lucas-Tomasi) feature tracking-based image-clustering method is proposed in this section.
The features are continuously tracked between the two adjacent database images with indices i and i + 1 by the KLT algorithm. Let Φ_1 denote an image cluster in the scene-level clustering results. In the off-line stage, the local visual features (i.e., ORB features) are extracted from images and stored in the database used in the sub-scene-level image clustering. To track features, a rectangular window needs to be set on the image whose length is (2w_K + 1) pixels and whose width is (2h_K + 1) pixels. Based on the assumptions of constant brightness, time continuity, and spatial consistency in the rectangular window, it is considered that the matching feature points on the database image satisfy the following relationship:

I(x, y, i) = J(x + d_x, y + d_y, i + 1)    (24)

where I and J are the gray values on the two adjacent images, and d_x and d_y denote the displacement distances in the X and Y directions on the image, respectively. Equation (24) indicates that for the matching feature points on the database image, only displacement changes occur between the adjacent images, and the magnitude of the gray values does not change. The core of feature tracking is to solve the displacement d_xy = [d_x, d_y]^T. In order to obtain the displacement change, a sum of the squared intensity difference function is defined as:

ε(d_xy) = Σ_((x,y)∈W) [ I(x, y, i) − J(x + d_x, y + d_y, i + 1) ]²    (26)

Taking the derivative of the sum of the squared intensity difference function and setting it to zero, the optimal solution d*_xy of the displacement can be obtained by:

∂ε / ∂d_xy = −2 Σ_((x,y)∈W) (I − J) [∂J/∂x, ∂J/∂y]^T = 0

where I(x, y, i) and J(x + d_x, y + d_y, i + 1) are abbreviated to I and J, respectively. Based on the Taylor formula, the first-order approximation of Equation (26) at [0, 0]^T can be obtained by:

ε(d_xy) ≈ Σ_((x,y)∈W) [ I − J − I_x d_x − I_y d_y ]²    (27)

Equation (27) can be further expressed as:

(1/2) ∂ε / ∂d_xy ≈ H_xy d_xy − Σ_((x,y)∈W) (I − J) [I_x, I_y]^T    (28)

where I_x, I_y, and H_xy are:

I_x = ∂J/∂x,  I_y = ∂J/∂y,  H_xy = Σ_((x,y)∈W) [ I_x², I_x I_y; I_x I_y, I_y² ]

Let Equation (28) be equal to zero, and the optimal solution d*_xy can be obtained by:

d*_xy = H_xy^(−1) Σ_((x,y)∈W) (I − J) [I_x, I_y]^T

Initialization should be applied to the database image set Φ_1, which means that the first database image in the set Φ_1 is defined as a breakpoint.
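The 2 × 2 normal-equation solve can be checked on a synthetic pair: J is I translated by a known sub-pixel amount, and one Lucas-Kanade step recovers the displacement. The quadratic test surface and the 0.7-pixel shift are illustrative choices, not data from the paper.

```python
def f(x, y):
    """Smooth synthetic intensity surface (illustrative choice)."""
    return 0.1 * x * x + 0.2 * y * y

def klt_step(size=21, true_dx=0.7):
    """One KLT step: accumulate H_xy and the right-hand side over the window,
    then solve the 2x2 system for the displacement [d_x, d_y]."""
    r = size // 2
    grid = range(-r, r + 1)
    # J(x, y) = f(x - true_dx, y): content shifted right by true_dx, so the
    # model I(x, y) = J(x + d_x, y + d_y) holds with d_x = true_dx, d_y = 0.
    h11 = h12 = h22 = b1 = b2 = 0.0
    for y in grid:
        for x in grid:
            i_val = f(x, y)
            j_val = f(x - true_dx, y)
            # central-difference gradients of J (exact for a quadratic)
            jx = (f(x + 1 - true_dx, y) - f(x - 1 - true_dx, y)) / 2
            jy = (f(x - true_dx, y + 1) - f(x - true_dx, y - 1)) / 2
            h11 += jx * jx; h12 += jx * jy; h22 += jy * jy
            diff = i_val - j_val
            b1 += diff * jx; b2 += diff * jy
    det = h11 * h22 - h12 * h12          # H_xy must be invertible (textured patch)
    dx = (h22 * b1 - h12 * b2) / det
    dy = (h11 * b2 - h12 * b1) / det
    return dx, dy
```

The recovered d_x is close to, but not exactly, 0.7 because the first-order Taylor expansion in Equation (27) is only an approximation; production trackers iterate this step (and use image pyramids) to converge.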
From the second database image in the set Φ_1, the KLT algorithm is employed to track local features between the adjacent images. Let k denote the index of the breakpoint image, and then the number of tracked features can be presented by the vector n_T = [n_T^(k+1), n_T^(k+2), ..., n_T^(k+k′), ...], where n_T^(k+k′) is the number of tracked features corresponding to the image with index k + k′. To determine the position of the breakpoint image, a threshold is required to monitor the number of tracked features. For the database image with index k + k′, the corresponding breakpoint image-detection threshold T_Tr^(k+k′) is defined in terms of a scale coefficient w_Tr, and the change rate r_Tr^(k+k′) of the number of tracked features is computed from the vector n_T. According to the change rate r_Tr^(k+k′) and the threshold T_Tr^(k+k′), it can be determined whether or not an image is a breakpoint image. Specifically, if r_Tr^(k+k′) ≤ T_Tr^(k+k′), the database image with index k + k′ is regarded as a breakpoint image, as shown in Figure 5. Breakpoint detection is applied to each scene-level clustering result, and then all breakpoint images in the database image set can be found. Database images between two breakpoint images, together with the front breakpoint image, are grouped into one cluster, achieving the sub-scene-level image clustering. The feature vector of the breakpoint image of each cluster is the cluster center. That is, the feature vector of the first image in the cluster is the cluster center (as shown in Figure 5). In the i-th scene-level cluster, the center of the j-th sub-scene-level cluster can be denoted by L_i^j, which is an ORB feature vector.
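A sketch of breakpoint detection and sub-scene grouping from the tracked-feature counts. Since the exact definitions of T_Tr and r_Tr are given by formulas not reproduced here, this version uses a stand-in rule: a new breakpoint starts whenever the fraction of features still tracked from the current breakpoint image falls to an assumed coefficient w_tr or below.

```python
def find_breakpoints(n_tracked, w_tr=0.5):
    """Sub-scene breakpoint detection sketch. n_tracked[i] is the number of
    features tracked from the current breakpoint image into image i. The
    decision rule (fraction of surviving features <= w_tr) is a stand-in
    assumption for the paper's threshold formula."""
    breakpoints = [0]                  # the first image of the cluster is a breakpoint
    base = n_tracked[0]
    for i in range(1, len(n_tracked)):
        rate = n_tracked[i] / base
        if rate <= w_tr:               # tracking has decayed: start a new sub-cluster
            breakpoints.append(i)
            base = n_tracked[i]
    return breakpoints

def sub_clusters(n_tracked, w_tr=0.5):
    """Group image indices between consecutive breakpoints into sub-clusters;
    each sub-cluster starts with its breakpoint image (the cluster center)."""
    bps = find_breakpoints(n_tracked, w_tr) + [len(n_tracked)]
    return [list(range(bps[k], bps[k + 1])) for k in range(len(bps) - 1)]
```

Resetting `base` at every breakpoint mirrors the paper's scheme of measuring tracking decay relative to the most recent breakpoint image rather than the first image of the scene.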

As shown in Figure 6, there are m clustering results in the first layer of the search tree, and each first-layer result, such as C_i^G, corresponds to more than one second-layer result (i.e., L_i^1, ..., L_i^{n_i}). In addition, the database images grouped into one sub-cluster are associated with the cluster center, such as L_i^{n_i}. Based on the organized multi-layer search tree, a hierarchical clustering-based image retrieval (HCIR) algorithm is proposed for visual localization.

Based on the multi-layer search tree shown in Figure 6, hierarchical image retrieval is applied to find the database image most similar to the query image. In the on-line stage, global and local features are extracted from the query image and uploaded to the server over wireless networks. The similarity between the query image and the centers of the scene-level clusters can then be defined in terms of G_Q, the Gist feature vector extracted from the query image, and C_i^G, a scene-level cluster center. By measuring the similarities between the global features of the query image and the scene-level cluster centers, the scene-level clusters can be ranked, and the query image retrieves each scene-level cluster in order. When the most similar scene-level cluster is found, the sub-scene-level clusters within it are sorted. Specifically, suppose a query image needs to find its most similar database image in the cluster C_i^G. In that case, the local features extracted from the query image are matched with each sub-scene-level cluster center, i.e., L_i^1, ..., L_i^{n_i}. The number of matched local features reflects the similarity between the query image and the sub-scene cluster centers, and for the cluster C_i^G, the sub-scene-level clusters are retrieved in order of this similarity. By this means, the scene-level clusters and sub-scene-level clusters can be ranked by the visual similarity between the query image and the cluster centers.

According to the ranked clusters, local feature matching is then executed in order between the query image and the database images in the third-layer clusters.
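The two ranking stages can be sketched as follows. Cosine similarity of Gist vectors for the first stage is an assumption (the paper does not specify the global similarity measure here), and all feature vectors and match counts are synthetic stand-ins.

```python
import numpy as np

# Minimal sketch of the two ranking stages in hierarchical retrieval.
# Cosine similarity for Gist vectors is an assumed measure; the ORB stage
# ranks sub-scene centers directly by matched-feature counts.

def rank_scene_clusters(query_gist, scene_centers):
    """Rank scene-level clusters by similarity of Gist vectors (descending)."""
    sims = [float(np.dot(query_gist, c) /
                  (np.linalg.norm(query_gist) * np.linalg.norm(c)))
            for c in scene_centers]
    return sorted(range(len(scene_centers)), key=lambda i: -sims[i])

def rank_subscene_clusters(match_counts):
    """Rank sub-scene clusters by the number of matched ORB features."""
    return sorted(range(len(match_counts)), key=lambda i: -match_counts[i])

query = np.array([1.0, 0.0, 0.0])
centers = [np.array([0.0, 1.0, 0.0]),   # orthogonal -> similarity 0
           np.array([0.9, 0.1, 0.0])]   # nearly parallel -> high similarity
print(rank_scene_clusters(query, centers))   # [1, 0]
print(rank_subscene_clusters([12, 48, 30]))  # [1, 2, 0]
```

The returned orderings determine the order in which the query descends into the second and third layers of the search tree.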
Let n_mat^{k_s} denote the number of matched features between the query image and the k_s-th database image. The feature matching ratio can then be defined as L_S^{k_s} = n_mat^{k_s} / n_Q, where 0 ≤ L_S^{k_s} ≤ 1 and n_Q is the number of ORB features extracted from the query image. The matched features are used in visual localization, and more matched features contribute to improving localization accuracy. In addition, more matched features indicate that the query image is closer to the database image. Therefore, the database image best matched with the query image is desired for estimating the position of the query camera in visual localization. Since the database images are captured successively, the trend of the feature-matching ratios is regular when the query image is matched against them in order. Specifically, if the query image and the database images are acquired in the same scene, the matching ratios first increase and then decrease: as the query image approaches the best-matched database image, the matching ratios increase until they reach a maximum, at which position the database image is best matched with the query image; after that, the distance between the query image and the best-matched database image grows, and the matching ratios decrease. Based on this analysis, if the maximum of the matching ratios can be found, the best-matched database image can be determined. This method of finding the best-matched database image is named the maximum similarity method in this paper.
To find the best-matched database image, a sliding window is set as shown in Figure 7. In a sliding window of size w_F + 1, the index of the image at the center is k_S. If the image with index k_S is determined to be the best-matched database image, its matching ratio L_S^{k_S} should be the maximum within the window and satisfy L_S^{k_S} ≥ r_mat, where r_mat ∈ [0, 1] is the threshold of the matching ratio. The threshold ensures that the database image and the query image are captured in the same scene.
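The sliding-window test can be sketched directly. The window size w_F + 1 and threshold r_mat below are illustrative values; the condition implemented (window-local maximum plus threshold) follows the description above.

```python
def find_best_match(ratios, w_f=4, r_mat=0.3):
    """Sliding-window maximum similarity sketch.

    ratios[k] is the feature-matching ratio of database image k. The image
    k_s at the center of a window of size w_f + 1 is accepted as the best
    match if its ratio reaches the threshold r_mat and is the maximum
    within the window. Returns the index, or None if no image qualifies
    (which would trigger the backtracking mechanism).
    """
    half = w_f // 2
    for k in range(half, len(ratios) - half):
        window = ratios[k - half:k + half + 1]
        if ratios[k] >= r_mat and ratios[k] == max(window):
            return k
    return None

ratios = [0.05, 0.10, 0.25, 0.42, 0.38, 0.20, 0.08]
print(find_best_match(ratios))  # 3: ratio 0.42 peaks inside its window
```

When `None` is returned, the caller would fall back to the next ranked cluster, as described for the backtracking mechanism.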
According to the ranking results of the scene-level and sub-scene-level clusters, the query image retrieves each cluster in order until it finds the best-matched database image. One situation may arise in image retrieval: the query image has been matched with all the database images of the most similar scene-level cluster, but the best-matched database image is still not found. Therefore, a backtracking mechanism is introduced in the proposed hierarchical retrieval. In this mechanism, when the best-matched database image cannot be found after comparison with all the database images in a scene-level cluster, the query image returns to the top of the search tree and continues to retrieve the next cluster according to the ranking results, and so on. In the worst case, all database images are compared with the query image and the best-matched database image is still not found; the database image with the maximal matching ratio is then taken as the best-matched image, although in this case the distance between the query image and that database image may be large.
By the proposed HCIR algorithm, the database image best matched with the query image can be found in the database, and this image has the following characteristics: (1) it is captured in the same scene as the query image; (2) there are a number of matched feature points between the query image and the database image. If the query image is considered to be coincident with the position of the best-matched database image, a preliminary position estimate of the query camera can be achieved. However, this position estimation method is subject to the acquisition density of the database images. To improve localization accuracy, the top-K best-matched database images are selected and used to estimate the position of the query image.

Visual Localization Based on Weighted KNN Method and Armijo-Goldstein Algorithm
In practical localization scenarios, images with higher similarity tend to be closer; that is, the more similar the query image and a database image are, the smaller the distance between the two images is. Following this idea, a weighted KNN-based visual localization method is proposed, in which a matched database image with higher similarity is assigned a larger weight. In visual localization, the similarity between images is evaluated by the number of matched feature points.
The top-K database images best matched with the query image are regarded as the nearest neighbors for estimating the query position, so the localization error function can be defined as f_e(p_Q) = Σ_{i=1}^{K} w_i ||p_Q − p_D^i||², where p_Q is the estimated position of the query image and p_D^i (1 ≤ i ≤ K) is the position of the i-th database image. The weight w_i can be calculated by w_i = n_mat^i / Σ_{j=1}^{K} n_mat^j, where n_mat^i denotes the number of matched feature points between the query image and the i-th database image.
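The weighting scheme can be sketched as follows; the normalized-count form of w_i and the weighted sum-of-squared-distances form of f_e are plausible readings of the text, not verbatim reproductions of the paper's equations, and the match counts and positions are synthetic.

```python
import numpy as np

# Sketch of the weighted KNN quantities: each of the top-K database images
# gets a weight proportional to its matched-feature count, and f_e is a
# weighted sum of squared distances to the K database positions.

def knn_weights(n_mat):
    """Normalized matched-feature counts as KNN weights (sum to 1)."""
    n = np.asarray(n_mat, dtype=float)
    return n / n.sum()

def localization_error(p_q, positions, weights):
    """f_e: weighted sum of squared distances to the K database positions."""
    p_q = np.asarray(p_q, dtype=float)
    return sum(w * np.sum((p_q - np.asarray(p)) ** 2)
               for w, p in zip(weights, positions))

w = knn_weights([30, 20, 50])
print(w)  # [0.3 0.2 0.5]
print(localization_error([0.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], w))  # 0.5
```

Because the weights sum to one, minimizing this f_e pulls the estimate toward the weighted centroid of the K neighbor positions, which is the intuition behind the weighted KNN estimate.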
For Equation (37), the Armijo-Goldstein algorithm is used to solve for the estimated position of the query image (i.e., the position of the query camera) [60]. The gradient vector of f_e at p_Q^k = (x_Q^k, y_Q^k) is g = ∇f_e(p_Q^k), and from the gradient vector the search direction s is determined by s = −g. The procedure of visual positioning can be treated as a line search, as shown in Algorithm 1. The count flag s_k, index t_m, maximum number of iterations k_max (= 5000), threshold σ_r (= 10⁻³), amplification coefficient γ (= 0.4), and step length d (= 0.01) are set as inputs.

Algorithm 1: Visual localization based on Armijo-Goldstein algorithm
Input: localization error function f_e, count flag s_k, index t_m, maximum number of iterations k_max, threshold σ_r, amplification coefficient γ, and step length d
Output: estimated position p_Q of the query camera
Step 1: set the initial values s_k = 0 and t_m = 0;
Step 2: calculate the norm of the direction vector, s_n = ||s||; if s_n > σ_r, go to Step 3, else go to Step 5;
Step 3: if f_e(p_Q^k + d^{t_m} s) < f_e(p_Q^k) + γ d^{t_m} gᵀs, then set s_k = t_m and go to Step 4, else set t_m ← t_m + 1 and repeat Step 3;
Step 4: update the position of the query camera by p_Q^{k+1} = p_Q^k + d^{s_k} s and set k ← k + 1; if k ≥ k_max, go to Step 5, else go to Step 1;
Step 5: determine the position of the query camera by p_Q^k, i.e., p_Q = p_Q^k.
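Algorithm 1 can be sketched as gradient descent with Armijo-style backtracking on f_e. This is a simplified reading: it resets the backtracking index each iteration rather than carrying s_k and t_m across iterations, and it assumes the weighted sum-of-squared-distances form of f_e; the parameter values match those quoted in the text (k_max = 5000, σ_r = 10⁻³, γ = 0.4, d = 0.01).

```python
import numpy as np

# Hedged sketch of Algorithm 1: line search on an assumed quadratic f_e.
# positions/weights are the top-K neighbors from the retrieval stage.

def localize(positions, weights, k_max=5000, sigma_r=1e-3, gamma=0.4, d=0.01):
    P = np.asarray(positions, dtype=float)
    w = np.asarray(weights, dtype=float)

    def f_e(p):                        # weighted sum of squared distances
        return float(np.sum(w * np.sum((p - P) ** 2, axis=1)))

    p = P[0].copy()                    # initialize at the top-ranked neighbor
    for _ in range(k_max):
        g = 2.0 * np.sum(w[:, None] * (p - P), axis=0)   # gradient of f_e
        s = -g                         # search direction (Step 2 of Alg. 1)
        if np.linalg.norm(s) <= sigma_r:
            break                      # direction small enough: stop
        t = 0                          # Armijo sufficient-decrease test (Step 3)
        while (f_e(p + (d ** t) * s)
               >= f_e(p) + gamma * (d ** t) * g.dot(s)) and t < 10:
            t += 1
        p = p + (d ** t) * s           # position update (Step 4)
    return p

# With weights summing to 1, the iterates converge to the weighted centroid.
print(localize([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.3, 0.2, 0.5]))
```

For this quadratic f_e the minimizer is the weighted centroid of the neighbor positions, so the printed estimate approaches [0.3, 0.2] for the toy inputs above.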
With the visual localization method, the estimated position p Q of the query camera can be achieved by solving the line search problem, in which the similarity between the query image and the database images is reflected by the weight w i . Therefore, the estimated position of the query camera is closer to the database images with high similarity.
In summary, visual localization is achieved in two steps: hierarchical clustering-based image retrieval and query camera position estimation. Since the proposed visual localization method dispenses with camera calibration, it can be widely used in different application scenarios and applied to various smart mobile terminals.

Performance Analysis on Hierarchical Image Retrieval
The proposed HCIR algorithm aims to decrease the on-line search time at the cost of the off-line image-clustering time. Moreover, database image clustering remains continuously effective: once the database image clustering is completed, the clustering results can be applied repeatedly to on-line image retrieval. The algorithm proposed in this paper achieves multi-layer image clustering. Compared with a single-layer clustering algorithm (i.e., one in which only scene-level image clustering is implemented), an advantage of the proposed algorithm is that the search time does not scale up with the database size. Next, the computational performance of the proposed algorithm and of the single-layer clustering-based algorithm is analyzed in detail. For clustering-based image retrieval, on-line time consumption comprises five parts: (1) the time t_G to extract the global features of the query image, (2) the time t_L to extract the local features of the query image, (3) the time t_GS to measure the similarity of global features between the query image and database images, (4) the time t_LM to match the local features between the query image and database images, and (5) the time t_FS to sort database images according to their similarity to the query image. Suppose there are n_L1 scene-level clusters in the first layer and each cluster contains n_L2 sub-scene-level clusters. The average running time T_SL of the single-layer clustering-based retrieval algorithm can then be expressed in terms of k_L1, the number of scene-level clusters that have been retrieved when the backtracking mechanism is enabled, and of m_L1 and m_L2, the numbers of database images in a scene-level cluster and in a sub-scene-level cluster, respectively. The average running time T_ML of the proposed image retrieval algorithm additionally involves k_L2, the number of sub-scene-level clusters that have been retrieved when the backtracking mechanism is enabled.
In this case, if the query image does not obtain a matched database image after retrieving a complete scene-level or sub-scene-level clustering result, the time consumption of the two processes is n_L1 t_LM and n_L2 t_LM, respectively. The total number m_total of database images satisfies m_total = n_L1 m_L1 and m_L1 = n_L2 m_L2, where m_L1 and m_L2 are the image numbers of a scene-level cluster and a sub-scene-level cluster, respectively.
For single-layer clustering-based image retrieval, the database images are matched with the query in order, so the average retrieval time for an image cluster is (1 + m_L1) t_LM / 2. Similarly, for the proposed algorithm, the average retrieval time for a sub-scene-level cluster is (1 + m_L2) t_LM / 2. According to the principle of multi-layer clustering, the image number m_L1 is far larger than m_L2. As a result, with the proposed HCIR algorithm the query image can find its matched database image within a cluster in less time. The proposed algorithm has three additional time overheads compared with the single-layer clustering algorithm: the time k_L2 m_L2 t_LM to retrieve the k_L2 results, the time n_L2 t_LM to match features against the sub-scene-level cluster centers, and the time t_FS to sort database images. However, in practical applications, the feature-sorting time is much shorter than the feature-matching time, and in most cases the value of k_L2 is zero. Therefore, the sum of the retrieval time (1 + m_L2) t_LM / 2 and the feature-matching time n_L2 t_LM is still less than the retrieval time (1 + m_L1) t_LM / 2. If the best-matched database image can be obtained without triggering the backtracking mechanism (i.e., k_L1 = k_L2 = 0), the difference Δt in time consumption between the single-layer clustering algorithm and the proposed algorithm is Δt = (1 + m_L1) t_LM / 2 − [(1 + m_L2) t_LM / 2 + n_L2 t_LM + t_FS]. Compared with the multi-layer clustering-based algorithm, the single-layer clustering-based algorithm has more database images in each cluster. Moreover, as a multi-layer clustering-based algorithm, the proposed HCIR algorithm preferentially retrieves the image clusters with high similarity to the query, so that the matched image can be found by searching fewer database images. Therefore, the structure of a multi-layer search tree is beneficial in reducing retrieval time consumption.
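The comparison above can be made concrete with a back-of-the-envelope count of feature-matching operations, assuming no backtracking (k_L1 = k_L2 = 0) and counting matches rather than wall-clock time. The cluster sizes are illustrative, not from the paper's experiments.

```python
# Average number of local-feature matches per query, per the analysis above.
# Single layer: scan one scene-level cluster sequentially.
# Multi layer: first match against the n_L2 sub-scene centers, then scan
# the selected sub-scene cluster sequentially.

def single_layer_matches(m_l1):
    return (1 + m_l1) / 2

def multi_layer_matches(n_l2, m_l2):
    return n_l2 + (1 + m_l2) / 2

m_l1, n_l2 = 40, 4          # 40 images per scene cluster, 4 sub-clusters
m_l2 = m_l1 // n_l2         # 10 images per sub-scene cluster
print(single_layer_matches(m_l1))        # 20.5 matches on average
print(multi_layer_matches(n_l2, m_l2))   # 9.5 matches on average
```

Even for this small example the multi-layer search roughly halves the expected matching work, and the gap widens as m_L1 grows, which is the point of the analysis.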

Experimental Results and Discussion
In this section, the hierarchical clustering-based image retrieval is implemented, and the computation performance of the retrieval algorithm is analyzed. In addition, the position accuracy of the visual localization is evaluated.

Experimental Results of Database Image-Clustering Algorithm
Two image databases (namely, the KTH image database [61] and the HIT-TUM image database) were used to evaluate the performance of the proposed algorithm. The images in the HIT-TUM database were acquired from the Harbin Institute of Technology and the Technical University of Munich. Each database contains 400 images captured in 10 different indoor scenes, such as an office, a corridor, a restaurant, and so on. All data processing was run on MATLAB 2018A with an Intel Core i7 CPU and 8GB RAM. Randomly selected example images in the databases are shown in Figure 8. It is worth noting that images in the databases are successively captured in indoor scenes, so the visual features extracted from the images captured in the same indoor scene have high correlations. For scene-level image clustering, database images are grouped by their global features based on the CUSUM change-point detection. The results of the scene-level clustering guide the query image to retrieve the clusters that are similar to the query. Therefore, the performance of database image clustering affects the efficiency of the image retrieval system. For an image retrieval system, the efficiency of the retrieval algorithm is reflected in two aspects: the number of searched database images and the time consumption of the image retrieval. Generally, the fewer database images are searched, the less time retrieval takes.
In the same indoor scene, as the database images are successively captured, the Gist features extracted from these images have high correlations. Taking advantage of the correlations, scene-level clustering of database images can be achieved. However, the noise generated in feature extraction affects the correlation of the features. Therefore, the original Gist features of database images need to be pre-processed (including Kalman filtering and Kalman smoothing) to restore the correlation of image features. For a database image, Gist features can be extracted according to different scales and directions. In this paper, Gist features are extracted at three scales and six directions, so 18 (3 × 6 = 18) feature elements are extracted from each image. That is, the Gist feature vector of each image contains 18 feature elements. Figure 9 shows an example of the Gist feature pre-processing result of a database image, including the original Gist feature values, Kalman filtering results, and Kalman smoothing results. It can be found from Figure 9 that the correlation between original feature values is not evident due to the influence of noise. In contrast, by Kalman filtering, noise is effectively suppressed, and the correlations between features are restored. According to Kalman filtering results, more obvious correlations can be obtained by further Kalman smoothing of Gist features. The pre-processing of features recovers the correlations of global features of database images, which is beneficial to scene-level clustering.
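The per-element pre-processing can be sketched with a minimal one-dimensional Kalman filter followed by an RTS (Rauch-Tung-Striebel) smoother, applied to one Gist feature element across the image sequence. The random-walk state model and the noise variances q and r below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Minimal 1-D Kalman filter + RTS smoother for one Gist feature element
# across a sequence of database images. Model and variances are assumed.

def kalman_smooth(z, q=1e-4, r=1e-2):
    n = len(z)
    x_f = np.zeros(n); p_f = np.zeros(n)      # filtered mean / variance
    x_p = np.zeros(n); p_p = np.zeros(n)      # predicted mean / variance
    x_f[0], p_f[0] = z[0], r
    for k in range(1, n):
        x_p[k], p_p[k] = x_f[k - 1], p_f[k - 1] + q   # predict (random walk)
        K = p_p[k] / (p_p[k] + r)                     # Kalman gain
        x_f[k] = x_p[k] + K * (z[k] - x_p[k])         # measurement update
        p_f[k] = (1 - K) * p_p[k]
    x_s = x_f.copy()                                  # RTS backward pass
    for k in range(n - 2, -1, -1):
        c = p_f[k] / (p_f[k] + q)                     # smoother gain
        x_s[k] = x_f[k] + c * (x_s[k + 1] - x_f[k])
    return x_s

rng = np.random.default_rng(0)
noisy = 0.5 + 0.1 * rng.standard_normal(50)   # noisy near-constant feature
smooth = kalman_smooth(noisy)
print(np.std(noisy), np.std(smooth))          # smoothing reduces the spread
```

In the paper's pipeline this would run independently on each of the 18 Gist feature elements, restoring the inter-image correlation that the change-point detector relies on.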

Scene-level database image clustering is achieved by detecting the change-points in the Gist feature sequences. When image retrieval is executed on the results of scene-level clustering, the query image preferentially searches, according to the similarity of the global features, the database image clusters with higher similarity.
Therefore, if all database images in the same scene are grouped into one cluster, a query image captured in this scene can find its matched database images in this cluster. In contrast, if a database image captured in a certain scene is falsely grouped into another cluster, this database image cannot be retrieved when the query searches the right cluster. Based on the above analysis, the core requirement of the proposed algorithm is that the database images of the same scene are grouped into one cluster as far as possible.
To analyze the performance of the proposed algorithm, scene-level image clustering is executed, and confusion matrices are employed to evaluate the clustering accuracy. The confusion matrices of the database image-clustering results are shown in Figure 10. The confusion matrix used to evaluate clustering accuracy in this paper can also be regarded as a clustering error matrix: the row labels of the matrix are the correct cluster labels, and the column labels are the predicted cluster labels. For the matrix in Figure 10a, the values in the third, fourth, and fifth rows of the fourth column are 1, 39, and 6, respectively. This set of values shows that of the 46 (1 + 39 + 6 = 46) database images grouped into this cluster, 39 were truly captured in the Office II scene, while one image actually captured in the Corridor II scene and six images actually captured in the Corridor III scene were misclassified into it. For a row of the confusion matrix, the sum of all values in that row represents the actual number of images in the cluster; for a column, the sum of all values in that column represents the predicted number of images in the cluster. The confusion matrix effectively reflects the performance of the proposed scene-level clustering algorithm. Observing the confusion matrix shows that incorrectly grouped images occur only between two adjacent image clusters, which accords with the clustering principle in this paper. Specifically, image clustering acts on the indoor database image sequence, and the change-points are detected from the global features of the database images, so that the images between two change-points are grouped into one cluster. Therefore, incorrect image grouping is caused by errors in change-point detection: if errors exist in change-point detection, some database images that should belong to a certain cluster are grouped into the preceding or following cluster.
In this paper, four criteria (i.e., recall rate, precision rate, accuracy rate, and F1 score) are used to evaluate the performance of the clustering algorithm. The recall rate RR_r is the ratio of the number of correctly grouped images to the actual number of images in that cluster. The precision rate PR_r is the ratio of the number of correctly grouped images to the number of images grouped into the cluster. The accuracy rate AR_r is the ratio of the number of correctly grouped images to the total number of images. The F1 score F1_s, used in statistics to measure the accuracy of a classification model, is calculated from the recall rate and the precision rate as F1_s = 2 · PR_r · RR_r / (PR_r + RR_r). Global features of an image include color features (such as color histogram features and color moment features) and texture features (such as wavelet transform features and Gabor transform features).
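The four criteria follow directly from the confusion matrix. The sketch below computes per-cluster recall, precision, and F1 plus the overall accuracy; the 3×3 matrix is a toy example, not the Figure 10 result.

```python
import numpy as np

# Per-cluster metrics from a confusion matrix whose rows are actual
# cluster labels and whose columns are predicted cluster labels,
# matching the conventions stated in the text.

def cluster_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    recall = np.diag(cm) / cm.sum(axis=1)     # row sums: actual counts
    precision = np.diag(cm) / cm.sum(axis=0)  # column sums: predicted counts
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.trace(cm) / cm.sum()        # correct / total images
    return recall, precision, f1, accuracy

cm = [[38, 2, 0],
      [1, 39, 0],
      [0, 6, 34]]
rr, pr, f1, ar = cluster_metrics(cm)
print(ar)  # 0.925
```

Averaging the per-cluster values of `rr`, `pr`, and `f1` gives the average recall rate, average precision rate, and average F1 score reported in Tables 1 and 2.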
In simulation experiments, Gist features, color histogram features, color moments, wavelet transform features, and Gabor features are used to perform scene-level clustering on the database images, and the experimental results are shown in Table 1, in which RR_r, PR_r, and F1_s denote the average recall rates, average precision rates, and average F1 scores, respectively. The results in Table 1 show that the color features (such as color histograms and color moments) perform weakly in scene-level clustering. The reason is that the color differences among indoor scenes are relatively small; especially in an environment with a white wall as the main background, it is not easy to distinguish scenes by color information. Compared with the color features, the texture features perform better in clustering, especially the Gabor features and Gist features. Because multiple Gabor filters with different scales and directions are used in extracting Gist features, Gist features describe the textures of scenes more comprehensively, thereby achieving more accurate image-clustering results. To reveal the clustering performance of the ICSCD algorithm proposed in this paper, two typical change-point detection algorithms (i.e., the mean shift-based algorithm [36] and the Bayesian estimation-based algorithm [62]) are also applied to group the database images at the scene level. The experimental results shown in Table 2 indicate that the proposed ICSCD algorithm significantly outperforms the Bayesian estimation-based algorithm on four metrics: the average recall rate, the average precision rate, the average F1 score, and the accuracy rate.
The reason is that the Bayesian estimation-based algorithm utilizes local features in change-point detection, but the local features are too sensitive to scene changes and tend to split database images belonging to the same scene into multiple image clusters or assign images belonging to the same class to other clusters. This also shows that the local features of images are more suitable for further subdividing the scene-level clustering results, which is why local features are used for the second layer of clustering in this paper. Both the proposed ICSCD algorithm and the mean shift-based algorithm use global features for clustering; the difference is that the proposed ICSCD algorithm employs the change-point detection function F_C to detect change points for image clustering, whereas the mean shift-based algorithm utilizes the mean shift function. Since the change-point detection function F_C takes into account both the values at the detection position and the expected values of the parameter models (i.e., parameter model A and parameter model B in the hypothesis test) within the sliding window (as shown in Figure 4), a higher clustering accuracy can be obtained. Specifically, the average recall rate, the average precision rate, the average F1 score, and the accuracy rate of the proposed ICSCD algorithm are all greater than 0.92, which is significantly higher than those of the mean shift-based algorithm.
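The ICSCD algorithm itself is not reproduced here, but the underlying idea of cumulative sum change-point detection on a feature sequence can be illustrated with a classic two-sided CUSUM detector. This is a simplified sketch on a 1-D signal (e.g., distances between the Gist features of consecutive database images); unlike ICSCD, it uses a fixed threshold and omits the sliding-window hypothesis models of the change-point detection function F_C.

```python
def cusum_change_points(x, drift=0.0, threshold=1.0):
    """Classic two-sided CUSUM detector on a 1-D sequence x.

    Returns the indices where a change (scene boundary) is flagged.
    After each detected change, the statistics and the running mean
    of the current segment are restarted.
    """
    changes = []
    mean, n = x[0], 1            # running mean of the current segment
    g_pos = g_neg = 0.0          # upward / downward cumulative sums
    for i in range(1, len(x)):
        v = x[i]
        g_pos = max(0.0, g_pos + (v - mean) - drift)
        g_neg = max(0.0, g_neg - (v - mean) - drift)
        if max(g_pos, g_neg) > threshold:
            changes.append(i)    # image i starts a new scene
            mean, n = v, 1
            g_pos = g_neg = 0.0
        else:
            n += 1
            mean += (v - mean) / n
    return changes
```

A flat sequence that jumps from one level to another is split exactly at the jump, which mirrors how a scene switch shows up as a sustained shift in the global-feature statistics.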

Experimental Results of Hierarchical Image Retrieval and Visual Localization
In the proposed HCIR algorithm, the best-matched database image is determined by the maximum similarity method, and the validity of this method needs to be verified by experiments. Since the best-matched image is the database image that is most similar to the query image, the database image I_GS with the highest matching similarity to the query image is found by a global search, and the index of this database image is denoted k_B^i. In addition, another best-matched database image I_MS is determined by the proposed maximum similarity method, and its index is denoted k_M^i. The average error ε_index of the index positions of the best-matched database images is calculated by ε_index = (1/n_TQ)·Σ_{i=1}^{n_TQ} k_BM^i, where k_BM^i = |k_B^i − k_M^i| is the index error of the best-matched database image in the i-th experiment, and n_TQ is the total number of query images used in the experiments.
Based on the average error ε_index, the average distance error between the best-matched database images I_GS and I_MS can further be defined as ε_dis = ε_index·d_D, where d_D is the fixed acquisition distance of database images. The average index error ε_index and the average distance error ε_dis reflect the performance of retrieving the best-matched database images with the maximum similarity method: the smaller the values of ε_index and ε_dis, the closer the matched database images are to the query image. The results of matched image retrieval are shown in Table 3 under the condition that d_D is set to 10 cm. In the results, the ratio r_S^j (j = 0, 1, 2) is the percentage of experimental results that satisfy k_BM^i = j. In addition, the average value s_L of the matching similarity and the average number n_mat of matched feature points are also calculated in the experiments. The experimental results shown in Table 3 indicate that for image retrieval experiments with the maximum similarity method conducted separately on the KTH database and the HIT-TUM database, the probability of successfully retrieving the matched database image (i.e., the case of r_S^0) exceeds 94%. In the other cases, although the similarity between the matched image and the query image does not reach the maximum value, the index errors are less than 2, which indicates that the matched image is close to the query image and that there are enough matched features between the matched image and the query image. Therefore, the matched database images in the cases of r_S^0, r_S^1, and r_S^2 can all be used for visual localization. For the two databases, the average index error ε_index is less than 0.1 and the average distance error is less than 1 cm, showing the effectiveness of the maximum similarity method in determining matched database images.
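The two error measures can be computed directly from the retrieved indices. A minimal sketch, with illustrative names and the index error taken as the absolute index difference:

```python
def average_index_error(idx_global, idx_maxsim):
    """eps_index: mean absolute index difference between the best match
    found by global search (idx_global) and by the maximum similarity
    method (idx_maxsim), over n_TQ query images."""
    assert len(idx_global) == len(idx_maxsim)
    errs = [abs(b - m) for b, m in zip(idx_global, idx_maxsim)]
    return sum(errs) / len(errs)

def average_distance_error(eps_index, d_D=0.10):
    """eps_dis = eps_index * d_D, with d_D the fixed acquisition
    distance between consecutive database images (10 cm in Table 3)."""
    return eps_index * d_D
```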
Moreover, the experimental results also show that for different databases, the average matching similarity between the query image and the best-matched database image is greater than 0.5, and there are more than 120 pairs of matched feature points between the query image and the best-matched database image, which provides a fundamental guarantee for visual localization.
In the proposed HCIR algorithm, the scene-level clustering results are first sorted based on the global feature similarity, and then the database images in the sub-scene clusters are sorted based on the local feature similarity. After the two-stage sorting, the query image is matched with database images according to the sorting result. From this process, it is known that when retrieving the scene-level clustering results based on the global feature similarity, the best case is to obtain the matched image in the first clustering result, and the worst case is to obtain the matched image only after all the results have been searched. Therefore, for scene-level image retrieval, the success rate of image retrieval within the top-K clusters is proposed in this paper to evaluate the performance of the clustering algorithm in image retrieval. Specifically, after scene-level clustering of database images, more than one image cluster is obtained. If the matched database image can be retrieved after searching K image clusters, image retrieval is considered to be achieved within K database image clusters. For a total of n_Q query images, if the matched database images of n_K query images are retrieved within the top-K clusters, the success rate of the top-K clusters is defined as 100·n_K/n_Q%. The success rate effectively reflects the impact of the scene-level clustering algorithm on the performance of image retrieval. Scene-level clustering algorithms for database images can be divided into two categories: one detects change points of visual features (such as the proposed HCIR algorithm, the mean shift-based algorithm, and the Bayesian estimation-based algorithm), and the other clusters a fixed number of database images (such as the C-GIST algorithm [37]).
In the C-GIST algorithm, five consecutive database images are grouped into one cluster, and the cluster center is the feature vector of the image located at the center position of each cluster. In this paper, the two categories of image-clustering algorithms are simulated, and the success rate of the top-K clusters is calculated; the results are presented in Table 4. The results in Table 4 indicate that the proposed HCIR algorithm is beneficial to improving the success rate of the top-K clusters for a query image. For the two databases, the success rates of the top-five clusters achieved by the HCIR algorithm, the mean shift-based algorithm, and the Bayesian estimation-based algorithm are all more than 90%. At the same time, it is not difficult to find that the HCIR algorithm has a more obvious performance advantage. Especially in the KTH database, the success rate of the first cluster is 66.75%, and the success rate of the top-five clusters reaches 99.75%. For both the HIT-TUM database and the KTH database, the best-matched database image can be retrieved within the top-five clusters by the HCIR algorithm. In addition, for sub-scene-level image clustering, the success rates of the top-K clusters are also calculated and recorded. Experimental results show that for the HCIR algorithm, the success rate of the first cluster is more than 88%, which indicates that in most cases, the best-matched database image can be found in the first sub-cluster.
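The top-K success rate defined above can be sketched as follows, assuming the rank (in similarity order) of the cluster that contains each query's matched image is already known; the input representation is our assumption, not the paper's.

```python
def topk_success_rates(match_cluster_ranks, k_max):
    """match_cluster_ranks[i] is the 1-based rank, in similarity order,
    of the cluster containing query i's matched database image.
    Returns the cumulative top-K success rate (in %) for K = 1..k_max,
    i.e., 100 * n_K / n_Q for each K."""
    n_q = len(match_cluster_ranks)
    return [100.0 * sum(1 for r in match_cluster_ranks if r <= k) / n_q
            for k in range(1, k_max + 1)]
```

Because the rate is cumulative, it is non-decreasing in K, matching the interpretation that the worst case searches all clusters.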
To verify the image retrieval efficiency of the proposed HCIR algorithm, image retrieval experiments are performed with the HCIR algorithm and the comparison algorithms. In the experiments, the mean shift-based algorithm, the Bayesian estimation-based algorithm, and the C-GIST algorithm are single-layer clustering algorithms. In addition, another two multi-layer clustering algorithms are considered: the mean shift-KLT algorithm and the Bayesian estimation-KLT algorithm. For these two multi-layer clustering algorithms, database images are first grouped by the mean shift-based algorithm or the Bayesian estimation-based algorithm, and then the images are further grouped by the KLT algorithm. According to the average numbers of retrieved images shown in Table 5, the multi-layer clustering algorithms have higher retrieval efficiency, and the number of similarity comparisons (i.e., the processes of feature matching) can be limited to 10% of the database size. The reason is that database images are only grouped into scene-level clusters by the single-layer algorithms, so the query image needs to be matched with the database images in the scene-level cluster one by one. In contrast, in the multi-layer algorithms, database images are further grouped on the basis of the scene-level image clusters. Then, according to visual similarities, the image clusters are ranked, and the query image is preferentially matched with the database images in the most similar cluster. Therefore, the multi-layer algorithms perform better in terms of the average number of retrieved database images. It can be observed from the experimental results in Table 5 that fewer database images are retrieved by the HCIR algorithm than by the other two multi-layer algorithms. The reason is that the ICSCD algorithm performs better at scene-level clustering (as shown in Table 2), so the cluster centers can better express the global features of the images in each cluster.
Table 6 shows the average running time of the image-retrieval system using different clustering algorithms. By comparing the number of retrieved images with the average running time of image retrieval, it can be seen that the more database images are retrieved, the more running time image retrieval consumes. The experimental results shown in Tables 5 and 6 indicate that more database images are retrieved by the single-layer clustering algorithms, leading to larger time overheads than the multi-layer clustering algorithms. It is obvious that the HCIR algorithm has advantages in terms of both the number of retrieved database images and the running time of image retrieval. The reason is that multi-layer clustering of database images is employed in the HCIR algorithm and, more importantly, the ICSCD algorithm employed in the proposed retrieval algorithm achieves a better performance in scene-level database image clustering. Table 7 shows the average running time of the different stages of image retrieval. In the practical implementation of image retrieval, hundreds of local features need to be matched between the query image and a database image, so most of the time consumption appears at this stage. To reveal the performance difference between the single-layer clustering algorithms and the proposed HCIR algorithm, ten indoor scenes are used for simulation, the number of images per scene m_L1 is separately set to 100, 200, 400, 800, and 1000, and the running time of image retrieval for the different database sizes is then simulated, as shown in Figure 11. As a precondition of the simulation, the backtracking mechanism is never triggered. According to the simulation results, the advantage of the proposed algorithm is that the running time does not linearly increase with the growth of the database size. Even when the database size is increased to 10,000 images, the running time of image retrieval is less than 110 ms.
In this case, the running time of image retrieval corresponding to the single-layer clustering-based algorithm almost reaches 500 ms, which means that only two retrievals can be performed per second. In contrast, with the proposed HCIR algorithm, image retrieval can be executed nine times per second when there are 10,000 images in the database. The reason that the proposed algorithm spends less time on image retrieval is that database images are reasonably grouped in the off-line stage. Furthermore, on a deeper level, time for image clustering is sacrificed in the off-line stage to reduce the time consumption of image retrieval in the on-line stage.

To demonstrate the performance of the proposed WKNN algorithm, two typical image retrieval-based localization methods (i.e., the NN method [49,54,55] and the KNN method [56,57]) are selected and implemented. Each image in the KTH database and the HIT-TUM database is employed as a query image for visual localization. For the proposed WKNN method and the typical KNN method, five nearest neighbors are selected to estimate the query position [57]. To reveal the impact of the image acquisition interval d_in on localization accuracy, database images with different acquisition intervals are used in the experiments. Localization errors of the query images are calculated, and the cumulative distributions of the errors are shown in Figure 12.

To quantitatively analyze the performance improvement of the WKNN method, an accuracy improvement rate r_im is introduced and defined as r_im = (e_c − e_p)/e_c × 100%, where e_p and e_c are the average errors of the proposed WKNN method and the comparative method, respectively. Compared with the NN and KNN methods, the proposed WKNN method achieves better localization accuracy, as shown in Table 8. In all experimental cases, the improvement of the average localization accuracy reaches at least 22% and 34% compared with the KNN and NN methods, respectively. From the localization results, it can be found that when the database images are more densely captured, the advantage of the proposed method in terms of localization accuracy over the two other localization methods is more obvious. The reason is that when the intervals of database images are large, there are few common visual features between the query image and the database images, which weakens the contributions of the weights in the WKNN method.
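The accuracy improvement rate reduces to a one-liner; the defining formula r_im = (e_c − e_p)/e_c × 100% is reconstructed from the surrounding text, so treat the exact normalization as an assumption.

```python
def improvement_rate(e_p, e_c):
    """r_im: relative reduction of the average localization error of the
    WKNN method (e_p) with respect to a comparative method (e_c), in %."""
    return 100.0 * (e_c - e_p) / e_c
```

For instance, halving the average error of a baseline corresponds to an improvement rate of 50%.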
As illustrated in Table 8, when the database image acquisition intervals are set to 10 cm, 20 cm, 30 cm, 40 cm, and 50 cm, the average localization errors of the WKNN method are 0.0490 m, 0.1299 m, 0.2604 m, 0.3673 m, and 0.5048 m, respectively. The results indicate that the localization error increases along with the database image acquisition interval. Even if the acquisition interval is increased to 50 cm, sub-meter localization accuracy can be achieved by the proposed method, which satisfies the requirements of most indoor location-based services. However, it is worth noting that acquisition intervals that are too small lead to a large off-line image database and further result in a high time overhead of image retrieval. Therefore, when designing a visual indoor localization system, a proper database image acquisition interval should be selected by striking a balance between localization accuracy and efficiency.

Discussion
In the visual localization system, an off-line database generally contains a large number of images for position estimation. For example, over 40,000 database images were captured over a distance of 4.5 km in the TUMindoor localization system, which means that database images were acquired at approximately 10 cm intervals [26]. With the traditional image retrieval strategy, the query image is exhaustively compared with each database image, which is not scalable for a large-scale database. Recently, clustering-based hierarchical image retrieval has been proposed and applied in large-scale image retrieval [63][64][65]. The main advantage of hierarchical image retrieval is creating an indexing strategy by grouping the images based on the visual cues before retrieval, so that only the relevant clusters are examined in the retrieval process. With this strategy, clustering-based hierarchical image retrieval significantly speeds up the search process at the expense of the time consumption of image clustering beforehand.
However, although the existing works on hierarchical image retrieval achieve high search efficiency, they are unsuitable for visual localization. The reason is that geographic factors are not taken into consideration in image clustering, so database images acquired in the same scene are not necessarily grouped into one cluster. In a visual localization system, the query image should be compared with database images in the relevant scenes in order of visual similarity; specifically, the query image should be compared with similar database images first. In this way, the query image need not be compared with all database images, and the retrieval result can be obtained quickly.
Considering the particular requirements of image retrieval in visual localization, a hierarchical clustering-based image retrieval (i.e., HCIR) algorithm is proposed in this paper to organize database images and achieve image retrieval. The main contribution of the HCIR method is that database images are orderly grouped into clusters by visual cues in accordance with their geographical distribution characteristics. Since the database images for visual localization are successively captured by the mapping equipment in indoor scenes, visual features within the same scene have high correlations. However, once the mapping equipment switches to another scene, the correlations decrease accordingly. Taking advantage of this characteristic of database images, an ICSCD algorithm is presented to group database images into clusters at the scene level. Moreover, an image clustering algorithm based on KLT feature tracking is proposed to group database images at the sub-scene level. With the ICSCD algorithm and the KLT feature tracking-based algorithm, the visual features of different scenes and sub-scenes can be described by the cluster centers. In the retrieval process, the query image is first compared with the cluster centers, then the clusters with the largest similarity to the query are selected, and the images in these clusters are compared with the query. By this means, the database images with high similarities to the query are preferentially retrieved, thus reducing the time consumption of image retrieval.
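The retrieval process described above can be sketched as a two-level search. All names, the feature representation, and the early-stop threshold below are illustrative assumptions, and the full HCIR backtracking mechanism is reduced here to simply falling through to the remaining clusters when no confident match is found.

```python
def hierarchical_retrieve(query_g, query_l, scenes, sim_g, sim_l):
    """Two-level retrieval sketch.

    scenes -- list of (scene_center_g, sub_clusters); each sub-cluster
              is (sub_center_l, images)
    sim_g / sim_l -- global / local similarity functions returning
              values in [0, 1]
    Scene clusters are visited in decreasing global similarity and
    sub-clusters in decreasing local similarity; the search stops early
    once a confidently matched image is found."""
    ranked_scenes = sorted(scenes, key=lambda s: sim_g(query_g, s[0]),
                           reverse=True)
    best, best_sim = None, -1.0
    for _, subs in ranked_scenes:
        ranked_subs = sorted(subs, key=lambda c: sim_l(query_l, c[0]),
                             reverse=True)
        for _, images in ranked_subs:
            for img in images:
                s = sim_l(query_l, img)
                if s > best_sim:
                    best, best_sim = img, s
            if best_sim > 0.5:       # assumed confidence threshold
                return best, best_sim
    return best, best_sim            # fell through to the remaining clusters
```

With scalar stand-ins for features and a toy similarity such as `sim(a, b) = 1 / (1 + |a - b|)`, only the most similar scene's images are compared before the search stops, which is the source of the efficiency gain.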
Compared with the existing change-point detection algorithms (i.e., the mean shift-based algorithm [36] and the Bayesian estimation-based algorithm [62]), the proposed ICSCD algorithm achieves a better scene-level clustering performance on different evaluation criteria, such as the average recall rate, the average precision rate, the average F1 score, and the accuracy rate. The reason is that the typical CUSUM algorithm is improved to target change-point detection on database image sequences without threshold selection. In addition, the utilization of appropriate global visual features and the definition of the change-point detection function also contribute to raising the clustering accuracy. On the basis of the scene-level clustering results, the KLT feature tracking-based clustering algorithm is employed to further group database images, and a multi-layer search tree can then be generated for image retrieval. A distinguishing characteristic of the search tree is that the database image clusters attached to the tree are determined by both geo-information and visual cues. The query image is first compared with the cluster centers of the scenes with high similarity, which boosts the retrieval efficiency. The experimental results indicate that image retrieval with our multi-layer search tree is more efficient than with other single-layer and multi-layer clustering algorithms, such as the mean shift-based [36], C-GIST [37], Bayesian estimation-based [62], Bayesian estimation-KLT, and mean shift-KLT algorithms. The multi-layer clustering algorithms outperform the single-layer algorithms because the KLT feature tracking-based algorithm subdivides the scene-level clusters and groups images at the sub-scene level, thereby reducing the number of comparisons between the query and database images.
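The sub-scene grouping idea can be illustrated under a simplifying assumption: the number of features successfully tracked by KLT between consecutive database images has been precomputed (e.g., with pyramidal Lucas-Kanade optical flow), and a new sub-cluster starts whenever that number falls below a threshold. Both the threshold and the split rule are our assumptions, not the paper's exact criterion.

```python
def subscene_clusters(track_counts, min_tracks=50):
    """Group n consecutive images into sub-scene clusters.

    track_counts[i] -- number of features successfully tracked from
    image i to image i+1 (assumed precomputed by KLT tracking).
    A new sub-cluster starts whenever tracking drops below min_tracks;
    returns lists of image indices, covering all n images in order."""
    clusters, current = [], [0]
    for i, c in enumerate(track_counts):
        if c < min_tracks:          # tracking broke down: new sub-scene
            clusters.append(current)
            current = [i + 1]
        else:
            current.append(i + 1)
    clusters.append(current)
    return clusters
```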
Due to the higher accuracy of the ICSCD algorithm on scene-level image clustering, the proposed HCIR algorithm outperforms other multi-layer clustering algorithms on retrieval efficiency, as shown in Tables 5 and 6.
Visual localization without camera-intrinsic parameters is essential in indoor navigation and augmented reality. Existing works mainly adopt the NN method [49,54,55] and the KNN method [56,57] to solve for the estimated position of the query. However, the effect of the similarity between the query image and the database images on localization is taken into consideration in neither the NN method nor the KNN method. Therefore, a weighted KNN method is proposed, and the Armijo-Goldstein algorithm is employed to calculate the position of the query camera. The highlight of the weighted KNN method is that a novel localization error function is defined in consideration of visual similarity. Specifically, a matched database image with a higher similarity to the query is assigned a larger weight in the localization error function, so that the estimated position is drawn toward similar database images. As discussed earlier, the drawback of the NN method is that the position of the most similar database image is assigned to the query, but that database image may be far from the query. In the KNN method, each neighbor database image has the same weight, which is not in accordance with the actual localization situation. The weighted KNN method overcomes these drawbacks through similarity comparison: neighbor database images with higher similarity are given larger weights. Experimental results show that the weighted KNN method achieves a better localization accuracy than the NN and KNN methods. For a typical database image acquisition interval (i.e., approximately 10 cm [26]), the average localization errors are limited to 5 cm, outperforming the NN and KNN methods. The proposed WKNN localization method has the potential to be embedded in navigation systems for visually impaired users and in shopping guide applications.
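The weighted localization idea can be sketched by minimising a similarity-weighted squared-error function with gradient descent and an Armijo backtracking line search. The error function E(p) = Σ_i w_i·‖p − p_i‖² is our illustrative choice (its minimiser is simply the weighted centroid of the neighbor positions); the paper's actual error function and its exact Armijo-Goldstein step-size conditions may differ.

```python
def wknn_position(neighbors, weights, iters=50):
    """Estimate a 2-D query position from K neighbor positions and
    similarity weights by minimising E(p) = sum_i w_i * ||p - p_i||^2
    via gradient descent with an Armijo backtracking line search."""
    def energy(p):
        return sum(w * ((p[0] - q[0])**2 + (p[1] - q[1])**2)
                   for q, w in zip(neighbors, weights))

    # start from the unweighted centroid of the neighbors
    p = [sum(q[0] for q in neighbors) / len(neighbors),
         sum(q[1] for q in neighbors) / len(neighbors)]
    for _ in range(iters):
        # gradient of E: 2 * sum_i w_i * (p - p_i)
        g = [2 * sum(w * (p[d] - q[d]) for q, w in zip(neighbors, weights))
             for d in (0, 1)]
        t, e0 = 1.0, energy(p)
        # Armijo condition: sufficient decrease along -g
        while energy([p[0] - t * g[0], p[1] - t * g[1]]) \
                > e0 - 1e-4 * t * (g[0]**2 + g[1]**2):
            t *= 0.5
            if t < 1e-12:            # no acceptable step: converged
                return tuple(p)
        p = [p[0] - t * g[0], p[1] - t * g[1]]
    return tuple(p)
```

For two neighbors at (0, 0) and (1, 0) with weights 1 and 3, the estimate is pulled toward the more similar neighbor, landing at the weighted centroid (0.75, 0).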

Conclusions
In this paper, a hierarchical clustering-based image retrieval algorithm is presented, in which database images are grouped by improved cumulative sum change-point detection and KLT feature tracking. Taking advantage of the hierarchical clusters, database images that are similar to the query image are matched with the query first, which effectively reduces the time consumption of image retrieval. For different indoor image databases, the number of similarity comparisons can be limited to 10% of the database size. Simulation experiments also indicate that the running time of image retrieval does not linearly increase with the size of the image database. Compared with other single-layer and multi-layer clustering-based image retrieval algorithms, the proposed HCIR algorithm executes fewer similarity comparisons and incurs lower time overheads. Under the premise of ensuring image retrieval accuracy, the proposed visual localization system achieves high operational efficiency. With the proposed localization method, the position estimation error is limited to 5 cm when the database image acquisition interval is set to 10 cm. Future work will focus on improving the accuracy of visual localization, especially on reducing the impact of illumination changes on localization.