A Precise Indoor Visual Positioning Approach Using a Built Image Feature Database and Single User Image from Smartphone Cameras

Abstract: Indoor visual positioning is a key technology in a variety of indoor location services and applications. The particular spatial structures and environments of indoor spaces make them challenging scenes for visual positioning. To address the existing problems of low positioning accuracy and low robustness, this paper proposes a precise single-image-based indoor visual positioning method for smartphones. The proposed method includes three procedures: First, color sequence images of the indoor environment are collected in an experimental room, from which an indoor precise-positioning-feature database is produced, using a classic speeded-up robust features (SURF) point matching strategy and multi-image spatial forward intersection. Then, the relationships between the SURF feature points of the smartphone positioning image and the 3D object points are obtained by an efficient similarity feature description retrieval method, in which a more reliable and correct matching point pair set is obtained, using a novel matching error elimination technique based on Hough transform voting. Finally, efficient perspective-n-point (EPnP) and bundle adjustment (BA) methods are used to calculate the intrinsic and extrinsic parameters of the positioning image, and the location of the smartphone is obtained as a result. Compared with the ground truth, the experimental results indicate that the proposed approach can be used for indoor positioning, with an accuracy of approximately 10 cm. In addition, experiments show that the proposed method is more robust and efficient than the baseline method in a real scene. In the case where sufficient indoor textures are present, it has the potential to become a low-cost, precise, and highly available indoor positioning technology.


Introduction
Positioning is one of the core technologies used in location-based services, augmented reality (AR), internet of everything, customer analytics, guiding vulnerable people, robotic navigation, and artificial intelligence applications [1][2][3][4][5]. At present, outdoor GNSS-based smartphone positioning services can achieve centimeter-level accuracy. However, GNSS signals are unavailable indoors, and it is still difficult to achieve low-cost, high-availability, and high-precision indoor positioning effects with the existing indoor positioning technology [1,3,5]. In this area, vision-based indoor positioning of smartphones is an important indoor positioning technology, which does not require much extra consumption to change the indoor environment and only needs to use the existing decorative texture information in a room. Vision-based indoor positioning has the advantages of strong practicality and wide coverage [2][3][4][5][6]; moreover, it is an efficient expansion of positioning technologies based on Bluetooth/iBeacon [7], WIFI [8], UWB [9,10], PDR [11], INS [12], and Geomagnetic Fields [13], with the benefits of better accuracy and lower cost.
Visual localization has become an emerging research hotspot in the field of indoor positioning [1][2][3][4][5][6][14][15][16][17][18][19][20]. Most state-of-the-art methods [1,2,4,6][14][15][16][17][18] rely on local features such as SIFT or SURF [21,22] to solve the problem of image-based localization. These methods usually contain two steps: establishing 2D-3D matches between features extracted from the positioning image and 3D points via descriptor matching, and then using perspective-n-point (PnP) to calculate the extrinsic parameters. Pose estimation can only succeed if enough appropriate matches have been found in the first step; otherwise, the positioning fails. Recently, some new approaches have tackled the problem of localization with end-to-end learning. They formulate localization as a classification problem with a deep-learning architecture, where the current position is matched to the best position in the training set [19,20]. Rather than precomputing feature points and building a 3D point model, as done in classical feature-based matching localization methods, they can handle hard scenarios with textureless areas and repetitive structures. However, in regions with sufficient texture, visual positioning methods based on feature matching have advantages, especially in positioning accuracy. In indoor visual positioning approaches based on invariant local image features, according to the research content and characteristics of the visual positioning technology, there are two key problems at the algorithm level: (1) how to calculate the precise spatial pose of the positioning image robustly and rapidly; and (2) how to generate a high-precision positioning feature library. When no additional observation sources (such as WIFI, INS, or geomagnetic measurements) are available, many problems remain to be solved in these two aspects. 
This makes the application of visual positioning in indoor scenes more difficult than in outdoor environments, especially because of the image-feature mismatches caused by the lack of decorative texture or by texture repetition in indoor scenes. In this paper, we first modify and extend a method for establishing an indoor positioning feature database based on existing classical matching algorithms and strategies; our main aim is to use an epipolar constraint based on the fundamental matrix and a matching image screening strategy based on image overlap during construction of the positioning feature database. Then, the Kd-Tree+BBF image feature retrieval strategy (K-dimensional tree, Kd-Tree; best bin first, BBF) is introduced to improve the retrieval efficiency of the positioning image features, and the PROSAC algorithm is used (instead of the commonly used RANSAC algorithm) for matching optimization. In addition, the final matching result is further optimized by our proposed novel mismatch elimination method based on the Hough transform voting idea, thus improving the matching precision and the speed of obtaining corresponding feature points. Finally, the efficient PnP algorithm and bundle adjustment are used to solve the camera pose accurately. The paper proceeds with a review of visual positioning methods and related works in Section 2. The theory and methodology of the proposed visual positioning method are explained in Section 3. The experimental design and the evaluation results are discussed in Sections 4 and 5, followed by conclusions in Section 6.

Related Work
With the rapid development of photogrammetric computer vision and optical camera technology, it is possible to achieve fast and economical image acquisition; precise and efficient image feature extraction and image matching; and quick solution of the projection matrix and external orientation elements. Moreover, image-based visual positioning has the characteristics of good visualization effect, context-rich information, and better precision. Thus, it has potential as a low-cost, accurate, active indoor positioning technology. Therefore, visual positioning technologies have been widely studied by international researchers. Generally, methods for image-based visual positioning consist of two steps: establishing a visual location feature library for place recognition and using perspective-n-point (PnP) for camera pose estimation [1,2]. Many algorithms and solutions have emerged in applications in different fields. For example, [1] presented a two-step pipeline for performing image-based positioning of mobile devices in indoor environments. In the first step, it generated a sparse 2.5D georeferenced image database; in the second step, a query image was matched against the image database to retrieve the best-matching database image. In [2], an accurate indoor visual positioning method was proposed for smartphones, based on a high-precision 3D photorealistic map using PnP algorithms. It focused, in particular, on the research and comparison of camera pose estimation in the case of unknown mobile phone camera internal parameters. Similarly, [3] proposed a smartphone indoor positioning dynamic ground truth reference system using robust visual encoded targets for the real-time measurement of smartphone indoor positioning technologies, providing a new low-cost and convenient method for direct ground truth measurement in the research of smartphone indoor positioning technologies. 
In [5], a localization method was carried out by matching image sequences captured by a camera, using a 3D model of the building in a model-based visual tracking framework. The works [14] and [15] studied and proposed a wide baseline matching technique based on the SIFT algorithm to improve the accuracy of image matching between the positioning image and the database image. In [16], visual features of the identification images taken from a location space were extracted by studying the SIFT-based bag-of-words retrieval technology, and then matched with massive images in the database to realize indoor visual positioning. In [17], a spatial visual self-localization method based on mobile platforms in urban environments was proposed, which was useful for exploring high-precision visual positioning of smartphones in outdoor spaces. In addition, with the emergence of some fast image matching algorithms (e.g., ORB, SURF, and so on) and clustering algorithms (e.g., Random forests, SVM, and so on), the real-time performance of visual positioning methods has been studied more and more [18][21][22][23][24][25]. In [26], the PnP method was used to solve the motion of a calibrated camera through a set of n 3D points in the world and their corresponding 2D projections in the image. A continuous camera pose estimation method for indoor monocular cameras was proposed, which improved the camera pose estimation accuracy. However, further research on the establishment of high-precision 3D maps and rapid image retrieval from the location image database is still needed. Simultaneously, research on SfM and SLAM technologies has provided a lot of reference for the establishment of high-precision positioning feature libraries, storage and retrieval of the information in them, and accurate solution of the camera pose. In [27][28][29], the authors considered how to use SfM to solve the projective transformation matrix and camera parameters more robustly. 
In [30][31][32][33][34][35], a variety of visual SLAM schemes were proposed. A large number of algorithms for continuous camera-pose estimation and global optimization in indoor environments have been studied, and vision-based indoor real-time 3D mapping and positioning technologies have gradually been improved. However, they generally require continuous input data and occupy a large amount of the computing resources in smartphones, and they have generally been used only in a small range of VR/AR applications. In [36][37][38][39][40][41], RGB-D depth cameras were used to study high-precision real-time indoor 3D surface model reconstruction and mapping technologies. The RGB-D depth camera can provide depth information directly to the sensor while acquiring images, which improves the ability of color camera-pose estimation. It has been widely used in carriers, such as robots. In [42][43][44][45][46][47][48], the latest image feature extraction and image retrieval technologies were discussed, along with an analysis of the state-of-the-art methods for image location recognition, using deep learning and visual positioning based on traditional image features. It was concluded that the positioning success rate of neural network models based on deep-learning training needs to be improved and that it is difficult for the positioning accuracy to reach the decimeter level. Furthermore, the model training time was long, so the portability to different scenarios is limited. In addition, these methods have higher hardware occupation and requirements. However, there were advantages for specific objects, or when the training data were sufficient. Methods based on image invariant local features do not rely on training using big data, and have advantages in the cases where the image has more occlusion, the image color information changes sharply, or the texture is sufficient. 
In summary, the review of the relevant literature reveals that visual positioning approaches using local feature matching and deep learning still suffer from drift and positioning failure in different scenes, in spite of improved accuracy and robustness.
Furthermore, many studies have focused on outdoor environments and on dedicated mobile terminals. Research on the indoor visual positioning of smartphones remains limited. Meanwhile, visual positioning methods based on local feature matching have an advantage in absolute positioning accuracy in areas with texture and can better adapt to the problem of indoor occlusion, which is very conducive to the application of precise positioning in indoor shopping malls, stations, and other environments. Therefore, based on the existing theories and techniques, this paper first studies an error elimination algorithm for a high-precision visual location feature library, an efficient feature library storage and retrieval strategy, and an algorithm for robust and accurate smartphone camera pose estimation. Then, an accurate indoor visual positioning approach based on a single image from a smartphone camera is proposed, which provides an effective method for the indoor visual positioning of smartphones, using local feature matching. Thus, this paper contributes to improving the level and status of visual positioning technology in indoor smartphone positioning applications.

Methodology
The proposed method uses images taken from an experimental indoor environment, using a Sony ILCE-5000 camera for the image database, and builds the positioning feature database by a precise database building strategy, implemented later in the paper. The positioning images are taken with smartphone cameras. By retrieving and registering with the positioning feature database, the position of the current image can be obtained. This paper proposes and implements a visual positioning method for smartphones, based on a single smartphone image under the C/S architecture. A workflow chart of an indoor visual positioning system under the C/S architecture is shown in Figure 1. In the positioning system, complete sequence images of the indoor scene are first obtained by an optical camera, following which, feature extraction and 3D object co-ordinate calculation are performed (on the server side) to establish an object feature library for visual positioning. Then, the user takes an image (as a positioning image) through the optical camera built into the smartphone, performs feature extraction on the smartphone end, and transmits the extracted image features and camera information to the server. The accurate pose of the positioning image is then calculated on the server side by using the positioning feature library established (on the server side) in advance. Finally, the accurate pose information of the positioning image is transmitted back to the user's smartphone and displayed, thereby realizing the self-positioning of the instantaneous pose of the smartphone camera.

Precise Positioning Feature Database Establishment
The use of pre-captured indoor images to establish the positioning feature database is a prerequisite for indoor visual positioning. The positioning feature database in this paper is a library file consisting of image point feature descriptors, image point co-ordinates, and 3D object point coordinates. It is used to provide 3D object points and image-matching information for positioning image matching and the EPnP algorithm, as shown in Figure 2. The establishment of the positioning feature database mainly includes image acquisition, bundle adjustment, and feature descriptor matching for SURF and 3D object co-ordinates. We first needed to take indoor images of the experimental environment before we established the positioning feature database. When shooting a complete indoor scene, we selected a commonly used camera-the Sony ILCE-5000 (Sony, Chonburi, Thailand)-which could capture photographs with a resolution of about 20 megapixels. It should be noted that, in order to reduce the influence of noise on image preprocessing, necessary texture information was needed in these images, and they needed to have a certain degree of overlap. Furthermore, camera calibration was done before shooting. After obtaining the indoor images, SfM [27][28][29] was used to preprocess these images, to achieve automatic bundle adjustment. Then, every camera pose of these images and every point's object space co-ordinates were obtained, and the projection matrix (PM) of every image could be simultaneously obtained [46]. According to the previous calculation, we could easily obtain the degree of overlap of the indoor images by using the pose information of every image; then, the images were selected in accordance with the principle of three-degree overlap (i.e., three adjacent images should have a certain overlap area), by the degree of overlap of the images. This served to effectively reduce redundant images from participating in subsequent image SURF feature extraction and matching.

Accelerated Image Feature Matching
In photogrammetric computer vision, high-precision image feature matching is a time-consuming and difficult procedure when the number of images is large. A good image feature matching algorithm and a coarse-to-fine matching strategy have typically been utilized to improve the computational efficiency and accuracy [48,49]. SIFT [21], ORB [21], and SURF [21,23] are three typical and representative image-matching algorithms for invariant local features. Among them, SURF has comprehensive advantages in computing speed, rotation robustness, blur robustness, illumination invariance, and scale invariance, which means it has good time efficiency and robustness in simultaneous image matching. Therefore, to solve the problem that indoor images are easily affected by light, shooting angle, and regional environment (which results in a poor matching effect and difficulty in local area matching), this paper proposes an improved high-precision image feature matching strategy based on the SURF operator. In the experiment, after using the SURF operator to extract and describe the feature points of the indoor image dataset, instead of using the brute-force matching method, the matching information between the images obtained by bundle adjustment was used to assist the SURF feature matching, thereby avoiding the time spent searching through all the feature points in the image set during feature point matching. However, it is difficult to obtain a good matching effect with a single constraint. In order to improve the matching accuracy, this paper introduces an epipolar constraint to further improve the matching results of the corresponding image points. In the experiment, the fundamental matrix (FM) is calculated by using the PM obtained by bundle adjustment, following which the epipolar lines of the corresponding image points can be solved by using the FMs. In this way, the epipolar lines of all image feature points can be calculated and used to eliminate mismatches. 
Thus, by increasing the degree of matching constraints, high-quality matching sets can be obtained at the same time.
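The epipolar check described above can be illustrated with a minimal sketch. It assumes that each image has a known 3x4 projection matrix and derives the fundamental matrix from the two matrices with a standard determinant formula; the function names, the toy camera values, and the pixel-distance test are illustrative assumptions, not the implementation used in the paper.

```python
from itertools import product

def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total

def fundamental_from_projections(P1, P2):
    """F (up to scale) with x2^T F x1 = 0 for 3x4 projection matrices P1, P2."""
    F = [[0.0] * 3 for _ in range(3)]
    for j, i in product(range(3), range(3)):
        # Stack P1 without row i on top of P2 without row j (a 4x4 matrix).
        rows = [r for k, r in enumerate(P1) if k != i]
        rows += [r for k, r in enumerate(P2) if k != j]
        F[j][i] = (-1) ** (i + j) * det(rows)
    return F

def project(P, X):
    """Project a 3D point X to homogeneous image co-ordinates."""
    Xh = list(X) + [1.0]
    return [sum(p * x for p, x in zip(row, Xh)) for row in P]

def epipolar_distance(F, x1, x2):
    """Pixel distance from x2 (image 2) to the epipolar line F x1."""
    a, b, c = (sum(F[j][i] * x1[i] for i in range(3)) for j in range(3))
    u, v = x2[0] / x2[2], x2[1] / x2[2]
    return abs(a * u + b * v + c) / (a * a + b * b) ** 0.5

# Toy stereo pair (assumed values): same intrinsics, second camera shifted along X.
P1 = [[800.0, 0.0, 320.0, 0.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P2 = [[800.0, 0.0, 320.0, -400.0], [0.0, 800.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
F = fundamental_from_projections(P1, P2)
X = (0.2, -0.1, 4.0)
x1, x2 = project(P1, X), project(P2, X)
good = epipolar_distance(F, x1, x2)                   # correct correspondence
bad = epipolar_distance(F, x1, [260.0, 250.0, 1.0])   # same point moved 30 px off the line
```

A candidate SURF match would be rejected when its distance to the epipolar line exceeds a threshold; the threshold value is a tuning choice that the text does not specify.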

Multi-image spatial forward intersection
The poses of the images from the indoor image database are calculated by bundle adjustment, and the matching point pairs in the images are obtained by the SURF matching algorithm and our strategy. These images are then used as database images. Thus, forward intersection can be used to obtain the 3D object points, as the object points have more than two observations from the images. Multi-image spatial forward intersection is adopted because some observations are outliers; thus, RANSAC is used to estimate the optimal solution, as it performs better than least squares when there are many outliers. As shown in Figure 3, the yellow points are the top view of the 3D object points, and the red points are the camera exposure points of the positioning image captured by smartphone cameras. After a geometry check, there were many outliers in the point cloud. After completing the above work, the positioning feature database can be established. It includes the 3D co-ordinates of the object points and descriptor information corresponding to each object point. The 3D co-ordinates of the object points can be expressed as P_n(X_n, Y_n, Z_n) and the corresponding feature descriptors can be expressed as (feature_{nn}, feature_{n(n+1)}, feature_{n(n+2)}, …), where n is a positive integer.
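As an illustration of multi-image forward intersection, the following sketch computes the least-squares 3D point closest to a set of observation rays, one ray per image. It is a simplified, ray-based formulation under the assumption that each observation has already been converted into a camera center and a viewing direction; it omits the RANSAC outlier handling mentioned above, and all names are hypothetical.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, 3):
            f = M[r][c] / M[c][c]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x

def intersect_rays(centers, directions):
    """Least-squares point closest to all rays center + s * direction."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = sum(v * v for v in d) ** 0.5
        d = [v / norm for v in d]
        for i in range(3):
            for j in range(3):
                w = (1.0 if i == j else 0.0) - d[i] * d[j]   # entry of (I - d d^T)
                A[i][j] += w
                b[i] += w * c[j]
    return solve3(A, b)

# Three assumed camera centers observing the same object point.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = (0.5, 0.5, 4.0)
directions = [[t - c for t, c in zip(target, ctr)] for ctr in centers]
point = intersect_rays(centers, directions)
```

In a RANSAC variant, this solver would be run on random subsets of the rays and the solution with the most consistent observations would be kept.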

Online Smartphone Indoor Visual Positioning
The process of online smartphone indoor visual positioning based on a single image from a smartphone camera includes the following steps: First, a single image is taken by the smartphone camera. Then, feature point extraction and description are performed, and similar feature descriptors are searched for in the positioning feature database. Finally, the pose of the smartphone is calculated and returned. The smartphone indoor visual positioning procedure is shown in Figure 4. In the experiment, Kd-Tree+BBF [50] was used to retrieve the similar image descriptors. After SURF feature matching, using minimum distance, and geometry check, using the fundamental matrix and PROSAC [51] to select the inliers, the final matching result was further purified by our proposed method, based on Hough Transform voting.

SURF Feature Retrieval and Matching in positioning feature database
After extracting the SURF features in the positioning image and establishing the descriptors, the feature point set, P, of the positioning image and the SURF descriptor subset, D, corresponding to P, are obtained. As we have an established positioning feature database and the smartphone camera interior parameters can be obtained from the Exif (Exchangeable image file format) file, the instantaneous shooting position of the smartphone can be calculated by matching the SURF features of the positioning image with the SURF features in the pre-established positioning feature database. In the experiment, for the SURF descriptor d_i (i = 0, 1, 2, …, n, where n is a positive integer) of a feature point p_i in the positioning image, if a brute-force search method is used to search and match the descriptors in the positioning feature database, it traverses all descriptors in the positioning feature database each time, and the positioning time is greatly increased. The Kd-tree is one of many high-dimensional spatial index structures and approximate query algorithms. It establishes an effective index structure by hierarchically dividing the search space, which greatly speeds up the retrieval. In image feature matching algorithms (e.g., SIFT and SURF), the standard Kd-tree index structure has been widely used for fast image feature comparison. However, its efficiency is closely related to the dimension of the feature vector. The higher the number of dimensions, the lower the efficiency. This is because the query completion process of each nearest neighbor eventually ends up falling back to the root node, resulting in unnecessary backtracking and node comparisons. When these extra losses occur in high-dimensional data lookups, the search efficiency becomes quite low. Incorporating BBF into the bilateral matching in the standard Kd-tree algorithm can largely alleviate this problem. 
In short, its improvement to Kd-tree is to sequentially sort the nodes in the "query path" to shorten the search time.
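The Kd-tree with best-bin-first search can be sketched as follows. This is a minimal pure-Python illustration rather than the implementation used in the experiments: it uses 8-dimensional toy vectors instead of SURF descriptors, and `max_checks` is a hypothetical parameter that caps the number of nodes examined. Unexplored branches are kept in a priority queue ordered by a lower bound on their distance to the query, so the most promising bins are visited first.

```python
import heapq
import random
from itertools import count

class Node:
    def __init__(self, point, index, axis, left, right):
        self.point, self.index, self.axis = point, index, axis
        self.left, self.right = left, right

def build(points, indices=None, depth=0):
    """Build a Kd-tree by splitting on the median, cycling through the axes."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return Node(points[indices[mid]], indices[mid], axis,
                build(points, indices[:mid], depth + 1),
                build(points, indices[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bbf_nearest(root, query, max_checks=200):
    """Nearest neighbour search that explores the closest unexplored bins first."""
    best_index, best_dist = None, float("inf")
    tie = count()                        # tie-breaker so heapq never compares Nodes
    queue = [(0.0, next(tie), root)]     # (lower bound on squared distance, _, subtree)
    checks = 0
    while queue and checks < max_checks:
        bound, _, node = heapq.heappop(queue)
        if bound >= best_dist:
            break                        # no remaining bin can contain a closer point
        while node is not None:          # descend towards the query's leaf
            checks += 1
            d = sq_dist(node.point, query)
            if d < best_dist:
                best_index, best_dist = node.index, d
            diff = query[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:
                heapq.heappush(queue, (max(bound, diff * diff), next(tie), far))
            node = near
    return best_index, best_dist

random.seed(0)
database = [[random.random() for _ in range(8)] for _ in range(300)]
tree = build(database)
query = [random.random() for _ in range(8)]
index, dist = bbf_nearest(tree, query, max_checks=10_000)
```

With a generous `max_checks` the search is exact; lowering it trades accuracy for speed, which is the usual reason for choosing BBF with high-dimensional descriptors.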
In the experiment, the Kd-Tree+BBF similar feature search strategy was used for corresponding feature matching between the feature descriptors of the positioning image and the feature descriptors in the positioning feature database. By traversing all the feature points in the positioning image and searching for their matching descriptors in the positioning feature database, a set of matching feature point pairs, M, can be obtained. After the query procedure, the matches are filtered by minimum distance, and a geometry check is carried out by using the fundamental matrix and PROSAC to select the inliers. In the experiment, PROSAC was used instead of RANSAC, mainly because it can effectively reduce the number of iterations and the time consumption when there are outliers in the matching points, as well as improving the speed and robustness of the matching error elimination algorithm. In order to compare the effects of the two methods, this paper introduces the precision-recall curve, which is calculated by using Equations (1) and (2):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (1)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
In Equations (1) and (2), TP is the number of real matching points which are predicted as matching points, FP is the number of real mismatching points which are predicted as matching points, and FN is the number of real feature matching points which are predicted as mismatching points.
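As a small worked example of these quantities, the sketch below computes precision and recall from a set of predicted matches and a set of true matches. Representing a match as an (image point id, object point id) pair is an assumption made for illustration.

```python
def precision_recall(predicted, truth):
    """predicted, truth: sets of (image_point_id, object_point_id) matches."""
    tp = len(predicted & truth)    # true matches predicted as matches
    fp = len(predicted - truth)    # mismatches predicted as matches
    fn = len(truth - predicted)    # true matches that were rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = {(0, 10), (1, 11), (2, 12), (3, 99)}
truth = {(0, 10), (1, 11), (2, 12), (4, 14)}
precision, recall = precision_recall(predicted, truth)   # TP = 3, FP = 1, FN = 1
```

Sweeping the acceptance threshold of the matcher and recomputing both values at each setting yields the points of a precision-recall curve.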

Matching Error Elimination Based on Hough Transform Voting Idea
After matching the SURF feature descriptors extracted from the smartphone positioning image with the feature descriptors in the pre-established positioning feature database, a series of image points from pre-located smartphone images and their corresponding object points are obtained; that is, each matching point pair includes a two-dimensional image point of a smartphone positioning image and a corresponding object point in three-dimensional space.
As the similarity degree of the image feature descriptors is used in the point matching process to find the corresponding relationship between 2D image points and 3D object points, even if the feature matching results (as obtained in Section 3.2.1) eliminate a large number of mismatched points by matching optimization, there will still be a certain number of mismatched point pairs in the corresponding points due to similar textures and other factors in indoor space. If the matching error in the corresponding points is not eliminated and the subsequent smartphone camera pose is directly solved by PnP, the estimated camera pose may have a large error or may not even be solved. Therefore, in order to meet the smartphone camera-pose calculation requirements, this paper hopes to eliminate such mismatches as much as possible, to improve the success rate and accuracy of the smartphone camera-pose solving. A matching error elimination is proposed here based on Hough Transform Voting Idea (HTVI). This section concisely and clearly introduces the proposed mismatching elimination method based on the Hough transform voting idea, further purifying the matching point pairs (i.e., those obtained in Section 3.2.1).
The Hough transform is an image feature recognition and extraction technique which finds a particular type of shape by voting in the parameter space [52,53]. The simplest Hough transform is straight-line detection; a brief introduction follows. A straight line in two-dimensional space is shown in Figure 5a. The equation of the line can be represented in polar co-ordinates: $$r = x\cos\theta + y\sin\theta$$ where r is the distance from the origin to the nearest point on the red straight line (called the polar path), and θ is the angle between the blue dashed line and the X-axis (called the polar angle). Each straight line corresponds to a pair of parameters (r, θ). This two-dimensional parameter space is a Hough space, which can be used to represent the collection of all two-dimensional straight lines. According to the principle of the Hough transform, if the co-ordinate of a two-dimensional point is known, then all straight lines passing through this point become a sinusoid in the Hough space. For ease of understanding, let us use an example to illustrate the Hough transform straight-line detection method. Suppose there are three points (4, 3), (3, 4), and (2, 5) in two-dimensional space. As shown in Figure 5a, these three points satisfy the collinearity condition. Converting these three points into the Hough space yields three sinusoids, as shown in Figure 5b. It can be seen from the figure that the sinusoids of the three points in the Hough space intersect at one point. According to such characteristics, the feature points extracted from the image can be converted into the Hough space, and the position of the intersection of the sinusoids gives the parameters of the straight-line equation. Therefore, finding the intersection of the sinusoids in a set of sinusoidal curves in the Hough parameter space is the key. 
Straight-line detection in Hough space essentially uses a voting idea, which can be divided into three steps: First, the Hough parameter space is quantized into a series of finite intervals (or accumulator boxes). Then, the points that may be straight lines are converted into a sinusoidal function in Hough space, and the number of votes in the corresponding accumulator box is increased, according to the areas where the sinusoids are distributed in Hough space. Finally, the object most likely to be a straight line is detected by looking up the local maximum value in the accumulator.
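The three voting steps can be sketched with the three collinear points from the example, (4, 3), (3, 4), and (2, 5). The quantization steps (1 degree for θ, 0.1 for r) are assumed values chosen for illustration.

```python
import math

def hough_lines(points, theta_step_deg=1, r_step=0.1):
    """Vote in the quantized (theta, r) space; return the accumulator as a dict."""
    votes = {}
    for x, y in points:
        # Each point contributes one sinusoid r(theta) = x cos(theta) + y sin(theta).
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            r = x * math.cos(theta) + y * math.sin(theta)
            cell = (theta_deg, round(r / r_step))   # accumulator box
            votes[cell] = votes.get(cell, 0) + 1
    return votes

points = [(4, 3), (3, 4), (2, 5)]
votes = hough_lines(points)
(theta_deg, r_bin), count = max(votes.items(), key=lambda kv: kv[1])
r_estimate = r_bin * 0.1
```

The three points lie on the line x + y = 7, whose polar parameters are θ = 45° and r = 7/√2 ≈ 4.95, so the accumulator box with the most votes falls at that angle.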
When a positioning image is taken with a smartphone camera, the light ray from the object point through the center of photography intersects the image plane, and this intersection point is the image point corresponding to the object point. Put simply, for an image, the photography center at the time of image capture, the image point on the image, and the object point corresponding to the image point are on the same straight line. This idea is demonstrated in Figure 6. S is the photography center, P is the object point, and p is the image point. According to the previous introduction, when a smartphone captures an image for positioning, it first needs to match the feature descriptors extracted from the smartphone positioning image with the feature descriptors in the positioning feature database. Then, matching point pairs in which the image points of the positioning image are in one-to-one correspondence with the object space 3D points are obtained. The object point and the image point in each matching point pair can be connected to obtain a straight line. Thus, the matching point pairs (of Section 3.2.1) correspond to straight lines in 3D space. In order to reduce the computational complexity, we project the 3D space onto the ground and simplify it into a 2D plane; that is, we project these straight lines in space onto the ground to obtain a set of 2D straight lines. In this paper, we denote such a straight-line sequence by L.
Ideally, every line in L passes through the projected photography center. The premise of this hypothesis is that most of the matching point pairs obtained are correct and that only a small number are mismatched. In fact, the subsequent experiments show that this assumption holds. In this way, we only need to remove the lines that do not pass through the photography center from the line sequence, L. This paper draws on the idea of voting in the Hough transform to perform mismatching point culling. The biggest difference from the Hough transform is that the method proposed in this paper does not vote in the parameter space. Instead, the areas that the straight lines pass through are voted for directly in the projected two-dimensional plane space. The area with the highest voting value can be regarded as the area where the photography center is located. A line that does not pass through this area can be considered to be a mismatch for its corresponding matching point pair, which can be eliminated.
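The voting idea described above can be sketched as follows, under several assumptions that are not taken from the paper: each match is already reduced to a 2D line on the ground plane (a point plus a direction), the plane is divided into square cells of an assumed size, and a line is kept when it passes within one cell width of the winning cell's center. The synthetic data place the photography center at (2.2, 3.3) and add two mismatched lines.

```python
import math

def cells_on_line(p, d, extent, cell):
    """Grid cells crossed by the line p + s*d inside the square [-extent, extent]^2."""
    norm = math.hypot(d[0], d[1])
    ux, uy = d[0] / norm, d[1] / norm
    cells = set()                         # a line votes at most once per cell
    steps = int(4 * extent / (cell / 2))  # sample the line at half-cell spacing
    for k in range(-steps, steps + 1):
        s = k * cell / 2
        x, y = p[0] + s * ux, p[1] + s * uy
        if abs(x) <= extent and abs(y) <= extent:
            cells.add((math.floor(x / cell), math.floor(y / cell)))
    return cells

def vote_and_filter(lines, extent=10.0, cell=0.5):
    """Return the center of the most-voted cell and the indices of the kept lines."""
    votes = {}
    for p, d in lines:
        for c in cells_on_line(p, d, extent, cell):
            votes[c] = votes.get(c, 0) + 1
    best = max(votes, key=votes.get)
    center = ((best[0] + 0.5) * cell, (best[1] + 0.5) * cell)
    kept = []
    for i, (p, d) in enumerate(lines):
        # Perpendicular distance from the winning cell center to the line.
        dist = abs((center[0] - p[0]) * d[1] - (center[1] - p[1]) * d[0]) / math.hypot(*d)
        if dist <= cell:                  # assumed tolerance: one cell width
            kept.append(i)
    return center, kept

S = (2.2, 3.3)                            # synthetic photography center
lines = []
for a in [0, 45, 90, 135, 20, 70, 110, 160]:
    P = (S[0] + 5 * math.cos(math.radians(a)), S[1] + 5 * math.sin(math.radians(a)))
    lines.append((P, (S[0] - P[0], S[1] - P[1])))   # correct match: line through S
lines.append(((-8.0, -8.0), (1.0, 0.0)))            # mismatch
lines.append(((8.0, -5.0), (0.0, 1.0)))             # mismatch
center, kept = vote_and_filter(lines)
```

The cell size controls the trade-off: smaller cells localize the center more tightly but tolerate less noise in the projected lines.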

Single Image Positioning
In this section, using the matching feature points, the corresponding 3D object points of the feature points from the smartphone positioning image are obtained. Camera-pose calculation methods can then be used to calculate the extrinsic parameters of the positioning image. Perspective-n-point (PnP) is a method for solving 3D-to-2D point pair motion. It estimates the pose of the camera when shooting images by obtaining n 3D object spatial points and their projected positions in the image. Figure 7 illustrates the P3P problem; P3P is one of the common methods for solving PnP problems. As it can obtain good motion estimation from only a few matching points, it has been considered to be the most important camera-pose estimation method. In Figure 7, O is the camera's optical center; and a, b, and c are the 2D projection points on the image plane corresponding to 3D object points A, B, and C, respectively.
Although P3P is an important and common method for solving PnP problems, it cannot make full use of information and is susceptible to noise and mismatching points. In order to solve this problem, a better improvement method is to use EPnP (Efficient PnP) for pose solving. It can make use of more information and optimize the camera pose in an iterative way, in order to eliminate the influence of noise as much as possible. In this paper, the EPnP algorithm and bundle adjustment (BA) are used to solve the camera pose. In addition, other methods, such as UPnP (Uncalibrated PnP), have been widely used to estimate camera pose in different situations. We compare them later, in the Experimental Analysis section.
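To make the quantity optimized in pose estimation concrete, the following sketch computes the reprojection error of a candidate pose and the camera position it implies. It assumes a distortion-free pinhole model with rotation matrix `R`, translation `t`, focal lengths `fx`, `fy`, and principal point `cx`, `cy`; these names are illustrative and do not come from this paper, and the sketch shows neither the EPnP solver nor the BA optimizer itself.

```python
import math

def project(point, R, t, fx, fy, cx, cy):
    """Project a 3D object point to pixel co-ordinates (pinhole, no distortion)."""
    X, Y, Z = point
    xc = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
    yc = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
    zc = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
    return (fx * xc / zc + cx, fy * yc / zc + cy)

def reprojection_rmse(points3d, pixels, R, t, fx, fy, cx, cy):
    """Root mean square distance between observed and reprojected image points."""
    total = 0.0
    for P, (u, v) in zip(points3d, pixels):
        pu, pv = project(P, R, t, fx, fy, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(points3d))

def camera_center(R, t):
    """Camera position in object space: C = -R^T t."""
    return tuple(-sum(R[k][i] * t[k] for k in range(3)) for i in range(3))

# Toy check: identity rotation, camera at the origin.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pts = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)]
pix = [project(P, I3, (0, 0, 0), 1000.0, 1000.0, 500.0, 400.0) for P in pts]
err = reprojection_rmse(pts, pix, I3, (0, 0, 0), 1000.0, 1000.0, 500.0, 400.0)
```

A pose that explains the matching point pairs well gives a small reprojection error, and the planar position of the smartphone corresponds to the X and Y components of the camera center.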

Test Data and Experimental Environment
In the experiment, three different indoor scenes were selected as experimental environments to evaluate the proposed method: two indoor scenes decorated with pictures of different materials were set up as experimental fields, and a real conference room was used as the third scene. Figure 8 shows the decorated experimental rooms, and Figure 9 shows the real conference scene. The database images were taken with a Sony ILCE-5000 at an image size of 5456 × 3632. There were 149, 151, and 100 images in the positioning feature databases built for Room 212, Room 214, and the conference room, respectively. To evaluate the precision of the positioning, a non-prism total station (Leica TS60) was used to measure the camera position of the smartphone in the experiment; the value measured by the total station was taken as the ground truth. It was difficult to measure the smartphone camera directly, as the surface of the camera is glass. Therefore, a ring crosshair was affixed to the camera for aiming and automatic tracking measurement by the TS60. Figure 10 shows the Leica measurement robot, the ring crosshair affixed to the smartphone, and the interface of the experimental app. The purple circle in Figure 10c shows the solved instantaneous 2D co-ordinates when the positioning image was taken. In the experiment, we selected evenly distributed positions in the experimental rooms and held the smartphones at these positions to capture the positioning images, while using the TS60 to measure the ring crosshair on the smartphones. After measuring the offset, the smartphone camera position was acquired. The indoor area of each decorated experimental room was approximately 120 square meters, although Room 212 has slightly more space than Room 214. We selected 23 and 20 positioning image capture points in Rooms 212 and 214, respectively, and rectangular plane (X, Y) co-ordinate systems were established in the two rooms.
Huawei Honor 10 and Samsung Galaxy S8 smartphones were used to capture the positioning images at these points, and two images were captured at different orientations at each point. Figure 11 shows some of the smartphone positioning images. In Room 212, the Huawei and Samsung smartphones each obtained 46 experimental positioning images; in Room 214, each obtained 40. The image resolution of the Huawei Honor 10 is 3456 × 4608, and that of the Samsung Galaxy S8 is 3024 × 4032. It should be noted that current indoor positioning services are more concerned with the planar position on a given floor of the indoor space than with the height. This is because, when people use a smartphone, its height fluctuates within a small range. People can easily estimate the height of a smartphone based on information such as their stature, and this estimate generally does not deviate much from the true height. Therefore, when using a smartphone for positioning and navigation indoors, we often only need to know the planar (i.e., ground) position. Hence, in the experiment, our focus was to verify the accuracy of the planar co-ordinate values for smartphone positioning in the indoor space. In addition, to show how the present study advances the existing state of the art, the original method, which uses the image with the most matches to calculate the smartphone camera pose, as in [1], was used as a baseline method. We compared the two methods in a real scene. The desktop computing environment was the Windows 10 operating system with an Intel(R) Core(TM) i7-7820HK and 32 GB RAM.
Figure 13 is the precision-recall curve of RANSAC and PROSAC with the experimental images in Figure 12.
It can be seen from Figure 13 that the PROSAC algorithm has advantages over the RANSAC algorithm in terms of precision at the same recall, especially when the recall is between 0.55 and 0.8. In this range, there were enough interior points, and the precision was high. This is consistent with conclusions previously drawn in the literature, namely that the recommended recall value is around 0.65, which ensures that the image-matching interior point set satisfies the requirements of both number and quality [51]. Figure 14 shows the time cost of RANSAC and PROSAC against the proportion of interior points on the pair of indoor matching images shown in Figure 12. It can be seen that the average time cost of PROSAC was significantly lower than that of RANSAC. This was because the random sampling in the RANSAC algorithm leads to more iterations. In general, the average number of iterations was greater than one, so the time cost of the RANSAC algorithm was relatively large. As PROSAC presorts the matching points by quality, it can obtain better samples during the sampling process, such that the number of iterations is far smaller than that of the RANSAC algorithm. Generally, one iteration was enough to obtain the correct model. As the percentage of interior points increased, the probability that RANSAC selected an interior point when randomly selecting samples became larger, and the success rate of obtaining the correct model increased correspondingly, such that the number of iterations decreased and the time cost became smaller. The running time of PROSAC was almost independent of the proportion of interior points, and it was more robust to sample error.
Figure 15 shows the 10 experimental smartphone positioning images in this experiment.
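The dependence of the RANSAC time cost on the proportion of interior points can be illustrated with the standard estimate of the number of random samples needed to draw at least one all-inlier sample with a given confidence. This is a textbook formula rather than a result of this paper; the function name and the default values (a sample size of 4 and a confidence of 0.99) are assumptions made for illustration.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=4, confidence=0.99):
    """Samples needed so that, with probability `confidence`,
    at least one sample consists only of interior points."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if inlier_ratio == 1.0:
        return 1  # every sample is outlier-free
    p_good_sample = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

# The required number of samples drops quickly as the inlier ratio rises.
for w in (0.3, 0.5, 0.7, 0.9):
    print(w, ransac_iterations(w))
```

Under these assumptions, an interior-point proportion of 0.5 requires 72 samples, whereas a proportion of 0.9 requires only 5, which matches the trend described for RANSAC above. PROSAC draws its first samples from the best-ranked matches, so it is far less sensitive to this ratio.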
Table 1 compares the planar positioning results obtained by the PnP method for the 10 smartphone positioning images, before and after the matching error elimination based on HTVI. If the location result is beyond the range of the test room, the result is considered wrong; the case where no positioning result can be output is called a positioning failure. In other cases, the positioning is successful. The results in Table 1 show that the positioning success rate was 80% and the correct rate was 50% before using the matching error elimination based on HTVI detailed in Section 3.2.2. After using the proposed method, the positioning success rate and correct rate were both 100%, and the average positioning error decreased from 0.98 to 0.61 m. Although many factors affect the success and accuracy of positioning, this experiment reflects the effect and significance of further eliminating mismatches.

Comparison Experiment of Three Camera-Pose Estimation Methods
In order to compare the three most commonly used camera-pose estimation methods (i.e., PnP, EPnP, and UPnP), we carried out a relative pose recovery experiment, using the pair of indoor images presented in Figure 12. The experimental results obtained are shown in Table 2. In this experiment, n was 3. It can be seen that EPnP was superior to the other methods in both accuracy and time, so the EPnP method was selected for camera-pose estimation in this paper.

Experimental Accuracy Evaluation with Decorated Indoor Scene
In the experiment, as the root mean square error (RMSE) reflects the precision of the measurement well, this paper uses the RMSE for accuracy evaluation. The RMSE values in the X and Y directions and the total error were calculated, which are denoted by ΔX, ΔY, and ΔD, respectively. In addition, the mean square error of a point (MSEP) is used to calculate the offset between the measured value of each positioning image and the true value of the point where it is located. Equations (4) and (5) are their respective mathematical expressions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(M_i - G_i\right)^2} \tag{4}$$

$$\mathrm{MSEP}_i = \sqrt{\left(X_{measure,i} - X_{ground\_truth,i}\right)^2 + \left(Y_{measure,i} - Y_{ground\_truth,i}\right)^2} \tag{5}$$

In Equation (4), M_i is the measured value and G_i is the ground truth corresponding to M_i. In Equation (5), X_measure,i and Y_measure,i are the measured co-ordinate values of the positioning image, and X_ground_truth,i and Y_ground_truth,i are the ground truth corresponding to the measured co-ordinate values. The value of i ranges from 1 to n, where n is a positive integer. Tables 3 and 4 show the RMSE values of the two smartphone positioning experiments in the experimental indoor spaces Room 212 and Room 214, respectively. From the perspective of overall co-ordinate accuracy, the positioning accuracy of the proposed method was at the decimeter or centimeter level, which is much better than that of other indoor positioning technologies, such as Bluetooth, PDR, and Wi-Fi. Of course, visual positioning is inherently a highly accurate positioning technique in an indoor space with sufficient image textures; the results of this paper also confirm this. As mentioned above, in order to demonstrate the effectiveness of the proposed method, we used different smartphones to capture the positioning images in different rooms and under different conditions, such as different viewpoints, positions, illumination, distances, indoor decorative textures, and materials.
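As a worked illustration of these metrics, the following sketch computes the RMSE per axis, the per-image MSEP, and a total planar error from measured and ground-truth planar co-ordinates. The function names are hypothetical, and combining the axis errors as ΔD = sqrt(ΔX² + ΔY²) is an assumption of this sketch rather than a definition taken from the text.

```python
import math

def rmse(measured, truth):
    """Root mean square error between measured values and ground truth."""
    n = len(measured)
    return math.sqrt(sum((m - g) ** 2 for m, g in zip(measured, truth)) / n)

def msep(measured_xy, truth_xy):
    """Planar offset between one measured position and its ground truth."""
    return math.hypot(measured_xy[0] - truth_xy[0], measured_xy[1] - truth_xy[1])

def planar_rmse(measured_pts, truth_pts):
    """Return (dX, dY, dD); dD = sqrt(dX^2 + dY^2) is assumed here."""
    dx = rmse([p[0] for p in measured_pts], [p[0] for p in truth_pts])
    dy = rmse([p[1] for p in measured_pts], [p[1] for p in truth_pts])
    return dx, dy, math.sqrt(dx ** 2 + dy ** 2)

# Two made-up positioning results (in meters) against their ground truth.
measured = [(1.0, 2.0), (3.0, 4.0)]
truth = [(1.3, 2.4), (3.0, 4.0)]
dx, dy, dd = planar_rmse(measured, truth)
offsets = [msep(m, g) for m, g in zip(measured, truth)]
```

With this choice, ΔD equals the RMSE of the per-image MSEP values, so the two metrics remain consistent with each other.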
As shown in Tables 3 and 4, there were significant differences in the positioning accuracy of the two brands of smartphones between the rooms, whereas the differences in positioning accuracy between the two brands in the same room were much smaller. As can be seen from Figure 8a, in order to avoid the influence of outdoor ambient light in the rooms, we selected two symmetrical rooms on the same side of the building as the experimental spaces. Moreover, the indoor positioning images were taken under the same indoor lighting conditions, at the same time. Based on the above considerations, we believe that the main reasons for the positioning accuracy in Room 214 being significantly better than that in Room 212 were the interior decoration texture and the room size. As the difference in indoor space between the rooms was small, the most important influence on the positioning accuracy was the interior decoration texture. Figure 11 shows three pairs of positioning images from the two rooms. The decorative paintings posted in Room 212 were made of plastic paper and copper paper, and the decorative paintings in Room 214 were made of fabric. It is easy to see that the textures in Room 212 had noticeable reflections and that its wall decoration texture was not as rich as that of Room 214. The experimental results show that these differences clearly affected the positioning accuracy. The difference in positioning accuracy between the Samsung Galaxy S8 and the Huawei Honor 10 is likely to be mainly due to the higher imaging quality and resolution of the latter's camera. Therefore, the Huawei Honor 10 achieved slightly better positioning results in the experiment. It must be noted that, in terms of the RMSE metric, the proposed method achieves precise positioning results.
Table 3. Accuracy evaluation of positioning results in Room 212.
Figures 16 and 17 show the co-ordinate offset between the visual positioning measurements and the ground truth for the different smartphones in Rooms 212 and 214, with the method proposed in this paper. In Figure 16a, the errors of points 1, 8, 10, 11, 13, 14, 16, 17, 21, and 22 were larger than 15 cm. In Figure 16b, the errors of points 4, 9, 16, and 18 were larger than 15 cm. In Figure 17a, the errors of points 12, 14, 15, 19, and 20 were larger than 15 cm. In Figure 17b, the errors of points 4 and 8 were larger than 15 cm. The errors are given in Tables 5 and 6. In the corresponding smartphone positioning images, the capture distances of these points were long and the image shooting angle was large. In addition, windows occupied the majority of the frame in some images, resulting in fewer available textures. These factors are problems that must be overcome in image matching, which is fundamental to image positioning technology. Although this paper has done a lot of work on image matching and proposed a reliable, precise, and fast image-matching strategy, it is still difficult to deal with all situations.

Experimental Accuracy Comparison in Real Indoor Scene
To prove the effectiveness of our proposed method, we conducted further experiments in a real indoor scene and compared the accuracy and robustness with one of the existing similar methods. The baseline method was the original method, using the image with the most matches to calculate the extrinsic parameters of the smartphone camera (as in [1]), which was proposed by scholars at the University of California, Berkeley. As shown in Figure 9, the experimental environment was a real conference scene room with different kinds of furniture and furnishings. In this experimental scene, we took 32 positioning images, using the Huawei smartphone. The MSEP of each positioning image was calculated for both our method and the baseline method. A statistical table of positioning error results is shown in Table 7. It can be seen that the accuracy of our method was higher than that of the baseline method. In addition, there were two image-localization failures in the comparison method. This shows the effectiveness and robustness of our method.
In Figure 18, the accuracy difference between the two positioning methods can be seen more easily. In terms of the RMSE metric, the proposed method achieved 0.132 m, and the baseline method achieved 0.289 m. In addition, we added people to the positioning images, as the presence of people makes the captured positioning images more consistent with a real indoor conference scene. As shown in Figure 19, there were four control group images. Through this undesired occlusion, we can evaluate the stability of local feature matching in actual image-based positioning. Table 8 shows the positioning error results calculated by the two methods for the four groups of control data. We can see that the positioning accuracy changed. In terms of the RMSE metric, after adding the occlusions, the overall positioning accuracy of both methods was reduced. The RMSE values of our method were 0.084 and 0.135 m before and after the occlusion was added, respectively. The RMSE values of the baseline method were 0.309 and 0.323 m before and after the occlusion was added, respectively.

Discussion
In a method that relies on a 3D point cloud, the accuracy of the positioning feature database has a great influence on the positioning result on the smartphone. Moreover, indoor environments are challenging for visual positioning because they contain repetitive or similar textures, weak textures, or textureless regions. In weak-texture or textureless scenes, there is either no result or the results are inaccurate. In repetitive-texture scenes, because there are repetitive features, the number of overlapping images is key. Although the positioning accuracy and stability of the proposed work have been demonstrated in different experimental scenes, it is a prototype system for smartphone indoor visual positioning. Specifically, the positioning feature database is processed offline. In this part, based on the existing classical matching algorithms and strategies, our main contribution is to add an epipolar constraint based on the fundamental matrix and a matching image-screening strategy based on image overlap during construction of the positioning feature database, which helps to reduce noise points in the feature point cloud. Matching images with the feature point cloud instead of the database images improves the efficiency of the localization procedure [22]. In online smartphone indoor visual positioning, a Kd-Tree+BBF strategy ensures the retrieval efficiency of the positioning image features, and the PROSAC algorithm is used instead of the RANSAC algorithm for matching optimization. In addition, the final matching points are generated by our proposed mismatch elimination method based on HTVI, which improves the inlier ratio, time cost, and matching point distribution. These changes can be seen in Figures 12-14. At the same time, Table 1 shows that the robustness and accuracy of the positioning were significantly improved after using the matching error elimination based on HTVI.
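The epipolar constraint mentioned above can be sketched as follows. For a fundamental matrix F and a match (x1, x2) between two database images, the point x2 should lie on the epipolar line F·x1, so matches far from that line can be rejected as noise. The threshold `max_dist` and the function names are hypothetical, and the sketch assumes that F has already been estimated.

```python
import math

def epipolar_distance(F, x1, x2):
    """Pixel distance from x2 to the epipolar line F * x1 in the second image."""
    u, v = x1
    # Epipolar line coefficients (a, b, c): a*x + b*y + c = 0.
    a = F[0][0] * u + F[0][1] * v + F[0][2]
    b = F[1][0] * u + F[1][1] * v + F[1][2]
    c = F[2][0] * u + F[2][1] * v + F[2][2]
    return abs(a * x2[0] + b * x2[1] + c) / math.hypot(a, b)

def filter_matches(F, matches, max_dist=1.0):
    """Keep only the matches that satisfy the epipolar constraint."""
    return [(x1, x2) for x1, x2 in matches
            if epipolar_distance(F, x1, x2) <= max_dist]

# Toy fundamental matrix of a purely horizontal camera shift:
# corresponding points must share the same image row.
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
matches = [((10.0, 5.0), (20.0, 5.0)),   # consistent
           ((10.0, 5.0), (20.0, 8.0))]   # 3 pixels off the epipolar line
kept = filter_matches(F, matches)
```

In the toy case, only the first match survives. A symmetric version that also measures the distance from x1 to the epipolar line of x2 would be a natural refinement.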
From the accuracy evaluation and analysis of the experimental results in the decorated indoor scenes, the position error is not uniformly distributed, as shown in Figure 20. The X-axis in the figures is the location error in ranges of, for example, 0-0.03 m, 0.03-0.06 m, 0.06-0.09 m, 0.09-0.12 m, and so on. The Y-axis is the number of positioning images. In Figure 20a, the location error distribution is divergent. There exist large errors, such as 0.3 and 0.33 m, but the location error of 80.4 percent of all positioning images in the proposed method is smaller than 0.15 m. Among them, the location error of 73.9 percent of the Samsung smartphone's positioning images was smaller than 0.15 m, the location error of 87 percent of the Huawei smartphone's positioning images was smaller than 0.15 m, and the divergent errors contributed to a large RMSE. Figure 20b shows the corresponding error distribution.
From the comparison of different experimental environments and conditions in the decorated indoor scenes, we further found that the positioning error of the Huawei smartphone is smaller than that of the Samsung smartphone in the same experimental scene, and that the positioning error is also significantly lower in the experimental scene with richer texture. In the experiments in these two scenes, the location method and the images used to establish the positioning feature library are the same, and all the images were taken by following the same rules. The difference is that the positioning images come from two different smartphone cameras, and the texture in the two scenes is different. The Huawei smartphone images not only have a higher resolution, but more feature points are also detected in them, and the Huawei smartphone image scale is closer to that of the database images. Thus, better camera resolution and richer texture lead to better positioning accuracy.
This is also the reason why the location error can be significantly different in different situations when using the method proposed in this paper. In the experiment, the numbers of database images in Rooms 212 and 214 were 149 and 151, respectively. The positioning feature datasets generated from the images of the two rooms are each about 30 MB in size. If the number of dataset images is larger, more time is needed for positioning. To reduce the computational time, a coarse position can be used so that only the adjacent images are compared in the similar-feature lookup; alternatively, GPU acceleration can be used on the smartphone side. In engineering practice, using down-sampled images can also improve the computational efficiency.
From the accuracy evaluation and analysis of the experimental results in the real conference scene, it is clear that our method is more accurate and has a higher success rate than the baseline method. Moreover, in the occlusion control experiment, although the positioning accuracy of both methods was reduced, our method was still better than the baseline method. For further analysis, we compare the changes in the number of inliers used to calculate the camera pose by the different methods in Table 9. It can be seen that the new matching point error elimination algorithm proposed in this paper played an important role. In all comparative experiments, although our method finally obtained fewer inliers, it had a better correct inlier rate, which helped to obtain more accurate positioning results. At the same time, we also found that the number of matching points obtained by both methods generally decreased after people were added, which is caused by the occlusion. These results verify the effectiveness and robustness of the method and strategy proposed in this paper. It should be noted that the proposed method is only a prototype system for smartphone indoor visual positioning. When the positioning information is transmitted through the 4G network for a server-side positioning solution after the positioning image is captured, the total positioning time is 2-5 seconds; in some cases, it can reach eight seconds. When the positioning feature database is downloaded to the smartphone, the entire positioning calculation is completed on the smartphone, and the total positioning time after the positioning image is taken is between 0.3 and 1 second. The time difference between the two positioning modes is mainly due to the server-side positioning time being heavily dependent on the efficiency of the 4G network in transmitting the positioning information.

Conclusions
In this paper, an efficient automatic smartphone indoor visual positioning method based on local feature matching was proposed, which uses images with known intrinsic and extrinsic parameters to locate smartphones indoors. For the establishment of a precise positioning feature database, the proposed method uses a modified and extended high-precision SURF feature matching strategy and multi-image spatial forward intersection to obtain a point cloud. For online smartphone indoor visual positioning, a robust and efficient similarity feature retrieval method was proposed, in which a more reliable and correct matching point pair set is obtained through the use of a novel matching error elimination technology based on Hough transform voting. Finally, an online indoor visual positioning experiment for smartphones was realized using the fast and stable camera-pose estimation algorithm adopted in this paper. In the decorated experimental scenes, the results show that 88.6 percent of the positioning images achieved location errors smaller than 0.15 m (only two positioning images had location errors exceeding 0.3 m), indicating that the proposed method can achieve a precise positioning effect. Even in the more challenging scene of Room 212, 73.9 percent of the Samsung smartphone positioning images and 87 percent of the Huawei smartphone positioning images achieved location errors smaller than 0.15 m. For the real experimental scene, the results show that the positioning accuracy of our method was more than double that of the comparison method. In terms of the RMSE metric, the overall positioning accuracy was still better than 15 cm. In addition, the success rate of our method was better than that of the baseline method. These results all confirm the effectiveness and robustness of the proposed method.
Of course, the proposed method has some limitations. The object 3D points are obtained from feature point matching, as are the relationships between the positioning image points and the object 3D points. Much effort has been put into accurate and reliable image feature point matching to ensure that camera-pose estimation is less affected by mismatched points; however, this works well mainly when the indoor textures are similar and the illumination and perspective changes are small. Thus, in weak or invalid texture regions, there will be either no result or inaccurate positioning results. Furthermore, when the interior decoration and furnishings change greatly, the location feature library needs to be updated in a timely manner; otherwise, the positioning will fail or its accuracy will be poor. In future research, it is worth exploring how to improve the positioning accuracy and success rate by using more stable line features from indoor textures or information on the indoor building frame structure, which rarely changes.