1. Introduction
Urban localization is essential to the development of numerous IoT applications, such as the digital management of navigation, augmented reality, and commerce-related services [1], and is an indispensable part of daily life due to its widespread application [2]. For indoor areas, Wi-Fi-based localization has become extremely popular and many researchers are focused on this area [3,4,5]. However, the use of Wi-Fi in urban areas is still highly challenging, and positioning is limited to an accuracy of tens of meters, even in strong signal conditions [6]. As indicated in [7], the calibration of Wi-Fi fingerprinting databases and the density of Wi-Fi beacons in urban areas pose significant challenges. As a result, Wi-Fi is mostly suitable for indoor positioning. In the context of outdoor pedestrian localization, the global navigation satellite system (GNSS) is key to providing accurate positioning and timing services in open-field environments. Unfortunately, the positioning performance of GNSS in urban areas requires significant improvement due to signal blockages and reflections caused by tall buildings and dense foliage [8]. In these environments, most signals are non-line-of-sight (NLOS), which can severely degrade localization accuracy [9]; they cause large estimation errors if they are either treated as line-of-sight (LOS) or not used properly [10]. Therefore, efforts have been devoted to developing accurate urban positioning systems in recent years. A review of state-of-the-art localization technologies was published in 2018 [11]. Each of these technologies has its own advantages and limitations, and some solutions face further challenges in mobility, accuracy, cost, and portability. A pedestrian self-localization system should be sufficiently accurate and efficient to provide positioning information [12]. Currently available personal smartphones are equipped with various embedded sensors, such as gyroscopes, accelerometers, and vision sensors. These sensors can be used for urban localization, and also satisfy the requirements of being inexpensive, easy to deploy, and user friendly.
With the growth of smart cities, 3D city models have been developed rapidly and become widely available [13]. An idea known as GNSS shadow matching was proposed to improve urban positioning [14]. It first classifies the received satellite visibility by the received signal strength and then scans the predicted satellite visibility in the vicinity of the ground truth position. The position is then estimated by matching the satellite visibilities. Another method, the ray-tracing-based 3D-mapping-aided (3DMA) GNSS algorithm that incorporates pseudo-range measurements, has also been proposed [15]. The integration of shadow matching and range-based 3DMA GNSS is proposed in [16]. The performance of this approach in multipath mitigation and NLOS exclusion depends on the accuracy of the 3D building models [17]. In recent years, interest has increased in inferring positions using 3DMA and vision-integrated methods. The motivation is that these are complementary methods, which in combination can provide rich scenery information. This is largely because high-performance modern smartphones provide cameras and a computing platform for storage, data processing, and fusion, which can be easily exploited. The general idea behind most of these approaches is to find the closest image to a given query picture in a database of pose-tagged images (three-dimensional position and three-dimensional rotation, adding up to six degrees of freedom [DOF]).
Research has demonstrated that it is possible to obtain precise positioning by matching a camera image against a database of images. One popular approach uses a sky-pointing fisheye camera to detect obstacles and buildings in the local environment [18]. When used in conjunction with image processing algorithms, this approach allows matching of the building boundary skyplot (skymask) to obtain a position and heading.
To date, several studies have examined the use of smartphone images to estimate the position of the user. Google's recently developed feature-based visual positioning system (VPS) identifies edges within the smartphone image and matches these with edges captured from pre-surveyed images in its map database [19]. The position-tagged edges are stored in a searchable index and are updated over time by the users. Another area of study focuses on semantic information, such as identifying static location-tagged objects (doors, tables, etc.) in smartphone images for indoor positioning [20]; however, reference objects are often limited in outdoor environments. Thus, other researchers have studied the use of the skyline or building boundaries to match with smartphone images [21,22,23,24]. This provides a mean positional error of 4.5 m and a rotational error of 2–5° in feature-rich environments [21].
Although both methods are suitable in urban areas where GNSS signals are often blocked by high-rise buildings, the former requires features extracted from pre-surveyed images for precise localization, suffers from image-quality dependency, and requires frequent updates using crowd-sourced data supplied by users. By comparison, the latter suffers from obscured or non-distinctive skylines, which are prominent in highly urbanized areas where dynamic objects dominate the environment. Thus, detection based solely on edges and the skyline may not be sufficient for practical use and precise positioning. From the perspective of pedestrian navigation, in addition to identifying features and the skyline, humans also locate themselves based on visual landmarks that consist of different semantic classes, each with a material of its own. These high-level semantics are a new source of positioning information that does not require additional sensors, and many modern smartphones are already equipped with high-performance processors that can identify them. Semantic segmentation models are steadily improving and currently achieve an accuracy of about 85% on city landscapes [25].
Therefore, inspired by existing methods, our proposed solution applies the semantic VPS by utilizing different types of materials that are widely seen and continuously distributed in urban scenes. The proposed method offers several major advantages over the existing methods.
First, we take advantage of building materials as visual aids for precise self-localization, overcoming inaccuracies due to a non-distinctive or obscured skyline, which are common in urban environments.
Second, the semantic VPS uses building information modeling (BIM), which is widely available in smart cities, due to its existing use in construction, thus eliminating the need for pre-surveyed images. Hence, it is highly scalable and low cost.
Third, unlike storing feature data as 3D point clouds in a searchable index, the semantics of materials are stored as the properties of the objects in the BIM, enabling simple and accurate updates to be undertaken.
Finally, the proposed method identifies and considers dynamic objects in its scoring system, which have usually been neglected in previous studies.
Thus, this study comprises interdisciplinary research that integrates the knowledge of BIM, geodesy, image processing, and navigation. We believe this interdisciplinary research demonstrates an excellent solution to provide seamless positioning for many future IoT applications.
The remainder of this paper is organized as follows.
Section 2 explains the overview of the proposed semantic VPS approach.
Section 3 describes the candidate image generation, material identification, and image matching in detail.
Section 4 describes the experimentation process and verifies the improvement of the proposed algorithm against existing advanced positioning methods.
Section 5 presents the concluding remarks and future work.
2. Overview of the Proposed Method
An overview of the proposed semantic VPS method is shown in Figure 1. The method is divided into two main stages: an offline process and an online process.
In the offline process, the building models are segmented into different colors based on material, which ensures a faithful representation of the materials in the BIM (Section 3.1). The segmented city model is used to generate cubic projections at each position (Section 3.2), which are then converted into equirectangular projection images (Section 3.3) for later comparison. By storing the images in an offline database within the smartphone, we derive a memory-efficient representation of accurate reference images suitable for smartphone-based data storage.
Based on the generated images, we propose a semantic VPS method for smartphone-based urban localization. In the online process, the user captures an image with their smartphone (Section 3.4), with the initial position estimated by the smartphone GNSS receiver and IMU sensors. Then, candidates (hypothesized positions) are spread across a search grid based on the initial position (Section 3.5). The smartphone image is then segmented based on the identified types of materials (Section 3.6). The segmented smartphone image is transformed into an equirectangular projection image (Section 3.4) to be matched with the candidate images using multiple metrics that calculate similarity scores (Section 3.7). The scores of the metrics are combined to calculate the likelihood of each candidate (Section 3.8). The chosen position is determined by the candidate with the maximum likelihood among all the candidates (Section 3.9). The details of the proposed method are described in the following section.
4. Experimental Results
4.1. Image and Test Location Setting
In this study, the experimental locations were selected within the Tsim Sha Tsui and Hung Hom areas of Hong Kong, as shown in Table 2. Three locations were selected in challenging deep urban canyons surrounded by tall buildings, where GNSS signals are heavily reflected and blocked. Three images were taken at each of the selected locations using a generic smartphone camera (a Samsung Galaxy Note 20 Ultra 5G with an ultra-wide 13 mm 12-MP f/2.2 lens) and a tripod. The experimental ground truth positions were determined based on Google Earth and nearby identifiable landmarks, such as a labelled corner on the ground. The ground truth uncertainties of latitude, longitude, and yaw were bounded based on the experience of previous research [18,32]. The pitch and roll angles were measured using a Manfrotto XPRO geared head.
The experimental images were chosen to cover the following skyline categorizations: distinctive, symmetrical, insufficient, obscured, and concealed. The categorizations were based on the difficulties experienced by current 3DMA GNSS and vision-based positioning methods. The smartphone was used to capture the images and to record the low-cost GNSS position and IMU rotation. The GNSS receiver within the smartphone was a Broadcom BCM47755, and the IMU was an STMicroelectronics LSM6DSO MEMS sensor. Images were taken at each location with different combinations of scenic features to demonstrate the proposed semantic VPS method. The locations were chosen to test the following environments: dense foliage (Loc. 1), street (Loc. 2), and alleyway (Loc. 3).
4.2. Positioning Results Using Ideal Segmentation
The positioning quality of the proposed method was analyzed based on the ideal manual segmentation of the smartphone image. The experimental results were then post-processed and compared to the ground truth and to different positioning algorithms, as shown in Table 3, including:
Proposed semantic VPS (combination of the Dice, Jaccard, and BF metrics)
Proposed semantic VPS (Dice only)
Proposed semantic VPS (Jaccard only)
Proposed semantic VPS (BF only)
Skyline matching: matching using the sky and building classes only [21].
3DMA: integrated 3DMA GNSS solution combining shadow matching, skymask 3DMA, and likelihood-based ranging GNSS [33].
WLS: weighted least squares [34].
NMEA: low-cost GNSS solution from the Galaxy S20 Ultra (Broadcom BCM47755).
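For a single material class, the three similarity metrics used by the proposed variants can be sketched on binary masks as follows. These are NumPy-only illustrations: the BF tolerance value and the wrap-around dilation are simplifying assumptions, and a multi-class image would average the per-class scores.

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks of one material class."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom else 1.0

def jaccard(a, b):
    """Jaccard index (intersection over union) of two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def _boundary(mask):
    # True pixels that touch a False 4-neighbour (border counts as outside).
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def _dilate(mask, tol):
    # Square structuring element via np.roll; the wrap-around at image
    # edges is a simplification acceptable for small `tol`.
    out = np.zeros_like(mask)
    for dy in range(-tol, tol + 1):
        for dx in range(-tol, tol + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def boundary_f1(a, b, tol=2):
    """BF score: F1 of boundary pixels that lie within `tol` pixels of the
    other mask's boundary (contour rather than region agreement)."""
    ba, bb = _boundary(a), _boundary(b)
    if not ba.any() and not bb.any():
        return 1.0
    precision = (ba & _dilate(bb, tol)).sum() / max(ba.sum(), 1)
    recall = (bb & _dilate(ba, tol)).sum() / max(bb.sum(), 1)
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0
```

Dice and Jaccard reward region overlap, while the BF score rewards contour agreement, which is why they behave differently at locations with repeated material distributions, as discussed below.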
Loc. 1 is in an urban environment with dense foliage and multiple non-distinctive medium-rise buildings. The results show that the positioning accuracy of the proposed semantic VPS improves upon the existing advanced positioning methods. An error of approximately 5.56 m from the ground truth suggests that the semantic VPS can be used as a positioning method in foliage-dense environments. By utilizing additional material information from buildings, this approach improves the performance of skyline matching three-fold. The poor performance of skyline matching was due to foliage obscuring the skyline; without an exposed skyline, a correct match cannot be obtained and the positioning error may increase. 3DMA corrected the positioning to a higher degree, ranking behind the proposed method. The positioning errors of WLS and NMEA were likely caused by the diffraction of GNSS signals passing under the foliage, combined with the surrounding high-rise buildings.
As shown in the heatmap in Table 4, the proposed method using only the Dice or Jaccard metric has very large positioning errors, possibly due to the lack of distinctive materials captured in the smartphone image. The tested location is surrounded by buildings of the same shape, size, and material; therefore, it is a very challenging environment for the proposed method, because the candidate images share a common material distribution. In this situation, using the BF metric achieves a higher positioning accuracy than the Dice and Jaccard metrics, because it evaluates the material contour rather than the material region. Thus, with the combination of the three metrics, this foliage-dense environment proved suitable for the proposed method, which successfully utilized materials as matching information.
Loc. 2 is in a common urban street environment with high-rise buildings. The results show that the proposed method improves the positioning accuracy to around two meters. In an environment where skyline matching should perform best, the proposed method still improves upon skyline matching by more than three-fold. The matching of the diverse materials distributed in the scene, in addition to the distinctive skyline, significantly improved the positioning accuracy. 3DMA lagged slightly behind skyline matching, whereas WLS increased the positioning error. It should be noted that the estimated positioning error of the NMEA solution is around 8 m, which is significantly less than that at Loc. 1. This is likely due to the relatively open area along the street, as shown in Table 2.
The heatmap results shown in Table 4 demonstrate that the metrics complement each other when combined. As shown in Loc. 2.1, in a scene with diverse materials, the Dice and Jaccard metrics achieve a higher positioning accuracy and a higher likelihood than BF. The combination of the three metrics therefore also supports region-based similarities.
Loc. 3 is clearly the most challenging urban environment for the 3DMA GNSS and vision-based positioning methods due to the close and compact high-rise buildings and visually symmetrical features. All methods suffer in this environment, most noticeably WLS. The results show that the positioning error of the proposed method is nearly 16 m, leaving significant room for improvement. Nonetheless, it should be noted that this is a 35% improvement over skyline matching. Due to the lack of a distinctive skyline, skyline matching can potentially increase the positioning error if matched with the wrong image, as demonstrated at this position. 3DMA lags behind the proposed method; as demonstrated, only the proposed method and 3DMA slightly improved the positioning accuracy.
The poor results can be explained by the two conditions required for accurate positioning. First, the images should ideally have no segmentation error. This error is not considered in the positioning results, because we are assessing ideal image segmentation; instead, we analyze the segmentation error in relation to the positioning error in Section 4.4. Second, there should ideally be no discrepancies between the smartphone image and the candidate image at the ground truth. Loc. 3 suffers from the latter, as shown in Table 5. This error is visible in the positioning results of Loc. 3, where many candidates share a common similarity and color. Thus, it is important to ensure the BIM is constantly updated to reflect reality.
4.3. Rotational Results Using Ideal Segmentation
The three-dimensional rotational performance of the proposed method was analyzed based on the ideal smartphone image segmentation and then compared to the smartphone IMU, as shown in Table 6.
The results show that, in a feature-rich urban environment, building materials can be used to estimate the rotation. The yaw, pitch, and roll have accuracies of 2.3, 1.4, and 1.3 degrees, respectively. However, the smartphone IMU pitch and roll estimates are already very accurate, so the proposed method only degrades them. In contrast, the proposed method predicts the yaw accurately, to within an average of 2.3 degrees. Hence, the proposed method can be considered an accurate approach for estimating the heading of the user in an urban environment.
Therefore, it is suggested that the proposed method use the already accurate altitude, pitch, and roll, and estimate only the position and yaw. Eliminating the estimation of these three dimensions will significantly reduce the computational load because fewer candidate images are used for matching.
4.4. Segmentation Accuracy vs. Localization Results
To test the effect of the semantic segmentation accuracy on the localization results, we considered the two conditions required for accurate positioning. Ideally, there should be no segmentation error and no discrepancies between the smartphone image and the candidate image at the ground truth. We can therefore classify two types of errors: contour-based error and region-based error. In our experiments, we tested whether discrepancies contribute heavily to the positioning accuracy, as shown in Table 4, where the smartphone image differs from the candidate image at the ground truth. We can consider this a region-based error because entire regions differ between the images. We should also consider the contour-based error, which is not demonstrated in our experiments but is reflected in the realistic output of a semantic segmentation neural network, where the boundaries of a region are shifted. Contour error can be problematic for boundary-related metrics, such as the BF metric, which focus on evaluation along object edges. Correctly identifying these edges is very important, because any shift in alignment can lead to a mismatch with another candidate image. Thus, we considered the candidate images at the ground truth to be the ideal images, because they contain no region-based or contour-based errors. We purposely mislabeled the ideal images by adding the two types of noise to model the segmentation accuracy.
To model the two types of errors, we performed a Monte Carlo simulation. We randomly applied the elastic distortion described in [35] to the ideal image to generate over 1000 distorted images, each with a distinct region-based and contour-based error. We then compared each distorted image with the ideal image using two measures: the combined Dice and Jaccard metric for the region-based error, and the BF metric for the contour-based error. Finally, we used the proposed method to obtain a positioning error by comparing the positioning solution of the distorted image with the ground truth position.
Figure 2 shows the candidate image with the contour mislabeled using the elastic distortion algorithm.
Figure 3 shows the characteristics of position error in the presence of segmentation error.
The results show good positioning accuracy at lower levels of segmentation error: in the 0–20% segmentation error range, the positioning error is approximately 0–5 m. However, the proposed method begins to suffer when the incorrect segmentation exceeds 20% for contour-based errors and 25% for region-based errors, after which the positioning performance deteriorates and the positioning error increases to 10–20 m. At 40% contour- and region-based error, the matching algorithm fails to perform accurately, increasing the risk of greater positioning error. In this segmentation error range, the distorted image matches random incorrect candidate images; thus, the positioning error spreads across a wide region.
The Monte Carlo simulation results demonstrate the importance of correct contour-based and region-based segmentation and suggest that, to successfully utilize the proposed method with high positioning accuracy, a semantic segmentation neural network with no less than 80% segmentation accuracy is preferred. The results also suggest disabling the proposed method when the smartphone image matches a candidate image with a segmentation difference of more than 20–25%. In such situations, relying on other advanced positioning techniques, such as 3DMA, would likely yield better positioning results.
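Such a fallback rule could be implemented as a simple gate; the threshold value and function interface here are illustrative assumptions, not part of the paper's method.

```python
def select_fix(vps_pos, match_difference, fallback_pos, max_difference=0.25):
    """Disable the semantic VPS fix when the best-matching candidate differs
    from the smartphone segmentation by more than ~20-25%, and fall back to
    an advanced GNSS solution such as 3DMA instead.

    `match_difference` is 1 minus the combined similarity of the best
    candidate; `fallback_pos` is the position from the fallback method.
    """
    return vps_pos if match_difference <= max_difference else fallback_pos
```

In a deployed system the threshold would be tuned per environment, since the 20–25% figure was derived from the simulated distortions above.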
4.5. Discussion on Validity and Limitation
The proposed method presented in this research permits self-localization based on materials that are widely distributed in urban scenes. Provided that the smartphone image segmentation is ideal, the experiments show that our approach outperforms the positioning performance of current state-of-the-art methods by 45% and improves the yaw estimation eight-fold compared to smartphone IMU sensors.
The pitch and roll estimated by the proposed method, however, are half a degree less accurate than those from the smartphone IMU sensors. Hence, it is suggested that the proposed method use the already accurate pitch and roll estimated by the smartphone IMU sensors. Eliminating the altitude, pitch, and roll estimation will significantly reduce the computational load because fewer images are used for matching.
Another limitation is inaccurate segmentation. As demonstrated in this research, the BIM was out of date, leading to discrepancies between the smartphone image and the candidate images at the ground truth. It was shown that when the segmentation error is greater than 20–25%, the positioning performance deteriorates significantly. Therefore, it is necessary to frequently update the utilized 3D city model.