1. Introduction
Location-based services (LBS) play a crucial role in modern life [1,2]. Positioning is a fundamental technology for LBS, as well as for various applications within the Internet of Things (IoT) and artificial intelligence (AI) [1]. Meanwhile, indoor areas (e.g., shopping malls, libraries, and subway stations) are among the major activity areas of daily life [3]. Indoor positioning is extensively utilized in applications such as emergency rescue, tracking in smart factories, and mobile health services. Thus, high-accuracy indoor positioning techniques are important for navigation in GNSS-denied areas [4]. The increasing demand for indoor positioning services (IPS) drives the development of indoor positioning techniques, which provide reliable location information in areas with inadequate GPS coverage [5].
With the rapid development of micro-electromechanical systems (MEMS), various sensors, including accelerometers, gyroscopes, and magnetometers, are integrated into intelligent platforms (e.g., smartphones) to realize positioning [3,6,7]. Meanwhile, plenty of algorithms have been developed for indoor positioning based on various sensors as well as their fusion [8,9]. These methods include Wi-Fi-based positioning [10], UWB-based positioning [11], Bluetooth-based positioning [12], inertial navigation systems (INS) [13], and so on.
A visual positioning system (VPS) is also a promising solution for indoor positioning [14,15,16]. A VPS utilizes cameras to capture and analyze visual data (e.g., images and videos). Based on features extracted from these visual data, the system can realize high-accuracy localization in indoor areas [17,18]. Researchers have already dedicated considerable effort to this topic. Ref. [19] constructs a positioning system for the remote operation of a drone in the laboratory. The positioning method consists of a parallel tracking and mapping algorithm, a monocular simultaneous localization and mapping (SLAM) algorithm, and a homography-transform-based estimation method. Both quantitative and qualitative experimental results demonstrate the robustness of the proposed method. A visual-based indoor positioning method utilizing stereo visual odometry and an Inertial Measurement Unit (IMU) is proposed in [20]. This method uses the Unscented Kalman Filter (UKF) algorithm to realize nonlinear data coupling, fusing the IMU data with the stereo visual odometry. The experimental results indicate that the accuracy and robustness of mobile robot localization are significantly improved. Ref. [21] proposes a visual positioning system for real-time pose estimation using an IMU and an RGB-D camera. This VPS extracts geometric features from the camera's depth data and integrates their measurement residuals with those of the visual features and the inertial data in a graph optimization framework to estimate the pose. The experimental results demonstrate the usefulness of this VPS.
Although a variety of research exists on visual-based indoor positioning, most techniques rely on side-view visual data to realize high-accuracy positioning [22,23]. However, side-view visual data are easily disturbed by the surrounding environment. In daily life, urban areas contain different kinds of obstacles, including dynamic obstacles (e.g., surrounding pedestrians, pets, and other moving objects) and static obstacles (e.g., advertising signs and posters) [24]. Additionally, even in the same area, the side-view scene varies over time due to decoration and layout changes. The complexity and variability of real-world scenes therefore make image processing more difficult and decrease positioning accuracy. On the other hand, side-view visual data captured in urban areas usually contain surrounding pedestrians and places, which may raise privacy and security concerns. Pedestrians captured in side-view data may be subject to personal-privacy protections (e.g., portrait rights), and some places appearing in side-view data may be confidential and should not be disclosed to the public or captured.
To avoid these problems, we present the concept of up-view visual-based positioning, where "up-view" refers to the vertical perspective from bottom to top. Firstly, the scenes within up-view visual data are much simpler than those in side-view visual data. Secondly, the layout of up-view scenes (e.g., lights installed on ceilings) does not change as frequently as that of side-view scenes. Also, up-view scenes seldom raise privacy or security concerns. Moreover, up-view visual data can be captured by a smartphone's front camera, which is more in line with the natural posture of a pedestrian positioning with a handheld smartphone.
In this paper, we innovatively propose an up-view visual-based indoor positioning algorithm using YOLO V7. The method is composed of the following modules: (1) a deep-learning-based up-view landmark detection and extraction module, which is used for landmark detection and accurate edge extraction; (2) an up-view visual-based position calculation module, which is utilized to realize single-point position calculation; and (3) an INS-based landmark matching module, which matches the landmarks detected in up-view data to real-world landmarks. In our strategy, we use images as the visual data and the lights installed on ceilings as the up-view landmarks. First, we use the front camera of a smartphone and a self-developed application to automatically capture up-view images. Subsequently, the deep-learning-based up-view landmark detection and extraction module performs light detection and accurate edge extraction. In this step, we use YOLO V7 to detect the light within an up-view image. Based on the detection results, we obtain the gross region of the specific light and the pixel coordinates of the light centroid. Then, we compensate for the exposure of the gross region, and edge operators are employed to extract the accurate edge from the compensated image, outputting the accurate pixel size of the light. Next, we calculate the location of the smartphone based on the light detection and extraction results and the pre-labeled light sequence. Using the light's real size and its pixel size, we compute the scale ratio K between the pixel world and the real world. Then, we calculate the smartphone's location based on the scale ratio K, the light's pixel coordinates within the image, and the light's real-world size via the principle of similar triangles.
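The single-point position calculation described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (the camera points straight up, pixels are square, and the light's real-world position and width are known from the pre-labeled sequence); all function and parameter names are ours, introduced for illustration only.

```python
def up_view_position(light_xy, light_real_width, light_pixel_width,
                     light_centroid_px, image_center_px):
    """Estimate the smartphone's planar position from one up-view image.

    light_xy            -- real-world (x, y) of the matched light (metres)
    light_real_width    -- real-world width of the light (metres)
    light_pixel_width   -- extracted pixel width of the light (pixels)
    light_centroid_px   -- pixel coordinates of the light centroid
    image_center_px     -- pixel coordinates of the image centre
    """
    # Scale ratio K maps pixel distances to real-world metres via the
    # similar-triangles relation: real size / pixel size.
    k = light_real_width / light_pixel_width

    # Pixel offset of the light centroid from the image centre.
    du = light_centroid_px[0] - image_center_px[0]
    dv = light_centroid_px[1] - image_center_px[1]

    # With the camera pointing straight up, the smartphone lies directly
    # below the image centre, so its position is the light's position
    # shifted by the scaled pixel offset.
    return (light_xy[0] - k * du, light_xy[1] - k * dv)
```

For example, a 0.6 m wide light imaged at 120 px gives K = 0.005 m/px, so a centroid offset of (80, 60) px from the image centre places the smartphone 0.4 m and 0.3 m away from the light's ground-truth coordinates along the two axes.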
Besides the single-point up-view-based position calculation, it is also necessary to realize landmark matching during kinematic positioning; that is, we need to find, from the pre-labeled light sequence, the light that matches a specific up-view image. Therefore, we propose an INS-based landmark matching method. Assume that we successfully capture and extract the lights from the up-view images at epochs t−1 and t, and that we have already computed the smartphone's position at epoch t−1 with the up-view visual-based positioning method. The smartphone's position at epoch t can then be estimated from the accelerometer and gyroscope data. We find the light closest to this estimated position in the pre-labeled light sequence; this light is the one that matches the up-view image captured at epoch t and is used for the up-view-based positioning at that epoch.
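The matching step can be sketched as below. The simple step-and-heading dead-reckoning model and all names here are illustrative assumptions, not the paper's exact INS mechanization; the key idea is propagating the previous up-view fix with inertial data and taking the nearest pre-labeled light.

```python
import math

def match_landmark(prev_pos, heading_rad, step_length, lights):
    """Match the current up-view image to a light in the sequence.

    prev_pos    -- smartphone position fixed at epoch t-1 (metres)
    heading_rad -- heading derived from gyroscope data (radians)
    step_length -- displacement derived from accelerometer data (metres)
    lights      -- pre-labeled light sequence as (x, y) tuples
    """
    # INS prediction: propagate the previous fix by one displacement step.
    est_x = prev_pos[0] + step_length * math.cos(heading_rad)
    est_y = prev_pos[1] + step_length * math.sin(heading_rad)

    # Nearest-neighbour search over the pre-labeled light sequence.
    def dist_sq(light):
        return (light[0] - est_x) ** 2 + (light[1] - est_y) ** 2

    return min(lights, key=dist_sq)
```

The returned light supplies the real-world coordinates and size used by the single-point position calculation at epoch t.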
The main contributions of this work are as follows.
- (1) We innovatively propose an up-view visual-based indoor positioning algorithm, avoiding the shortcomings of side-view visual-based positioning.
- (2) We propose a deep-learning-based up-view landmark detection and extraction method. This method combines the YOLO V7 model with edge detection operators, realizing high-accuracy up-view landmark detection and precise landmark extraction.
- (3) We propose an INS-based landmark matching method. This method uses IMU data to estimate the smartphone's position and finds the closest point in the pre-labeled light sequence, realizing the matching between up-view images and the light sequence.
- (4) To verify the feasibility of the up-view-based positioning concept, we conducted static experiments in a shopping mall near Aalto University and in the laboratory of the Finnish Geospatial Research Institute (FGI). We also conducted kinematic experiments in FGI's laboratory to further verify the performance of the proposed positioning algorithm.
The rest of this paper is organized as follows. Section 2 introduces the proposed up-view visual-based indoor positioning algorithm: Section 2.1 illustrates the positioning mechanism, Section 2.2 explains the deep-learning-based landmark detection and extraction method, Section 2.3 introduces the up-view visual-based position calculation method, and Section 2.4 describes the INS-based landmark matching method. Section 3 gives the experimental design and the results. Finally, the conclusion is given in Section 4.