Object-Based Reliable Visual Navigation for Mobile Robot

Visual navigation is of vital importance for autonomous mobile robots. Most existing practical perception-aware based visual navigation methods generally require prior-constructed precise metric maps, and learning-based methods rely on large training to improve their generality. To improve the reliability of visual navigation, in this paper, we propose a novel object-level topological visual navigation method. Firstly, a lightweight object-level topological semantic map is constructed to release the dependence on the precise metric map, where the semantic associations between objects are stored via graph memory and topological organization is performed. Then, we propose an object-based heuristic graph search method to select the global topological path with the optimal and shortest characteristics. Furthermore, to reduce the global cumulative error, a global path segmentation strategy is proposed to divide the global topological path on the basis of active visual perception and object guidance. Finally, to achieve adaptive smooth trajectory generation, a Bernstein polynomial-based smooth trajectory refinement method is proposed by transforming trajectory generation into a nonlinear planning problem, achieving smooth multi-segment continuous navigation. Experimental results demonstrate the feasibility and efficiency of our method on both simulation and real-world scenarios. The proposed method also obtains better navigation success rate (SR) and success weighted by inverse path length (SPL) than the state-of-the-art methods.


Introduction
Over the last few decades, autonomous mobile robots have gained increasing attention for various applications, such as indoor service, surveillance missions, and search-andrescue. Safe and reliable autonomous navigation is of crucial importance for mobile robots to execute their main tasks in complex environments. Vision-based navigation has become a popular research area due to the richness and practicality of vision sensors [1]. Most existing practical indoor vision navigation methods focused on path planning with a prior precise metric map, such as an occupancy grid map [2] and dense map [3]. Generally, these maps are constructed with Simultaneous Localization And Mapping (SLAM) algorithms, which can perform well in a conditional ideal environment [4]. In spite of their remarkable results, some challenging environments, such as unstructured indoor areas and dynamic objects, pose great challenges for the performance of visual navigation methods. With the development of deep learning, the learning-based visual navigation methods demonstrate strong navigation performance to the above problems [5][6][7], while they need a large number of training datasets to improve generalization capabilities. Therefore, it is necessary to exploit a reliable and feasible visual navigation method.
In general, the performance of practical map-based visual navigation is mainly affected by the variations of the environment [8]. Many state-of-the-art algorithms [9][10][11][12] demonstrate that robust object feature detection is significant for improving the reliability of visual navigation. In [9][10][11], based on the precise metric map, they improved the reliability of path planning by fusing semantic object information to achieve safe and efficient visual navigation. Li Tang et al. present a topological local-metric framework, which achieves long-term autonomous navigation through the fusion of object-based topological associations and local metric information [12]. Thus, to achieve reliable visual navigation, in this paper, we propose a novel object-based topological path planning method, unlike the above approach, which tightly connects objects through a semantic topological graph structure and performs reliable global path planning based on the semantic guidance of the objects. The proposed method mainly includes two issues: effective global path searching and feasible trajectory generation.
On the one hand, for global path searching, an object-constrained topological pathsearching method is proposed. Firstly, a lightweight topological semantic map is constructed to reduce the dependence of visual navigation on a high-precision map, by modifying our previous work [13] with the graph memory and topological associations. Then, based on the constructed map, an object-based heuristic graph search method is proposed to effectively search global topological paths. We extract the highly-dimensional semanticgeometric features of objects based on 3D object detection and use the multi-attribute constraints on the feature to provide heuristic evaluation for graph search. What's more, a robot-centric relative topological association constraint is proposed to provide weights for graph search in the absence of global poses. With the presented method, an optimal and shortest global path is obtained, which improves the reliability of the visual navigation system.
On the other hand, for trajectory generation, to reduce the effect of the global cumulative error on visual navigation, we propose a novel segmented smooth trajectory generation and refinement method based on object guidance and Bernstein polynomial parameterization. In [14,15], it is also confirmed that segmented trajectories can improve the accuracy of path planning. Inspired by this, we convert the trajectory generation problem into a nonlinear programming problem. First, an active visual perception and object guidance strategy is proposed to achieve the effective segmentation of global paths. Then, the trajectory of segmentation is represented based on the multi-constraint property of Bernstein polynomials. Based on the proposed method, we can obtain smooth and dynamically feasible trajectories, and the reliability of the proposed visual navigation method is further improved.
The illustration of the proposed object-level topological path planning for visual navigation is shown in Figure 1. The main contributions are summarized as follows: • A novel object-based visual navigation method is proposed, where an object-constrained topological path-searching method is proposed for the first time to significantly release the dependence on a precise metric map and improve the reliability of visual navigation. • A segmented smooth trajectory generation and refinement method is proposed, based on the object guidance and Bernstein polynomial parameterization. We implement adaptive smooth trajectory generation to further improve the effectiveness and efficiency of global path planning. • Experimental results on both simulation and real-world scenarios validate the feasibility and efficiency of our methods. (a,b) shows two different indoor scenarios. The blue boxes represent the 3D object detection of object-level landmarks. The red dots indicate the nodes of the topological map. The yellow lines indicate the edges of the topological map. The green curve is the feasible navigation trajectory generated based on the proposed method.

Related Work
In this section, we briefly review visual navigation methods from the views of navigation map representation, path searching, and trajectory optimization.

Navigation Map Representation
For navigation map representation, classical metric-based navigation map representations are well established by accurately encoding the 3D geometric information of the environment. Multiple representations have been constructed by different methods, for example, in [16][17][18][19][20], they represent maps as precise and sparse 3D landmarks by specific visual features (such as point feature and line feature). However, they cannot perform global path planning and obstacle avoidance tasks. In contrast, dense representations attempt to provide high-resolution models of the 3D geometry [3]. Available sensors also facilitate the construction of highly accurate geometric maps, such as occupancy grid maps [2]. Although these models are more suitable for obstacle avoidance and path planning for mobile robots, they also typically require the storage of large amounts of data, and the high-precision maps are not easy to maintain and scale [5]. Other works try to improve the reliability of map representation by embedding semantic information [19][20][21][22]. Currently, topology maps are widely explored to represent the environment in abstract graphs, achieving a simple and compact lightweight representation [23][24][25]. However, the pure topological solution is not suitable for robot navigation that requires metric guidance [12]. Thus, some work constructs topological representations that are highly consistent with the metric map [26] or embeds local metric information [12]. Although it makes the map lightweight, it also limits the high scalability of topological maps. Moreover, due to the widespread successful application of deep learning, the learning-based map representation approach has also attracted a lot of interest. Some topological methods work with human-like exploration of pre-established topological maps [6,7]. It relies on a large number of labeled datasets and cannot be applied well to an unfamiliar environment. Other approaches build topological representations that incorporate semantics [27]. Inspired by the above work, our work focuses on making visual navigation lightweight, scalable, robust, and efficient. We intend to achieve this by integrating the high-dimensional semantic and geometric information of objects into the structure of the graph, which enables reliable navigation capabilities.

Path Searching
The topological representation of the environment already provides a discrete working space for path planning. Most works that perform path search in this type of workspace are also referred to as C-space search. Depending on the way of discretizing the C-space search, path-searching methods can be subdivided into two predominant groups [28]: the sampling-based method and graph search method. Rapidly Random Tree (RRT) [29] is the most famous sampling-based path search method, which stimulates the growth of a tree, starting from a starting point and dynamically creating branches [30]. Later, lots of path-searching methods have been proposed by improving the RRT method, such as RRT-Connect [31], Heuristic RRT (RRT*) [32], Probabilistic Roadmap Method (PRM) [33], and Fast Marching Tree (FMT*) [34]. The sampling-based path-searching algorithm is asymptotically optimal, which means that the number of samples may be larger over time to approach the global optimal solution [35] and requires the use of larger memory resources to store all the samples. The graph search-based path search algorithm completely or partially accesses the constructed topology graph until it finds a path connecting the initial point and the goal point. The classical graph search-based path search algorithm is the Dijkstra algorithm [36]. Currently, several improved versions are obtained by speeding up the search, reducing the computational complexity, including A* [37], Dynamic A* [38], Lifelong Planning A* (LPA*) [39], and Hybrid A* [40]. In this paper, inspired by the A* algorithm, we propose an object-based heuristic path-searching method, which performs heuristic evaluation by semantic-geometric features of objects and weight constraints based on topological associations between objects.

Trajectory Generation
Generally speaking, the existing initially searched paths cannot be directly executed by the robot due to dynamic constraints and the inherently poor smoothness of the paths. Therefore, a control function is required to parameterize the paths, which generates smooth trajectories and adjusts them to the robot's motion constraints. In the CHOMP proposed by Ratliff et al. [41], the optimal solution is approximated from the feasible path utilizing the gradient optimization technique. Van Den Berg et al. [42] optimized the obtained initial path by differential dynamic programming. In HOOP [43], quadratic programming is employed to transform the trajectory refinement into a higher-order segmented polynomial processing. In [44], they proposed a generative algorithm for minimizing snap that represents the trajectory as a piece-wise polynomial function. Savkin et al. [45] proposed a method for smooth trajectory generation based on curvature constraints. However, this method assumes that the obstacles are smooth and constrains the robot model, which makes it difficult to apply to real-world complex scenarios. In [46,47], Bernstein bases are used to perform trajectory optimization, and they directly generate safe and dynamically feasible trajectories, which confirms the feasibility of the algorithm. In this paper, we first perform the segmentation of global paths by active visual perception and object guidance. After that, the individual segmented paths are parameterized based on Bernstein polynomials to generate smooth and feasible trajectories via transforming the trajectory optimization problem into a nonlinear programming problem.

Overview of the Framework
In this paper, we propose an object-level topological path planning method that enables mobile robots to perform reliable visual navigation tasks in the absence of a precise navigation map. Inspired by the idea of human-like navigation, we allow the robot to observe its surroundings through vision sensors and estimate ego-motion based on semantic-geometric information about objects and association information between objects in the environment.
The overall framework of the proposed object-level topological path planning is described in Figure 2. We take the outputs of the lightweight navigation mapping, global graph search, as well as trajectory generation and refinement modules to produce dynamically feasible global paths. The whole framework is composed of two sections. Firstly, an object-constrained topological path-searching method is applied to object-based visual navigation, which includes topological semantic mapping and object-based heuristic graph search, as described in Section 4. Topological semantic mapping provides a lightweight environment model for visual navigation (Section 4.1). Object-based heuristic graph search provides an optimal shortest initial path for visual navigation (Section 4.2). Secondly, on the one hand, a global path segmentation strategy based on active visual perception and object guidance is proposed to divide the extracted initial global topological path into multiple segments, which reduces the global cumulative error in navigation, as shown in Section 5.1. On the other hand, Bernstein basis [47] is used to generate smooth and dynamically feasible trajectories, which improves the effectiveness and efficiency of path planning (Section 5.2).

Object-Constrained Topological Global Path Searching
In this section, to extract the initial global path, we achieve the object-constrained topological global path searching via the lightweight object-level topological semantic mapping and an object-based heuristic graph search. In this paper, the overall pathsearching cost is ( f n = g n + d n ). g n is a heuristic evaluation by matching the multi-attribute constraint based on the semantic-geometric feature of the goal object. d n is the relative topological association constraint of the topological edge. Finally, a global topological path is obtained, which is a unidirectional graph composed of adjacent topological edges concatenated together, as expressed in Equation (1). The illustration of the object-based heuristic graph search method is shown in Figure 3. chair   table   table   chair   bookshelf   robot   sofa   table   chair   table   table   chair

Representation of the Object-Level Topological Semantic Map
The map is a fundamental representation of aspects of interest (e.g., landmarks, obstacles) describing the environment of robot operates [4]. In vision navigation, the robot needs to compute the position information of landmarks from arbitrary locations to assess the perceptual quality of candidate landmarks. Objects are the basic constituents of the environment. Inspired by the way humans navigate, we pay more attention to recent relative motion between objects other than the highly accurate global position for global navigation. Therefore, in this paper, referencing our previous work [13], we represent the environment by using a lightweight abstract topology graph that records the relative associations between objects. The basic structure of this map is a graph defined as G = {N, E}, where N and E denote the nodes and edges of the graph, respectively. The representation of the map is shown in the semantic scene sub-graph in Figure 4. Node Representation: We take the objects in the scene as topological nodes and define the nodes with the semantic properties of the objects themselves, such as class and color. Thus, for each node N i belonging to N, the corresponding properties are defined as N i = {ID, class, center, n i }. ID is the serial number added to the graph in order. center(x, y, z) is the coordinate of the object center point obtained by fusing deep information. n i is defined as a proxy for additional properties of the node. For example, color, functional and operational properties, 6D pose (position and orientation), etc. To achieve accurate matching between objects, different from traditional visual features, we propose a multi-attribute high-dimensional semantic-geometric feature for object-level nodes, which includes attributes such as category, color, and 3D geometric center. The multi-attribute high-dimensional semantic-geometric feature is obtained as described in Section 4.2.1.
Edge Representation: For topological edges, the associated object-level node relative relationship attributes are used to define, for example, the relative direction and distance between nodes. Thus, for each edge E ij , connecting the neighboring nodes N i and N j , belonging to E, the corresponding properties are defined as E ij = dis ij , yaw ij , e ij . dis ij is a rigid relative distance between node N i and N j , which is also used as the weight of the topology graph. yaw ij is a relative direction between node N i and N j in the geomagnetic coordinate system. We introduce IMU to calculate the magnetic declination angle. e ij is defined as a proxy for additional properties of the edge.

Heuristic Evaluation Based on Semantic-Geometric Feature
In this section, to meet the requirements of object-based heuristic path searching, a highly robust semantic-geometric feature is constructed, which fuses the high-dimensional semantic properties of objects with geometric spatial properties. On the one hand, the multiattribute constraint matching based on semantic-geometric features between objects can provide heuristic evaluation (g n ) for path searching and improve the efficiency of path searching. On the other hand, semantic-geometric features of objects can effectively improve the robustness of visual navigation under scene changes, such as illumination and occlusion. The illustration is shown in Figure 5.  For the above reasons, we use VoteNet [48], the 3D object detection method that achieves advanced performance, to extract semantic information and geometric spatial features of the object (scale size, centroid coordinates, pose, etc.). VoteNet is proposed inspired by the Hough voting-based 2D object detector, which votes on virtual centroids of objects directly from a point cloud and generates a set of high-quality 3D object proposals by aggregating voting features [49]. Thus, it reaches the state-of-the-art performance on two large indoor benchmark datasets (ScanNet [50], SunRGBD [51]). In particular, to improve the differentiation of objects of the same class, we modified VoteNet by making it possible to output the color properties of objects directly. Firstly, we select the smallest outer rectangle of the object based on the projection of the 3D box onto the 2D image to extract the salient area of the object and eliminate the background interference. Then, we maintain color invariance by converting the original RGB space into a more robust HSV space [52]. Finally, we select the color with the largest area percentage as the object color property. The experimental results are shown in Figure 6. In addition, to increase the detection accuracy of VoteNet in the real-world environment, we collect and add some data for training based on the data format of the SunRGBD dataset. Since there is no open-source annotation tool for the existing indoor dataset, we make a homemade tool for indoor 3D point cloud annotation inspired by the PNP approach [53]. The annotation process is shown in Figure 7. Marking is divided into three main steps. Firstly, a cube frame is surrounded by the road marker to match its scale before annotation. After that, a corner point of the cube frame is used as the origin of the world coordinate system. The world coordinates and pixel coordinates of the extracted corner points are converted to the cubic frame's pose by PNP and used as the pose of the landmark. Finally, all the point cloud data are corrected to the horizon level. We will open-source the income of this tool soon.

L，W，H AxisCorner3D
Corner2D, Class PnP Corner3D The result of 3D object detection in the real-world environment is shown in Figure 8. The experimental results show that the 3D object detection method has high robustness.

Robot-Centric Relative Topological Association Constraint
The high-dimensional semantic-geometric feature detection of objects has provided highly robust visual features for environment modeling and path searching. However, the semantic information learned by the model has not yet been considered relevant to autonomous robot movement. The information from object detection still needs to be translated into information in the robot coordinate system. To ensure effective path searching through the constructed topological map without global poses, we propose a robot-centric relative topological association constraint to obtain the weight of the shortest path searching. The topological association information of the environment is shown in Figure 9a. In the constructed topological map, we represent the relative directions between adjacent nodes as direction vectors d = (x 2 − x 1 , y 2 − y 1 , z 2 − z 1 ) in the calculation by converting the landmark and robot rigid body coordinate as well as the robot rigid body coordinate and the geomagnetic coordinate. As shown in Figure 9b, since the direction of the geomagnetic coordinate system is usually constant, based on the principle of coordinate invariance of vectors, the relationship between adjacent fixed nodes does not change with time and space. The specific implementation of the pseudocode is shown in Algorithm 1. Finally, d n is obtained by calculating the Eulerian distance (d n = (x 2 − x 1 ) 2 + (y 2 − y 1 ) 2 + (z 2 − z 1 ) 2 ) of the topological edge.

Object-Guided Topological Trajectory Generation and Refinement
To obtain smooth and feasible trajectories, in this section, the initial topological path is refined based on real-time object guidance and Bernstein polynomials. Firstly, the topological path is segmented based on the spatial information of the local objects detected in real time. Then, the segmented path is refined by parameterizing them with Bernstein polynomials to form a smooth trajectory as the final global trajectory.

Object-Guided Trajectory Segmentation and Refinement Strategy
To reduce the impact of global cumulative error, a global path segmentation strategy based on active visual perception and object guidance is proposed, which divides the initial global path into multiple segments to maintain the navigation error in the local area and reduce the computational complexity of the trajectory generation process. Firstly, as the navigation case illustrated in Figure 10, for a given global topological path, when the robot navigates in the map, the robot body is used as the starting point, and the active visual observation and the matched objects are used as the intermediate guidance point to guide the robot to the next unknown target point through positional permutation. The above process is executed cyclically until the goal location is navigated. In addition, we conducted several experiments to verify the effectiveness of the global segmentation strategy proposed above, as shown in Figure 11. On the one hand, we plot the variation curve of the global cumulative translation error of the goal object position with the increase of the number of landmarks on the global path, as shown in Figure 11a. The experimental results show that the cumulative translation error will have an exponential growth trend with the increase of the number of landmarks on the global path. Thus, the proposed global segmented path strategy is beneficial to eliminate the effect of global cumulative translation error on path planning. On the other hand, we plotted the variation curve of the computation time of the Bernstein polynomial with the increase of the number of nodes, as shown in Figure 11b. The experimental results show that the time cost of computation tends to increase exponentially with the increase in the number of nodes. Therefore, the segmentation of paths facilitates the execution of online global path planning.

Landmarks Unknown
Robot Figure 10. Illustrates of the object-guided topological trajectory generation and refinement. Secondly, it is important to generate a smooth and dynamically feasible trajectory for robust and accurate autonomous navigation of the robot. Thus, in this paper, we propose a novel trajectory refinement strategy by turning global path planning into a nonlinear planning problem and using properties useful for Bernstein polynomials to accomplish these tasks. According to the Bernstein polynomial theorem, all continuous functions on the interval [a, b] can be approximated by polynomials, as shown in Figure 12a. In the case of this paper, it is also known as a Bezier curve B(t), which has some special constraint properties [47]: The schematic diagram of the proposed global segmented smooth trajectory generation is shown in Figure 13.

Bernstein Basis Segmental Trajectory Formulation
A two-dimensional, nth-order Bernstein polynomial, B n t , is defined as: where p i n ∈ R 2 , (i = 0, . . . , n) is the set of control points of the nth piece of the Bezier curve. b i n (t) is the Bernstein polynomial basis, which is defined as: where n i = n! i!(n−i)! is the binomial coefficient. Our online segmented trajectory generation is based on the end-point constraint property of the Bessel curve. The robot body is used as the first control point in each segment. Based on the order of the nodes in the one-way graph, the first node in the one-way diagram that is currently not detectable in the current local view range is used as the last control point, as shown in Figure 13. Therefore, according to the real-time local environment, the nodes in the one-way graph visible during navigation are used as current control points, and thus, the control points within each segment are dynamically variable, p i n,m = p i n,1 , p i n,2 , . . . , p i n,m ∈ R 2 , m denotes m-segments.
According to the number of shortest path nodes and the number of visible range nodes, the set of m-segments can be expressed as shown in Equation (4): Based on the de Casteljau algorithm constraint property, the overall formula can be expressed as: It is worth noting that when the last node is seen, it will become a one-order Bernstein polynomial.

Experiments
In this section, to evaluate the proposed method well, we perform a series of experiments on both simulation and real-world environments. We mainly compare our method with two types of navigation methods: a learning based navigation method and a classical metric-map based navigation method. All experiments have been run on a PC with Intel i7-8700K CPU 3.7GHz, NVIDIA GTX1080GPU, and 24GB RAM. The experimental video link is at https://github.com/CASHIPS-ComputerVision/Paper-videos (accessed on 2 March 2022).

Multi-Constrained Local Path Planning Strategy
To enable the proposed object-level topological path planning approach to perform practical navigation tasks, we first integrate a multi-constrained local path planning strategy into the overall framework to achieve local obstacle avoidance, as shown in Figure 14.
The multi-constrained local path planning strategy takes the endpoint of each global segmented trajectory as the stage goal point of local planning. To obtain the local linear and angular velocities, as shown in Equation (6), it determines the optimal local trajectory by calculating the overall loss cost (C n ) of the sampled trajectory. The overall loss cost includes the global segmented Bessel trajectory cost constraint (Gdist), the velocity cost constraint ( 1 x 2 ), the pointing target cost constraint (Ddist), and the obstacle cost constraint (obs). The velocity cost constraint ( 1 x 2 ) is the translational component of the sampled velocity on the current trajectory, which keeps the robot's velocity within a certain range.

Visual perception Global segmented smoothing path
Global path cost Local obstacles cost Goal front cost Velocity cost Figure 14. The block diagram outlines the specific modules in the proposed multi-constrained local path planning strategy.

Environment
• Simulation Experiment Setup: We build the simulation environment with the Gibson dataset [54]. The Gibson dataset is visually realistic, since it consists of reconstructions of real-world scenes [5]. As shown in Figure 15, we finally selected nine simulation scenes by excluding scenarios that contain multiple floors and empty rooms. In the simulation experiments, the baseline methods are the state-of-the-art learning-based navigation methods, including Neural Topological SLAM (NTS) [5], Active Neural SLAM (ANS) [6], and Metric Spatial Map + RL (MSMRL) [7]. All of these methods use RGBD camera settings. MSMRL is an end-to-end navigation method based on the local metric map constructed by geometric projections of depth images, and it performs navigation decisions using Reinforcement Learning (RL). ANS is a baseline that integrates metric map and learning-based navigation method to perform agent movement control. NTS models the environment as a topological map. However, different from our method in this paper, it performs navigation through retrieval image goals. Following the method in NTS [5], the test scenarios in this paper are classified as easy, medium, and difficult, depending on the distance between the start and end locations, which are: Easy (1.5-3 m), Medium (3-5 m), and Hard (5-10 m). • Real-World Experiment Setup: As shown in S1, S2, and S3 of Figure 16, in the realworld experiment, we choose two typical indoor environments to evaluate our method, including weakly textured corridors and offices. During the experiments, we set up three navigation tasks with different difficulties, named Test 1, Test 2, and Test 3, as shown in Figure 16. Test 1 is continuous navigation with no obstacles and has multiple landmarks. Test 2 is a more challenging navigation with dynamic obstacles that do not exist in the constructed map. Test 3 is the long-distance navigation for large scenarios, which is the most challenging task for most existing navigation methods. We use a four-wheeled mobile robot to record RGB-D images and inertial measurement unit (IMU) measurements, as shown in Figure 17. It is equipped with an Xtion RGBD camera and an inertial measurement unit (IMU). The RGBD camera returns a regular RGB image and a depth image that is used for real-time semantic landmark detection.
The IMU returns high-frequency inertial guidance data for magnetic declination detection.
In the real-world experiment, to further evaluate the proposed navigation method, we compare our method to a classical navigation method (called OGMADWA), which is combined by a high-precision Occupancy Grid Map [7], the global path-searching method A* [37], and the local path planning method DWA [55]. (S1) Figure 16. Example of navigation in the real world. (S1-S3) show the test scenes of our method in the real world. (T1-T3) correspond to the topological semantic maps of the three scenes, respectively, which are generated by our previous work [13]. (G1-G3) correspond to the high-precision occupancy grid maps of the three scenes, respectively, which are generated by the OGMADWA method. Figure 17. A four-wheeled mobile robot platform called BootBot. It is equipped with an Xtion RGBD camera and an inertial measurement unit (IMU). These sensors are placed 20 cm above the middle of the body.

Evaluation Metrics
In this paper, we use the navigation Success Rate (SR) and Success weighted by inverse Path Length (SPL) as the evaluation metrics to compare the performance of our method with other methods. SPL is one of the most important metrics for evaluating navigation performance, which considers the effectiveness and efficiency of the agent in accomplishing the navigation task. It is worth noting that only when the motion of the mobile robot in the state space is continuous, the experiment is determined effectively. For a success rate, we set the "arrived" signal, and only if the robot takes a stop action within a 1 m radius of the goal location is it considered to have navigated successfully. In addition, there must be no collisions and no loss during the continuous movement. According to [56], the SPL takes into account the efficiency of the robot in reaching the goal; for detailed descriptions, see Appendix A.

Evaluation on Simulation Data
We first compare the SR and SPL of our method with those of NTS [5], ANS [6], and MSMRL [7] on simulation data. The experimental results are shown in Table 1. Their bar graph representations of SR and SPL in Table 1 are shown in Figure 18a,b, respectively. It can be seen that our method outperforms all methods in all scenarios in terms of both SR and SPL. In addition, the performance improvements of our approach over the baseline increase along with the difficulty of experimental settings (0.06/0.05, 0.31/0.25, and 0.32/0.37). Our proposed method outperforms the MSMRL and ANS because of two potential reasons. First, we perform efficient and accurate object-based heuristic graph search to select paths instead of the image matching-based path retrieval method. The end-to-end path searching used by MSMRL and ANS has high computational complexity, especially in the largescale environment. Second, compared to the metric map-based MSMRL and ANS, our proposed 3D environment modeling approach provides more reliable semantic-geometric features for navigation. The proposed trajectory generation and refinement strategies also improve the effectiveness of visual navigation. We can also see from Table 1 and Figure 18 that our method has better performance than NTS. It is mainly because the NTS is an image-based target-driven navigation method, in which the images do not provide directional information that can be explored for navigation [5], while our method provides multi-attribute information for navigation including the direction and pose of the object. In conclusion, these results of simulation experiments demonstrate the effectiveness of our proposed system. Figure 19 shows an example of our navigation visualization.  Step 1 Step 2 Step 3 Step 4 Goal Start Object 1 Object 2

Achieved goal
Moving toward the goal Figure 19. Visualization results for navigation in simulation scenario. The first step acquires global topological paths through active visual perception and object guidance, and generates segmented smooth trajectories. The second step determines the goal position by goal object evaluation and it generates segmented smooth trajectories. The third step moves toward the target location. The fourth step moves to within 1 meter of the goal object and publishes an "arrived" command to end the navigation.

Evaluation on Real-World Data
We further evaluate the performance of our method by comparing it with the wellknown OGMADWA method in the real-world environment. Figure 16 visualizes the navigation environment. T1, T2, and T3 in Figure 16 correspond to the topological semantic maps of the three scenes, respectively, which are generated by our previous work [13]. G1, G2, and G3 in Figure 16 correspond to the high-precision occupancy grid maps of the navigation tasks respectively, which are generated by the OGMADWA method. Figure 20 shows the results of the visualization with the example of navigation in the S1 scene. We still choose SR and SPL as the evaluation metrics in the real-world experiment. For all experimental results, we report the median of 20 times.
The SR and SPL are shown in Table 2 and Figure 21. According to [56], the value of SPL greater than 0.5 indicates good navigation performance. It can be seen from Table 2 and Figure 21 that the SPL of both methods are greater than 0.5, which indicates that both methods have good navigation performance. We can still observe that our method outperforms the OGMADWA method. For example, as can be seen from the "Overall" column in Table 2, our method has similar SR with the OGMADWA method but has higher SPL than OGMADWA. This demonstrates the effectiveness and advantages of the proposed navigation method, since we do not need precise metric maps. We can further concretely observe the superiority of our method. In Test 1 and challenging Test 2, these two methods have the same SR, but our method has better performance in terms of SPL. There are two reasons. On one hand, we perform object-based topological path planning instead of prise metric map-based path planning. For the OGMADWAW method, when it comes to the multiple objects in S1 and dynamic objects in S2, the accuracy of the metric map decreases, and the complexity of the path planning method increases. Thus, its navigation performance decreases. As for our method, the proposed topological semantic map uses the semantic-geometric information and topological associations of objects, which has better scalability. On the other hand, our proposed object semanticgeometric feature-based heuristic evaluation strategy and Bernstein's polynomial trajectory refinement method improve the efficiency of visual navigation. Particularly, unlike the A* algorithm, the proposed object-based heuristic path search method avoids over-dependence on high-precision metric maps. It is worth noting that for Test 3, we successfully navigate in a 26.2 m long weakly textured corridor, as shown in Figure 16S3. It can be seen that both methods achieve similar navigation performance, while OGMADWA achieves slightly higher on SR/SPL than our method. It is mainly because there are fewer available objects from the starting location to the goal location, which greatly challenges our method.
In summary, the above results show that the proposed method has a better overall performance compared to OGMADWA for the same configuration. Step 1 Step 2 Step 3 Step 4 Object 1 Object 2 Object 3 Goal Object Figure 20. Visualization results for navigation in the S1 scene. We select an object-level topological path after constructing the lightweight topological semantic map. Then, the robot is guided from step one to step four to generate a segmented smooth trajectory to the next invisible object based on the currently visible object until it moves within 1 meter of the goal object and publishes an "arrived" command to end the navigation.

Discussion
From the above results, we have thoroughly verified the advantages of object-based global path planning. Then, for some special cases, we can discuss the limitations of our current approach. We represent the environment as an abstract topological graph in which the topological node and topological edge represent object-level landmarks and the semantic association properties between object-level landmarks, respectively. We model the environment mainly based on long static object-level landmarks. If the object in the model is dynamic and movable, then the topological relationship between the currently moving object and its neighbors changes, and the robot is at risk of being lost.
To effectively discuss and analyze the above question, we set up three cases to carry out navigation experiments, as shown in Figure 22. Case 1 shows that the robot can renavigate through the surrounding red chairs after moving the yellow chair. The difference between Case 2 and Case 3 is that the distances between the goal object (blue chair) and the object being moved (yellow chair) are different. Case 3 shows that there is no detectable object around the robot and there is a risk of navigation failure after moving the yellow chair. Figure ( Figure 22 indicate scenes after moving the yellow chair. We performed navigation experiments in these three cases and statistically calculated the SR and SPL of navigation respectively, as shown in Table 3. Their bar graph representations of SPL in Table 3 are shown in Figure 23. It can be seen from the experimental results that in Case 1 and Case 3, when the goal object in the map is moved, the SR and SPL of navigation decrease. In Case 1, although the topological relationship between the yellow chair and its neighboring objects disappears, the remain objects still maintain the original topological relationship well. Therefore, the robot can re-plan paths and regenerate a trajectory to the location of the goal object by searching for the surrounding red chair or goal object. In Case 2, when observable objects exist around the goal object, the robot can replan and re-navigate by searching for surrounding objects. In Case 3, when no observable objects exist around the goal object at all, the robot is lost, just as humans become lost. For the above problem, to perform effective navigation, we set five states for the robot, including initialization, waiting for the goal object, regenerating the trajectory, re-planning the path, and moving. The robot can online switch states according to the actual navigation situation. . Navigation examples when objects in the environment models are moved. Yellow lines indicate the topological associations between objects. Case 1 shows that the robot can re-navigate through the surrounding red chairs after moving the yellow chair. The difference between Case 2 and Case 3 is that the distances between the goal object (blue chair) and the object being moved (yellow chair) are different. Case 3 shows that there is no detectable object around the robot and there is a risk of navigation failure after moving the yellow chair. (a,c,e) indicate the scenes before moving the yellow chair. (b,d,f) indicate the scenes after moving the yellow chair.

Conclusions and Future Work
In this paper, we propose an object-based visual navigation method that can greatly improve navigation performance under the large-scale challenging environment. Firstly, a novel object-level semantic topological map is constructed to model the environment in a lightweight way. Then, we propose an object-based heuristic graph based path-searching method to obtain an optimal global topological path by combining semantic-geometric feature-based heuristic evaluation and topological association-based weight constraints. After that, to reduce the cumulative error of global navigation, active visual perception and object guidance are combined to segment global topological paths. Finally, we propose a new Bernstein polynomial refinement strategy to generate smooth and feasible navigation trajectories by parameterizing segmented paths. Experiments on 3D object feature detection in complex real-world scenes such as illumination changes and partial occlusions verify the high robustness of the proposed method. In addition, the effectiveness and reliability of the proposed global path planning method is verified by cumulative translation error and computational time cost experiments. The reliability of the proposed method is further proved by comparison experiments with SR and SPL of state-of-the-art methods under simulation and real-world scenarios. Despite the great results, some promotions are needed in the future. We will use a lightweight object detection network to improve the accuracy and reduce the computational complexity of the system as well as deploy the system on an embedded platform. We will also continue to improve the robot's navigation performance in the case of dynamic changes in relationships between objects. Furthermore, we will explore the proposed method in more large-scale real-world navigation tasks.