Article

Fusing Horizon Information for Visual Localization

Cheng Zhang, Yuchan Yang, Yiwei Wang, Helu Zhang and Guangyao Li
1 College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
2 Lotus Robotics, Hangzhou 310051, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
AI 2025, 6(6), 121; https://doi.org/10.3390/ai6060121
Submission received: 24 April 2025 / Revised: 22 May 2025 / Accepted: 5 June 2025 / Published: 10 June 2025

Abstract

Localization is the foundation and core of autonomous driving. Current visual localization methods rely heavily on high-definition (HD) maps, which are not only costly but also have poor real-time performance. Place recognition is equally crucial in autonomous driving, yet existing place recognition methods are deficient in local feature extraction, and position and orientation errors can occur during the matching process. To address these limitations, this paper presents a robust multi-dimensional feature fusion framework for place recognition. Unlike existing methods such as OrienterNet, which process images and maps homogeneously at the underlying feature level and neglect modal disparities, our framework, which operates on existing 2D maps and is inspired by OrienterNet, introduces a heterogeneous structural-semantic approach. It employs structured Stixel features (containing positional information) to capture image geometry, while representing the OSM environment through polar-coordinate building distributions, with dedicated encoders designed for each modality. In addition, global relational features are generated by computing the distances and angles between the current position and the building pixels in the map, providing the system with detailed spatial relationship information. Individual Stixel features are then rotationally matched against these global relations to achieve feature matching at diverse angles. Whereas the BEV map matching in OrienterNet relies primarily on horizontal image information, the proposed method performs matching based on vertical image information while fusing horizontal cues to complete place recognition. Extensive experimental results demonstrate that the proposed method significantly outperforms the compared state-of-the-art approaches in localization accuracy, effectively resolving the above limitations.

1. Introduction

The innovation of localization technologies has propelled the transformation of autonomous driving and travel modalities, with a continuous evolution from GPS to high-precision sensor-based positioning. The Global Navigation Satellite System (GNSS) can locate vehicles efficiently and precisely; however, higher precision incurs greater cost, and GNSS is readily subject to environmental disturbances. To reduce the dependence on GNSS and improve positioning reliability in complex environments, positioning methods based on sensors such as vision and LiDAR [1] have been widely studied. Visual localization determines the location and orientation of the camera in the scene from a given query image; it has become a fundamental task in computer vision and robotics and is very important in autonomous driving, robot navigation, augmented reality, and other scenarios [2].
Sarlin et al. first proposed an end-to-end algorithm called OrienterNet [3], which estimates the location and orientation of the query image by matching the Bird's-Eye View (BEV) with OpenStreetMap (OSM), without an HD map [4]. Owing to the advantages of OSM, such as global consistency, a lightweight map-building process, and a large amount of publicly available road information, Muñoz-Bañón et al. used OSM information as the environmental representation for global planning during global navigation [5,6].
However, as shown in Figure 1, traditional deep learning algorithms cannot effectively extract the deep information of the image during feature extraction: the resolution of the deep features is greatly reduced, key details are lost, and the feature map exhibits relatively large blurred regions. In addition, the OrienterNet algorithm suffers from position and orientation errors when matching the BEV against the OSM map.
We propose a new algorithm to solve the above problems. During feature extraction, the input image is deeply analyzed to extract Stixel [7] features, and corresponding Stixel features are extracted from the OSM. This not only enhances the feature representation ability but also makes the features more discriminative, which benefits the subsequent matching process. We then propose a partitioning method based on distance and angle: during feature matching, the process is divided into different intensity levels according to the distances and angles involved, and individual Stixel features are rotationally matched against the global relational features to achieve feature matching from different angles. The relationship between the image and the global features is thus considered more meticulously, and the information contained in the distances and angles is fully exploited, yielding more accurate and reliable matching.

2. Related Work

In autonomous driving, visual localization provides the position and attitude of the vehicle in the global environment; it effectively establishes a "coordinate system" for the vehicle, telling it where it is and in which direction it is heading. Traditional visual localization mainly establishes a global reference frame and determines the position and orientation of the vehicle by comparing features between camera images and a map or model, but it is weak at perceiving local environmental details. The Stixel representation, in contrast, focuses on perceiving the local environment around the vehicle: by extracting features from camera images, it can identify road obstacles and obtain road feature information, refining the analysis of the surroundings on top of the global position given by visual localization. In short, visual localization tells the vehicle "where I am", while the Stixel representation further answers "what is around me", providing more specific environmental information for safe driving and path planning and complementing the deficiency of visual localization in environmental detail. Feature matching then determines the vehicle's position coordinates by comparing features in the sensor data with those in the map or model. It relies on the global position framework provided by visual localization and on the local environmental features extracted by the Stixel representation, and it further optimizes and refines the vehicle's position estimate: using the information provided by the former two, a more meticulous comparison and matching procedure can determine the vehicle's position more accurately in complex environments.

2.1. Visual Localization

Sarlin et al. proposed OrienterNet [3], which estimates the pose of query images by matching the Bird's-Eye View with the available maps in OpenStreetMap, achieving good results in positioning based on lightweight navigation maps. Samano et al. [8] utilized low-dimensional embedding spaces to geolocate panoramic images on two-dimensional navigation maps [9]. Park et al. [10] significantly enhanced the model's spatial recognition and structural understanding of lane lines by using BEV features [11]. However, the 2D projection of the BEV, based on the orthographic projection principle, inevitably loses height information and therefore cannot accurately represent multi-layered structures. Moreover, large nearby occluding objects create extensive blind areas, severing the transmission of spatial information and undermining the integrity of far-end features. Therefore, simply matching BEV image features still results in certain directional errors, as in OrienterNet [3]. He et al. proposed an end-to-end positioning network called EgoVM [12], which uses lightweight vectorized maps instead of heavy point-based maps to achieve precise self-positioning [13]. While vector maps are useful in many cases, in some scenarios they have difficulty adapting to complex road conditions such as temporary structures, lack the ability to adapt to the environment, and their semantic information fails to match ever-changing scenes. Wu et al. [14] adopted coarse-to-fine feature matching and hierarchical strategies to achieve sub-meter positioning using only navigation maps, fusing surround-view images with navigation maps and aligning BEV and map features. Sarlin et al. [15] proposed a self-supervised neural map method for visual localization and semantic understanding: through self-supervised learning, the model automatically learns a feature representation of the environment without large amounts of annotated data, achieving accurate pose estimation and an in-depth understanding of scene semantics [16]. Zhao et al. proposed PNeRFLoc [2], a visual localization framework that combines neural radiance fields with the visual localization task; by optimizing the parameters of the neural radiance field, the model accurately estimates the position and attitude of the camera in the scene from the input image, and the strong modeling ability of the neural radiance field improves positioning accuracy and robustness. Recently, some methods have leveraged Transformer architectures for localization. TransGeo [17], the first pure Transformer-based model for cross-view localization, employs attention-guided cropping to optimize the alignment between street-view and aerial images, achieving high accuracy but suffering from weak cross-domain generalization and heavy data dependency. In contrast, LoFTR [18] uses Transformers to enable detector-free dense feature matching, addressing feature matching in low-texture or occluded scenes, yet it shows limited adaptability to extreme scale variations or irregular viewpoints. LocNet [19] achieves efficient matching between LiDAR point clouds and static maps via Siamese networks, but it suffers from poor robustness in dynamic environments, sensitivity to environmental interference, and ambiguity in low-texture scenarios.
TransFusion [20] achieves cross-modal spatial alignment between images and point clouds via Transformer self-attention, but it suffers from failure in sparse scenarios and insufficient semantic fusion.

2.2. Stixel Representation

In the complex scenarios of autonomous driving, extracting Stixel features from camera images is of great significance. It can efficiently identify various potential obstacles on the road, and the effect is even better when combined with a binocular vision system. It can acquire stereo information to assist in estimating depth and distance, improving the accuracy and reliability of obstacle detection and ensuring the safety of vehicles. In the process of road environment perception, dividing the road area into Stixels can extract key features such as the texture and color of the road, helping vehicles to recognize information such as lane markings and providing basic data for path planning and navigation. Moreover, Stixels show strong robustness in the face of complex situations such as illumination changes and partial occlusions.
On the one hand, as shown in Figure 1, the resolution of high-level features in traditional deep learning decreases and key details are lost, whereas the Stixel feature map better preserves depth, edge, and height information, distinguishes regions through color, and is more expressive. On the other hand, the static map information of OSM provides relatively simple environmental information, while the Stixel representation reflects environmental features more comprehensively. Thus, deriving Stixel features from both the input image and the OSM (OpenStreetMap) image not only strengthens the feature representation capability but also makes these features more distinctive for effective matching.
Moreover, for the angular matching of Stixel features, it can more meticulously consider the relationship between the image and the global features. The angular feature matching has good perspective invariance and scale adaptability, performs better in handling distant objects and maintaining boundary details, effectively weakens the influence of local errors, and enhances the adaptability to occluded and sparse scenes.

2.3. Feature Matching in Place Recognition

Positioning technology provides a framework and range for place recognition. Before place recognition, it is necessary to know the approximate area in order to search for features. Feature matching determines the position coordinates by comparing the features in the sensor data with those in the known map or model. In autonomous driving, the real-time point cloud data [21] is matched with the high-precision map data. For example, the point cloud models of objects such as roads have been stored in the map. LiDAR distinguishes different objects according to the reflection intensity of laser signals. In addition to the geometric features, the reflection intensity features can also be used in the matching process. In the method we have proposed, the crucial step of feature extraction is first carried out [22]. Specifically, an in-depth analysis is conducted on the input image to extract the Stixel features therein. At the same time, the corresponding Stixel features are also extracted from the global data of OpenStreetMap (OSM). Subsequently, in the critical stage of rotational matching between a single image and the global features, we introduced a partitioning method based on distance and angle differences. According to different distance and angle situations, the matching process is divided into different intensity levels. In this way, the relationship between the image and the global features can be considered more meticulously, and the information contained in the distance and angle can be fully utilized, so as to achieve a more accurate and reliable matching effect.

3. Method

The visual localization method proposed in this paper effectively fuses multiple types of information to achieve precise positioning through four closely integrated modules. As illustrated in Figure 2, the framework incorporates two critical matching processes. First, the BEV-Map Matching module aligns the Neural BEV (generated by Image-CNN) with the Neural Map (encoded by Map-CNN) to compute feature correspondence and similarity. Second, the Angle Rotate Matching module leverages the OSM Relationship Map—encoding angular and distance relationships between the current position and map elements—to perform rotational feature matching, optimizing the alignment across different viewpoints. Firstly, the Image-CNN module extracts key features from the input image and converts them into the orthographic BEV form T. The Map-CNN module encodes the OSM map to generate the neural map F, which contains rich geographical and semantic information. Subsequently, the BEV-Map Matching module conducts a comprehensive and detailed matching between T and F, and then infers the camera pose, providing a basis for device positioning. Finally, the Image-OSM-Stixel module processes the image to obtain Stixel pixels and their features. By combining with the OSM map, it calculates the spatial relationship features and realizes feature matching at multiple angles through rotational operations. These four modules cooperate with each other to jointly construct a multi-modal feature fusion and matching framework. In the process of matching images of different dimensions, the localization problem is transformed into the estimation of the Three Degrees of Freedom (3-DoF) pose.
In our method, we use Fourier-domain convolution for feature matching instead of direct spatial correlation. This offers key advantages: computational efficiency improves from $O(n^2)$ to $O(n \log n)$ via the Fast Fourier Transform, which is crucial for high-dimensional feature maps; the magnitude spectrum of the Fourier transform provides better rotation invariance for handling viewpoint changes; multi-scale analysis is possible by examining different frequency components; and a single operation can search all possible translation parameters. These properties enable more robust and efficient feature matching, especially in complex urban environments with large viewpoint changes and partial occlusions.
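To make the frequency-domain matching concrete, the following is a minimal NumPy sketch of correlation via the convolution theorem; the helper name fft_correlate, the tensor shapes, and the random inputs are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def fft_correlate(map_feat: np.ndarray, bev_feat: np.ndarray) -> np.ndarray:
    """Correlate a (C, H, W) neural map with a smaller (C, h, w) BEV template
    over all translations at once via the convolution theorem."""
    C, H, W = map_feat.shape
    # Zero-pad the template to the map size so both spectra align.
    padded = np.zeros_like(map_feat)
    padded[:, :bev_feat.shape[1], :bev_feat.shape[2]] = bev_feat
    # Circular correlation = inverse FFT of (map spectrum * conjugate template spectrum),
    # summed over channels: O(n log n) instead of O(n^2) per translation search.
    spec = np.fft.rfft2(map_feat) * np.conj(np.fft.rfft2(padded))
    return np.fft.irfft2(spec.sum(axis=0), s=(H, W))

# Repeating the correlation for K rotated copies of the BEV yields a (K, H, W)
# score volume over the 3-DoF pose space (x, y, yaw).
rng = np.random.default_rng(0)
scores = fft_correlate(rng.standard_normal((8, 128, 128)),
                       rng.standard_normal((8, 64, 64)))
print(scores.shape)  # (128, 128)
```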

3.1. Image-CNN

The CNN $\Phi_{\text{image}}$ extracts a feature map $\mathbf{X} \in \mathbb{R}^{U \times V \times R}$ of size $U \times V$ from the image. The image features are then converted into a polar representation and subsequently mapped onto a Cartesian grid; the coordinate transformation process is shown in Figure 3. Finally, the small CNN $\Phi_{\text{BEV}}$ processes the result to obtain the neural Bird's-Eye View (BEV) $\mathbf{T}$ and confidence $\mathbf{C}$. A pixel probability distribution $\alpha_{u,d}$ is predicted for each polar cell $(u, d)$, treating the image as a set of columns. For each column, the angle between the column and the image center, together with the distance from the pixels in this column to the image center, determines the corresponding ray in the polar coordinate system, yielding $\mathbf{X} \in \mathbb{R}^{U \times D \times V}$. Each pixel is then converted into the corresponding point in the polar coordinate system according to its column and row position, and the polar grid is resampled onto a Cartesian grid of size $L \times D = 32 \times 32$ m with resolution $\Delta = 50$ cm. The CNN is ResNet-101; since converting the image features into the polar representation samples $D$ depth planes, depth-related computations are involved and depth features are generated.
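The polar-to-Cartesian resampling step can be sketched with torch.nn.functional.grid_sample; the field of view, the linear depth spacing, and the helper polar_to_cartesian are assumptions made for illustration, not the exact projection used in the paper.

```python
import torch
import torch.nn.functional as F

def polar_to_cartesian(polar_feat: torch.Tensor, fov_deg: float = 90.0,
                       grid_size: int = 64, max_depth: float = 32.0) -> torch.Tensor:
    """Resample a polar feature volume (N, C, D, U), with D depth bins per image
    column, onto a Cartesian BEV grid of grid_size x grid_size cells."""
    n, c, d, u = polar_feat.shape
    # Metric coordinates of every BEV cell (x lateral, z forward).
    xs = torch.linspace(-max_depth / 2, max_depth / 2, grid_size)
    zs = torch.linspace(0.0, max_depth, grid_size)
    zz, xx = torch.meshgrid(zs, xs, indexing="ij")
    # Ray angle and range of each cell, normalized to [-1, 1] for grid_sample.
    theta = torch.atan2(xx, zz)                 # angle from the optical axis
    rho = torch.sqrt(xx ** 2 + zz ** 2)
    u_norm = theta / torch.deg2rad(torch.tensor(fov_deg / 2))
    d_norm = 2 * rho / max_depth - 1
    grid = torch.stack([u_norm, d_norm], dim=-1).expand(n, -1, -1, -1)
    return F.grid_sample(polar_feat, grid, align_corners=False)

bev = polar_to_cartesian(torch.randn(1, 8, 32, 64))
print(bev.shape)  # torch.Size([1, 8, 64, 64])
```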

3.2. Map-CNN

Here, we focus on the OSM encoding part. The Map-CNN module encodes OSM into a neural map F containing semantic and geographic information useful for localization. In this stage, geographical elements such as areas, lines, and points in OpenStreetMap are rasterized into three-channel images at a fixed ground sampling distance of 50 cm/pixel. Subsequently, each semantic class is associated with an N-dimensional embedding vector obtained through learning. On this basis, a feature map is obtained with dimensions W × H × 3N.
Finally, a convolutional neural network $\Phi_{\text{map}}$, built on a VGG-16 encoder, encodes the feature map, producing a neural map $\mathbf{F}$ with dimensions $W \times H \times N$. Meanwhile, $\Phi_{\text{map}}$ predicts a unary location prior $\Omega$ for each cell of the map, with dimensions $W \times H$.
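A toy version of this rasterize-embed-encode pipeline is sketched below; the class name MapEncoder, the number of semantic classes, and the shallow convolutional stack (standing in for the VGG-16 encoder) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MapEncoder(nn.Module):
    """Toy stand-in for Map-CNN: embeds rasterized OSM classes and encodes them
    into an N-channel neural map plus a unary location prior."""
    def __init__(self, n_classes: int = 32, n_dim: int = 8):
        super().__init__()
        # One learned embedding per semantic class, shared by the three raster
        # layers (areas, lines, points).
        self.embed = nn.Embedding(n_classes, n_dim)
        self.encoder = nn.Sequential(   # VGG-style conv stack, much smaller here
            nn.Conv2d(3 * n_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_dim + 1, 3, padding=1),
        )

    def forward(self, raster: torch.Tensor):
        # raster: (B, 3, H, W) integer class indices at 50 cm/pixel.
        emb = self.embed(raster)                                  # (B, 3, H, W, n_dim)
        b, _, h, w, _ = emb.shape
        emb = emb.permute(0, 1, 4, 2, 3).reshape(b, -1, h, w)     # (B, 3*n_dim, H, W)
        out = self.encoder(emb)
        neural_map, log_prior = out[:, :-1], out[:, -1]           # F and Ω
        return neural_map, log_prior

f_map, omega = MapEncoder()(torch.randint(0, 32, (1, 3, 128, 128)))
print(f_map.shape, omega.shape)  # (1, 8, 128, 128) (1, 128, 128)
```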

3.3. Image-OSM-Stixel

We propose a hybrid Stixel extraction method that combines traditional rule-based Stixel generation with modern deep learning-based feature extraction. Specifically, Stixel generation employs a traditional segmentation method based on depth variation, preserving computational efficiency and interpretability, while feature extraction combines convolutional neural networks, statistical analysis, and the Fourier transform to enhance the feature representation ability.
First, the image-Stixel representation is obtained. The image $I$ is input, and a depth map $D(x, y)$ is generated through edge detection, providing distance information for each pixel. In the implementation, we apply the Canny operator (with thresholds 100 and 200) to detect edges and then dilate the result with a $5 \times 5$ kernel. The relative depth value of each pixel is computed as $D(x, y) = \frac{255 - \text{edge\_density}(x, y)}{255}$, establishing a mapping between image structure and depth. The image $I$ is divided into vertical strips along its width. For each vertical strip $t$, the depth values $d_t$ of the current column are considered and depth discontinuities are detected, with the rate of depth change given by $\Delta d = |D(x, y) - D(x, y+1)|$. The averages of the depth values and color values of the current column give its depth and color distributions, representing the average depth and color of the pixels in that column. According to the boundary points of depth changes, the column is divided into several vertical regions, each representing a Stixel; a Stixel sequence $A_i = \{a_1, a_2, \ldots, a_k\}$ is generated, where $a_j = (x, y_{\text{bottom}}, y_{\text{top}}, w, \text{depth}, \text{color})$.
The color classification process of a Stixel is to classify it into ground, near objects, medium-distance objects, or distant objects according to the depth value and position attributes of each Stixel.
$\text{color} = \begin{cases} (0, 255, 0), & \text{if } y_{\text{bottom}} = \text{height} - 1 \ (\text{ground}) \\ (255, 0, 0), & \text{if depth} < 0.3 \ (\text{near}) \\ (0, 0, 255), & \text{if } 0.3 \le \text{depth} < 0.6 \ (\text{medium}) \\ (255, 255, 0), & \text{if depth} \ge 0.6 \ (\text{distant}) \end{cases}$
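A minimal OpenCV/NumPy sketch of the Stixel generation and color classification described above follows; the strip width, the depth-jump threshold, the box-filter window used for edge density, and the helper names are assumptions for illustration only.

```python
import cv2
import numpy as np

def generate_stixels(image_bgr: np.ndarray, strip_width: int = 8, jump: float = 0.15):
    """Split the image into vertical strips and cut each strip at discontinuities
    of a Canny-based depth proxy; returns (x, y_bottom, y_top, width, depth, color)
    tuples as described in the text."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.dilate(cv2.Canny(gray, 100, 200), np.ones((5, 5), np.uint8))
    # Local edge density -> relative depth proxy D(x, y) = (255 - density) / 255.
    density = cv2.blur(edges, (15, 15)).astype(np.float32)
    depth = (255.0 - density) / 255.0
    h, w = depth.shape
    stixels = []
    for x0 in range(0, w, strip_width):
        col = depth[:, x0:x0 + strip_width].mean(axis=1)   # per-row depth of the strip
        cuts = [0] + [y for y in range(1, h) if abs(col[y] - col[y - 1]) > jump] + [h]
        for y0, y1 in zip(cuts[:-1], cuts[1:]):            # one Stixel per segment
            d = float(col[y0:y1].mean())
            stixels.append((x0, y1 - 1, y0, strip_width, d, classify(d, y1 - 1, h)))
    return stixels

def classify(depth: float, y_bottom: int, height: int):
    """Color code from the piecewise rule above."""
    if y_bottom == height - 1:
        return (0, 255, 0)        # ground
    if depth < 0.3:
        return (255, 0, 0)        # near
    if depth < 0.6:
        return (0, 0, 255)        # medium
    return (255, 255, 0)          # distant

stx = generate_stixels((np.random.rand(240, 320, 3) * 255).astype(np.uint8))
print(len(stx))
```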
The process of converting Stixel representation into tensors is as follows:
$T_i = [\, x_i,\ y_{\text{bottom},i},\ y_{\text{top},i},\ \text{width}_i,\ \text{depth}_i,\ R_i,\ G_i,\ B_i \,]$
Use CNN to extract features of A:
$F_{\text{CNN}} = \text{CNN}(A) \in \mathbb{R}^{c}$
The statistical feature extraction process is as follows:
$F_{\text{STAT}} = [\, \text{mean},\ \text{std},\ \text{median},\ \ldots,\ \text{percentile}_{25},\ \text{percentile}_{75} \,] \in \mathbb{R}^{s}$
The image-Stixel feature can be expressed as:
$F_{\text{stixel}} = F_{\text{CNN}} \oplus F_{\text{STAT}}$
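The assembly of $F_{\text{stixel}}$ from a CNN branch and a statistics branch might look as follows; the 1D-CNN head, the feature sizes, and the class name StixelFeatureExtractor are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StixelFeatureExtractor(nn.Module):
    """Turns a stixel list into F_stixel = F_CNN ⊕ F_STAT."""
    def __init__(self, in_dim: int = 8, c_dim: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, c_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # pool over stixels -> (B, c_dim, 1)
        )

    def forward(self, stixel_tensor: torch.Tensor) -> torch.Tensor:
        # stixel_tensor: (B, K, 8) rows of [x, y_bottom, y_top, width, depth, R, G, B].
        f_cnn = self.cnn(stixel_tensor.transpose(1, 2)).squeeze(-1)     # (B, c_dim)
        # Statistics over the stixel set (mean, std, median, 25th/75th percentiles).
        q = torch.quantile(stixel_tensor, torch.tensor([0.25, 0.5, 0.75]), dim=1)
        f_stat = torch.cat([stixel_tensor.mean(1), stixel_tensor.std(1),
                            q[1], q[0], q[2]], dim=-1)                  # (B, 5*8)
        return torch.cat([f_cnn, f_stat], dim=-1)                       # F_CNN ⊕ F_STAT

feats = StixelFeatureExtractor()(torch.randn(2, 50, 8))
print(feats.shape)  # torch.Size([2, 72])
```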
Secondly, the OSM-Stixel features are obtained. Given the prior position $P(u, v)$, a grid map $M$ with dimensions $L \times H \times W$ is received, and the orientation angle $\varphi$ is computed from $P(u, v)$ and $M$. The positions of building pixels are extracted from the grid map to generate a set of building coordinates $B$. Through polar-coordinate conversion, the angles $\Theta = [\theta_1, \theta_2, \ldots, \theta_n]$ and distances $R = [r_1, r_2, \ldots, r_n]$ are obtained. To comprehensively capture the spatial distribution of buildings, the $360°$ space is divided into 36 uniform intervals of $10°$ each, indexed by $k \in [0, 35]$. The weighted angular distribution feature is defined as
$O_w[k] = \sum_{\theta_i \in [k\Delta\theta,\ (k+1)\Delta\theta)} r_i$
The unweighted angular distribution feature is defined as:
$O_u[k] = \operatorname{count}(\theta_i), \quad \theta_i \in [k\Delta\theta,\ (k+1)\Delta\theta).$
The normalized weighted distribution is given by:
$O_n[k] = \dfrac{O_w[k]}{\sum_{j=0}^{35} O_w[j]}.$
Within its spatial distribution range, for each angular interval, the maximum distance distribution is calculated as:
$R_{\max}[k] = \max\{\, r_i \mid \theta_i \in [k \cdot \Delta\theta,\ (k+1) \cdot \Delta\theta) \,\}.$
According to the current orientation $\varphi$, the directional distance feature $d_{\varphi}$ and the weighted dominant angle $\theta^{*}$ are derived.
For the $k$-th angular interval $[k \cdot \Delta\theta,\ (k+1) \cdot \Delta\theta)$:

$F_{\text{osm}}[k] = \begin{bmatrix} \sum_{\theta_i \in [k\Delta\theta,\ (k+1)\Delta\theta)} r_i & \text{(i)} \\ \operatorname{count}(\theta_i),\ \theta_i \in [k\Delta\theta,\ (k+1)\Delta\theta) & \text{(ii)} \\ \dfrac{\sum_{\theta_i \in [k\Delta\theta,\ (k+1)\Delta\theta)} r_i}{\sum_{j=0}^{35} \sum_{\theta_i \in [j\Delta\theta,\ (j+1)\Delta\theta)} r_i} & \text{(iii)} \\ \max\{\, r_i \mid \theta_i \in [k\Delta\theta,\ (k+1)\Delta\theta) \,\} & \text{(iv)} \end{bmatrix}$

where (i) is the distance-weighted sum of building pixels, (ii) the pixel count in the angular interval, (iii) the normalized orientation distribution, and (iv) the maximum distance in the angular interval.
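A compact NumPy sketch of these four angular-distribution features, under the assumption that building pixels are given as 2D coordinates in the map frame, is shown below; the helper name osm_polar_features is hypothetical.

```python
import numpy as np

def osm_polar_features(buildings_xy: np.ndarray, pos: np.ndarray, n_bins: int = 36):
    """Angular distribution features of building pixels around the prior position:
    weighted sum O_w, count O_u, normalized O_n, and per-bin max distance R_max."""
    delta = buildings_xy - pos                         # vectors from position to pixels
    r = np.hypot(delta[:, 0], delta[:, 1])
    theta = np.degrees(np.arctan2(delta[:, 1], delta[:, 0])) % 360.0
    bins = (theta // (360.0 / n_bins)).astype(int)     # 10-degree intervals
    o_w = np.bincount(bins, weights=r, minlength=n_bins)        # (i) distance-weighted sum
    o_u = np.bincount(bins, minlength=n_bins).astype(float)     # (ii) pixel count
    o_n = o_w / max(o_w.sum(), 1e-9)                            # (iii) normalized distribution
    r_max = np.zeros(n_bins)                                    # (iv) max distance per bin
    np.maximum.at(r_max, bins, r)
    return np.stack([o_w, o_u, o_n, r_max], axis=0)             # (4, 36) -> F_osm

features = osm_polar_features(np.random.rand(500, 2) * 100, np.array([50.0, 50.0]))
print(features.shape)  # (4, 36)
```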
To ensure that the Stixel feature and the OSM feature are consistent in dimensionality, we require $\dim(F_{\text{stixel}}) = \dim(F_{\text{osm}})$: a linear projection layer maps features of different dimensions into a common feature space, and batch normalization is then applied so that the features of the two modalities are comparable in numerical range.
After alignment, one of the following methods is adopted for feature fusion (a minimal sketch follows this list):
Feature concatenation: $F_{\text{fused}} = [\, F_{\text{stixel}} \,\|\, F_{\text{osm}} \,]$,
Weighted fusion: $F_{\text{fused}} = \alpha \cdot F_{\text{stixel}} + (1 - \alpha) \cdot F_{\text{osm}}$,
Attention mechanism: $F_{\text{fused}} = \operatorname{Attention}(F_{\text{stixel}}, F_{\text{osm}})$.
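The sketch referenced above: a possible projection-and-fusion module covering the three options, with illustrative dimensions and a learnable weight α; it is not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Projects F_stixel and F_osm to a shared dimension, batch-normalizes them,
    and fuses them by concatenation, weighted sum, or cross-attention."""
    def __init__(self, d_stixel: int, d_osm: int, d_common: int = 64, mode: str = "concat"):
        super().__init__()
        self.proj_s = nn.Sequential(nn.Linear(d_stixel, d_common), nn.BatchNorm1d(d_common))
        self.proj_o = nn.Sequential(nn.Linear(d_osm, d_common), nn.BatchNorm1d(d_common))
        self.attn = nn.MultiheadAttention(d_common, num_heads=4, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable weight for "weighted" mode
        self.mode = mode

    def forward(self, f_stixel: torch.Tensor, f_osm: torch.Tensor) -> torch.Tensor:
        s, o = self.proj_s(f_stixel), self.proj_o(f_osm)
        if self.mode == "concat":
            return torch.cat([s, o], dim=-1)
        if self.mode == "weighted":
            return self.alpha * s + (1 - self.alpha) * o
        # attention: let the stixel feature attend to the OSM feature
        fused, _ = self.attn(s.unsqueeze(1), o.unsqueeze(1), o.unsqueeze(1))
        return fused.squeeze(1)

fusion = FeatureFusion(d_stixel=72, d_osm=144, mode="weighted")
print(fusion(torch.randn(4, 72), torch.randn(4, 144)).shape)  # torch.Size([4, 64])
```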

3.4. Match

The distance features have significant disadvantages in the matching process, such as strong dependence on the viewing angle, sensitivity to scale, reduced accuracy at long distances, and vulnerability to environmental interference. These lead to large feature differences of the same object under different viewing angles, unstable features of near and far objects, and low quality of feature extraction. In contrast, angle feature matching successfully overcomes these defects with its good viewing angle invariance and scale adaptability. It not only enables the same object to maintain stable feature expressions under different viewing angles and distances but also performs excellently in handling distant objects and maintaining boundary details.
Let $\xi$ denote the camera pose, over which we maintain a probability distribution. For instance, in a complex environment the camera may have several plausible positions and orientations; each corresponds to a possible pose, and these possibilities form multiple peaks of relatively high probability. Such a multi-modal probability distribution makes it easy to fuse the camera pose estimate with additional sensors such as GPS. Computing this distribution is relatively straightforward, because the pose space has been reduced to three dimensions and the distribution is discretized over the map locations, which simplifies the calculation. Sampling the rotation $K$ times at a fixed interval avoids complex continuous computations and improves efficiency. Therefore,
$P(\xi \mid I, \text{map}, \xi_{\text{prior}}) = P(\xi).$
The neural map $\mathbf{F}$ and the BEV $\mathbf{T}$ are thoroughly matched to obtain the score $M$. Each element is calculated by correlating $\mathbf{F}$ with the correspondingly transformed $\mathbf{T}$ after pose transformation; $\xi(p)$ acts as a coordinate transformation that converts the point $p$ from the BEV coordinate frame into the map coordinate frame. The confidence $\mathbf{C}$ plays a screening role: some parts of the BEV space may be unreliable and cause interference, and the confidence reduces their correlation with the rest. $\mathbf{T}$ is rotated $K$ times and a single convolution is performed in the Fourier domain, enabling efficient parallel computation.
$M[\xi] = \dfrac{1}{UZ} \sum_{p \in (U \times Z)} \mathbf{F}[\xi(p)] \cdot (\mathbf{T} \odot \mathbf{C})[p].$
$H_{\text{stixel}}$ counts the orientations of the Stixel features over the entire image, with each Stixel representing the vertical edge of a building. Let the number of Stixels in the image be $M$ and the orientation angle of each Stixel be $\beta_i$, $i = 1, 2, \ldots, M$; the orientation range $[0°, 360°)$ is divided into $K$ intervals (for example, with $10°$ bins, $K = 36$). Based on the Stixel centroid $(x_c, y_c)$ and the image center $(x_{\text{center}}, y_{\text{center}})$:
$\beta_i = \operatorname{arctan2}(y_c - y_{\text{center}},\ x_c - x_{\text{center}}).$
Therefore, the Stixel feature direction of the entire image is
$H_{\text{stixel}}[j] = \dfrac{1}{M} \sum_{i=1}^{M} \delta_{j,\ \lfloor \beta_i / \Delta\beta \rfloor}, \quad j = 1, 2, \ldots, K.$
The symbol δ here represents the Kronecker delta. It is equal to 1 when the two parameters match and is 0 otherwise. It is used to count the number of times an angle falls into a specific interval.
$H_{\text{osm}}$ counts the directions of the building edges in the OSM map and extracts the main orientations of the building outlines. The building boundaries in OSM are represented as the boundary line segments of polygons, each uniquely determined by its two endpoints $(x_1, y_1)$ and $(x_2, y_2)$. The orientation angle $\theta$ (i.e., the edge direction) of each boundary segment is computed as:
$\theta = \operatorname{arctan2}(y_2 - y_1,\ x_2 - x_1).$
$H_{\text{osm}}[j] = \sum_{i=1}^{N} \delta_{j,\ \operatorname{bin}(\theta_i)}, \quad j \in \{1, 2, \ldots, K\}.$
$H_{\text{osm}}(i + \theta)$, where $\theta$ denotes the rotation angle, performs a cyclic shift of the OSM angular histogram; $w_i$ is the weight factor applied to each bin during the angular histogram matching.
$S_{\text{angle}}(\theta) = \sum_i w_i \min\!\left(H_{\text{stixel}}(i),\ H_{\text{osm}}(i + \theta)\right).$
$S_{\text{total}}(x, y, \theta) = (1 - \alpha)\, M(x, y, \theta) + \alpha\, S_{\text{angle}}(\theta).$
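The rotational histogram matching can be sketched as follows; best_rotation and total_score are hypothetical helpers, and the α value used in the example is illustrative (the paper does not restate its value here).

```python
import numpy as np

def best_rotation(h_stixel: np.ndarray, h_osm: np.ndarray, weights=None):
    """Histogram-intersection score S_angle(θ) for every cyclic shift of the OSM
    histogram; returns the best bin shift and its score."""
    k = len(h_stixel)
    w = np.ones(k) if weights is None else weights
    scores = np.array([np.sum(w * np.minimum(h_stixel, np.roll(h_osm, -shift)))
                       for shift in range(k)])
    return int(scores.argmax()), float(scores.max())

def total_score(m_xyt: np.ndarray, h_stixel: np.ndarray, h_osm: np.ndarray,
                alpha: float = 0.3):
    """S_total(x, y, θ) = (1 - α) M(x, y, θ) + α S_angle(θ), broadcast over (x, y)."""
    k = m_xyt.shape[-1]
    s_angle = np.array([np.sum(np.minimum(h_stixel, np.roll(h_osm, -t))) for t in range(k)])
    return (1 - alpha) * m_xyt + alpha * s_angle   # broadcast along the θ axis

shift, score = best_rotation(np.random.rand(36), np.random.rand(36))
print(shift * 10, "degrees, score", round(score, 3))
```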
Lastly, we fuse the two sets of features after matching.
The angle matching method based on Stixel features achieves the enhancement of features from local to global by introducing angle distribution. It naturally combines the directional information of the Stixel itself with angle matching, effectively weakening the influence of local errors and enhancing the adaptability to occlusion and sparse scenes. Meanwhile, this method can realize the efficient cross-modal alignment between the Stixel features of images and OSM maps. It quickly solves the rotation problem through angle distribution, reducing the computational complexity. It is particularly suitable for distinguishing scenes with significant directional differences. Moreover, the lightweight scheme using sparse features in combination with angle distribution significantly reduces the computational and storage costs.
Processing Mechanism for Non-Map Objects in the Scene: This paper implements a multi-level anomaly detection and filtering architecture to handle interfering objects that appear in the scene but are not present in the OSM map. This mainly includes feature confidence weighting based on semantic reliability, an adaptive outlier detector utilizing matching probabilities, graph-model filtering based on k-nearest neighbor spatial consistency, and a multi-hypothesis matching and verification strategy inspired by the RANSAC concept. These mechanisms work in concert to ensure that the system can achieve robust feature matching in complex scenarios containing a large number of dynamic objects (such as vehicles and pedestrians) and unmodeled structures.

3.5. Loss

Throughout the training process, in order to accurately estimate the pose, we utilize the Negative Log Likelihood (NLL) loss. After training and calculation, the predicted pose is p and its corresponding ground truth pose is p * .
The formula for the localization loss is as follows:
$L_{\text{loc}} = -\log P(p^{*} \mid p)$
To address the inherent uncertainties in pose estimation, Gaussian smoothing is employed:
$L_{\text{loc}} = -\log\!\left[\, \mathcal{N}(x^{*} - x, \sigma_{xy}) \cdot \mathcal{N}(y^{*} - y, \sigma_{xy}) \cdot \mathcal{N}(\theta^{*} - \theta, \sigma_{r}) \,\right],$
where $\sigma_{xy}$ and $\sigma_{r}$ are hyperparameters denoting the standard deviations of the position components $(x, y)$ and the rotation component $r$ in the positioning uncertainty model, respectively.
After feature matching, the obtained similarity loss is as follows:
$L_{\text{sim}} = -\dfrac{1}{D} \sum_{i} \dfrac{f_{\text{stixel}}^{i} \cdot f_{\text{osm}}^{i}}{\lVert f_{\text{stixel}}^{i} \rVert\, \lVert f_{\text{osm}}^{i} \rVert}.$
Therefore, in order to obtain a more accurate pose, the loss is updated as follows:
$L_{\text{total}} = L_{\text{loc}} + \lambda \cdot L_{\text{sim}}.$
Systematic dataset experiments show that λ = 0.2 yields the steadiest training curve, fastest convergence, oscillation-free training, and optimal positional-angular accuracy balance.
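A minimal PyTorch sketch of the combined loss under these definitions follows; the Gaussian NLL drops additive constants, the sigma values are placeholders, and the sign convention for L_sim matches the pseudocode below (similarity_loss = −mean(similarity)).

```python
import torch
import torch.nn.functional as F

def gaussian_nll(pred, gt, sigma_xy=1.0, sigma_r=2.0):
    """Smoothed localization loss: negative log of independent Gaussians on the
    position error (x, y) and the wrapped rotation error θ (constants dropped)."""
    dx, dy = pred[..., 0] - gt[..., 0], pred[..., 1] - gt[..., 1]
    dtheta = torch.remainder(pred[..., 2] - gt[..., 2] + 180.0, 360.0) - 180.0
    return (dx ** 2 + dy ** 2) / (2 * sigma_xy ** 2) + dtheta ** 2 / (2 * sigma_r ** 2)

def total_loss(pred_pose, gt_pose, f_stixel, f_osm, lam=0.2):
    """L_total = L_loc + λ L_sim with the cosine-similarity term from the text."""
    l_loc = gaussian_nll(pred_pose, gt_pose).mean()
    l_sim = -F.cosine_similarity(f_stixel, f_osm, dim=-1).mean()   # reward aligned features
    return l_loc + lam * l_sim

loss = total_loss(torch.randn(4, 3), torch.randn(4, 3),
                  torch.randn(4, 64), torch.randn(4, 64))
print(loss.item())
```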
Algorithm: Fusing Horizon Information for Visual Localization (FHVL)
  • Input: image, map, camera parameters, bev_valid
  • Initialization:
    • Initialize the image encoder and map encoder
    • Initialize the scale classifier
    • Initialize the coordinate transformation and projection modules
    • Initialize the OSM feature extractor
  • Update:
    • Image and map feature extraction:
      (a) f_image = image_encoder(image)
      (b) f_map = map_encoder(map)
    • Stixel feature extraction:
      (a) depth_map = estimate_depth(image)
      (b) stixels = generate_stixels(image, depth_map)
      (c) stixel_tensor = stixels_to_feature_input(stixels)
      (d) stixel_features = stixel_extractor(stixel_tensor)
    • Feature projection:
      (a) f_polar = projection_polar(f_image, scales, camera)
      (b) f_bev = projection_bev(f_polar)
    • OSM feature extraction:
      (a) angles, distances = visualize_map_with_pose_and_tile(map)
      (b) osm_features = extract_polar_features(angles, distances)
    • Similarity-weighted score calculation:
      (a) stixel_angle_features = compress the Stixel angular features with the Fourier transform
      (b) osm_angle_features = normalize the OSM angular features
      (c) best_angle, best_similarity = find_best_rotation(stixel_angle_features, osm_angle_features)
    • Score calculation:
      (a) scores = exhaustive_vote(f_bev, f_map, bev_valid, confidence, similarity)
    • Loss calculation:
      (a) nll = nll_loss_xyr(log_probs, xy_gt, yaw_gt)
      (b) similarity_loss = −mean(similarity)
      (c) total_loss = nll + λ · similarity_loss

4. Experiments and Results

Experimental Setup
In the experimental setup, the relevant hardware and software information is presented as follows:
  • Hardware: NVIDIA GeForce RTX 3090
  • Language: Python 3.9.21
  • Framework: PyTorch 2.5.1
Dataset Selection
The Mapillary Geo-Localization (MGL) dataset [23] and KITTI dataset [24] are used in this experiment, and they have the following characteristics:
  • Data Source and Scale: MGL is collected from the Mapillary street-view platform and widely covers 12 major cities in Europe and America, with a total of 760,000 street-view images. KITTI is sourced from Karlsruhe, Germany, and its surrounding areas and was first released in 2012. It contains over 6 h of traffic scene recordings, covers a driving distance of about 39.2 km, and includes over 200,000 3D-labeled objects and more than 30,000 pairs of stereo images.
  • Capture Devices and Scenes: The capture devices of the MGL dataset are diverse, including hand-held, vehicle-mounted, and bicycle-mounted cameras, capturing a variety of scenes such as city streets, buildings, and transportation facilities, with images that also vary in height and viewing angle. The KITTI dataset was collected using a modified Volkswagen Passat station wagon equipped with 2 PointGrey Flea2 color cameras (1392 × 512 pixels), 2 PointGrey Flea2 grayscale cameras (1392 × 512 pixels), 1 Velodyne HDL-64E LiDAR, and 1 OXTS RT3003 high-precision GPS/IMU system. The scenes include urban roads, residential areas, highways, and rural roads and cover different weather and lighting conditions.
  • Currently, studies focus on the localization technology between ground images and satellite images from different perspectives. Therefore, in addition to using OpenStreetMap (OSM), we have also employed satellite maps.
  • Annotation Information and Acquisition Method: The MGL dataset provides 6-Degree-of-Freedom (6-DoF) pose information, including 3 translation parameters (x, y, z) and 3 rotation parameters (roll, pitch, yaw). This pose information is obtained by combining Structure from Motion (SfM) with GPS data fusion.
Evaluation Metrics
  • Position Recall: The evaluation thresholds are set as 1 m, 3 m, and 5 m. The proportion of samples whose predicted position is within the corresponding threshold distance from the ground truth position to the total samples is calculated to evaluate the spatial accuracy of the model’s positioning.
  • Orientation Recall: 1 degree, 3 degrees, and 5 degrees are used as the evaluation thresholds. The proportion of samples whose angle difference between the predicted orientation and the ground truth orientation is within the threshold is calculated to measure the accuracy of the model’s orientation estimation.
  • Error Calculation: The position error is measured by the Euclidean distance between the predicted and ground truth positions, and the orientation error by the angular difference between the predicted and ground truth orientations (a minimal computation sketch of these metrics follows this list).
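As referenced above, a minimal NumPy sketch of the position and orientation recall computation; recall_at is a hypothetical helper and the random inputs are placeholders.

```python
import numpy as np

def recall_at(pred_xy, gt_xy, pred_yaw, gt_yaw,
              pos_thresholds=(1.0, 3.0, 5.0), ang_thresholds=(1.0, 3.0, 5.0)):
    """Position recall at 1/3/5 m and orientation recall at 1/3/5 degrees."""
    pos_err = np.linalg.norm(pred_xy - gt_xy, axis=-1)                 # Euclidean distance
    ang_err = np.abs((pred_yaw - gt_yaw + 180.0) % 360.0 - 180.0)      # wrapped angle difference
    pos_recall = {t: float((pos_err <= t).mean()) for t in pos_thresholds}
    ang_recall = {t: float((ang_err <= t).mean()) for t in ang_thresholds}
    return pos_recall, ang_recall

pos_r, ang_r = recall_at(np.random.rand(100, 2) * 5, np.random.rand(100, 2) * 5,
                         np.random.rand(100) * 10, np.random.rand(100) * 10)
print(pos_r, ang_r)
```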
Baseline Model Setup
OrienterNet is selected as the baseline model and adjusted and adapted according to the research method to ensure effective comparative analysis in the experiment.
The experiments took one and a half months to complete and used different autonomous driving datasets and various positioning methods; the results are shown in Table 1, Table 2 and Table 3. When trained on the KITTI dataset, the localization recall of our proposed FHVL method is higher than that of the other methods in terms of lateral and longitudinal position errors; when trained on the MGL dataset, it is also slightly higher in terms of position errors. Additionally, as depicted in Figure 4, FHVL outperforms the other methods more markedly in orientation recall on both datasets and exhibits more robust performance. As shown in Figure 5, the model accurately captures features such as building contours in typical urban scenes, with minimal error between the predicted position and the ground truth, further validating the improvement in positioning accuracy brought by multi-modal feature fusion. All experimental settings are kept consistent with those of the baseline.

5. Algorithm Complexity Analysis

Time Complexity
  • OrienterNet: The dominant computational step is exhaustive voting, with complexity $O(RM^2N^2)$. Other steps include feature extraction $O(CHW)$, BEV projection $O(CDH)$, etc.
  • Our Method: Additional computations introduced on top of OrienterNet:
    Stixel generation and feature extraction: $O(W D_s + S \cdot C_s)$
    OSM feature extraction: $O(B)$
    Similarity computation: $O(C_f)$
    Additional voting step: $O(RM^2N^2)$
    Overall complexity: $O(2RM^2N^2 + CHW + CDH + CM^2 + \text{other minor terms})$
  • Theoretical Complexity: The theoretical complexity is approximately twice that of the original method, dominated by the two voting operations.
  • Actual Runtime: The average measured increase is 46%, attributed to:
    Reuse of intermediate results through hardware acceleration
    Parallel execution of Stixel and OSM feature extraction
    Reduced computational load via compressed feature representations
Space Complexity
  • OrienterNet: Major memory overheads include BEV features $O(CN^2)$ and voting templates $O(RCN^2)$.
  • Our Method: Additional memory requirements:
    Depth map: $O(HW)$
    Stixel and OSM features: $O(SW + C_o)$
    Second score tensor: $O(RM^2)$
    Overall complexity: $O(RCN^2 + CM^2 + 2RM^2 + \text{other minor terms})$
  • Summary: The primary increase comes from storing the second score tensor, which is relatively small compared to the original method’s total memory footprint.
Model Parameters
  • Additional Stixel feature extractor: ∼68 K parameters
  • Extra feature projection layers: ∼4 K parameters
  • Total increase: ≈0.5% relative to OrienterNet's 14.5 M parameters

6. Conclusions

Firstly, the proposed model incorporates a multi-modal feature fusion mechanism that effectively integrates BEV features, Stixel structural features, and OpenStreetMap (OSM) features, providing a more comprehensive scene understanding capability. Secondly, by introducing a geometric-constraint-based similarity calculation, the model can more accurately evaluate the consistency between different features, thereby improving the positioning accuracy. These improvements enable the model to better handle complex real-world scenarios, reduce the dependence on any single feature, and enhance the model's adaptability to environmental changes. Experimental results show that the proposed model achieves significant improvements in positioning accuracy, orientation estimation, and scene understanding, demonstrating strong practical value and generalization ability. Future work will focus on three aspects: integrating multi-sensor data (e.g., LiDAR, radar) into the heterogeneous framework via customized extractors to enhance robustness in harsh environments; introducing temporal similarity metrics and feature accumulation to improve the consistency and accuracy of continuous localization for autonomous driving and robot navigation; and developing dynamic feature importance evaluation to adaptively adjust feature dependency in response to environmental changes, strengthening generalization and stability across scenarios.

Author Contributions

Conceptualization was carried out by C.Z. and Y.Y. Methodology, software, validation, and the preparation of the original draft were undertaken by C.Z. and Y.Y. The review and editing work were completed by Y.W. and H.Z. The visualization work was conducted by Y.W. Project administration was the responsibility of G.L. Funding acquisition was assumed by G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MGL dataset originates from the article "Mapillary planet-scale depth dataset" by M. L. Antequera et al., published at ECCV 2020 [23]. The KITTI dataset originates from the article "Vision meets robotics: The KITTI dataset" by A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, published in The International Journal of Robotics Research [24].

Conflicts of Interest

Authors Yuchan Yang, Yiwei Wang and Helu Zhang were employed by Lotus Robotics. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhang, J.; Singh, S. LOAM: Lidar Odometry and Mapping. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Chicago, IL, USA, 14–18 September 2014; pp. 1271–1278. [Google Scholar]
  2. Zhao, B.; Yang, L.; Mao, M.; Bao, H.; Cui, Z. Pnerfloc: Visual localization with point-based neural radiance fields. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7450–7459. [Google Scholar] [CrossRef]
  3. Sarlin, P.; DeTone, D.; Yang, T.; Avetisyan, A.; Straub, J.; Malisiewicz, T.; Bulo, S.; Newcombe, R.; Kontschieder, P.; Balntas, V. OrienterNet: Visual localization in 2D public maps with neural matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21632–21642. [Google Scholar]
  4. Li, Q.; Wang, Y.; Wang, Y.; Zhao, H. Hdmapnet: An online hd map construction and evaluation framework. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 4628–4634. [Google Scholar]
  5. Muñoz-Bañón, M.A.; Velasco-Sánchez, E.; Candelas, F.A.; Torres, F. Openstreetmap-based autonomous navigation with lidar naive-valley-path obstacle avoidance. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22284–22296. [Google Scholar] [CrossRef]
  6. Barros, T.; Garrote, L.; Pereira, R.; Premebida, C.; Nunes, U.J. Improving localization by learning pole-like landmarks using a semi-supervised approach. In Proceedings of the Robot 2019 The Fourth Iberian Robotics Conference, Porto, Portugal, 20–22 November 2020; Silva, M.F., Lima, J.L., Reis, L.P., Sanfeliu, A., Tardioli, D., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 255–266. [Google Scholar]
  7. Levi, D.; Garnett, N.; Fetaya, E.; Herzlyia, I. Stixelnet: A deep convolutional network for obstacle detection and road segmentation. In Proceedings of the British Machine Vision Conference 2015, Swansea, UK, 7–10 September 2015; Volume 1, p. 4. [Google Scholar]
  8. Samano, N.; Zhou, M.; Calway, A. You are here: Geolocation by embedding maps and images. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 502–518. [Google Scholar]
  9. O’Keefe, J.; Nadel, L. The Hippocampus as a Cognitive Map; Clarendon Press: Oxford, UK, 1978. [Google Scholar]
  10. Park, C.; Seo, E.; Lim, J. Heightlane: Bev heightmap guided 3d lane detection. arXiv 2024, arXiv:2408.08270. [Google Scholar]
  11. Wang, R.; Qin, J.; Li, K.; Li, Y.; Cao, D.; Xu, J. Bev-lanedet: A simple and effective 3d lane detection baseline. arXiv 2023, arXiv:2210.06006. [Google Scholar]
  12. He, Y.; Liang, S.; Rui, X.; Cai, C.; Wan, G. Egovm: Achieving precise ego-localization using lightweight vectorized maps. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 12248–12255. [Google Scholar]
  13. Lin, Z.; Wang, Y.; Qi, S.; Dong, N.; Yang, M.H. Bev-mae: Bird’s eye view masked autoencoders for point cloud pre-training in autonomous driving scenarios. arXiv 2024, arXiv:2212.05758. [Google Scholar] [CrossRef]
  14. Wu, H.; Zhang, Z.; Lin, S.; Mu, X.; Zhao, Q.; Yang, M.; Qin, T. Maplocnet: Coarse-to-fine feature registration for visual re-localization in navigation maps. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 13198–13205. [Google Scholar]
  15. Sarlin, P.; Trulls, E.; Pollefeys, M.; Hosang, J.; Lynen, S. Snap: Self-supervised neural maps for visual positioning and semantic understanding. Adv. Neural Inf. Process. Syst. 2023, 36, 7697–7729. [Google Scholar]
  16. Lobben, A.K. Navigational map reading: Predicting performance and identifying relative influence of map-related abilities. Ann. Assoc. Am. Geogr. 2007, 97, 64–85. [Google Scholar] [CrossRef]
  17. Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. arXiv 2022, arXiv:2204.00097. [Google Scholar]
  18. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1–9. [Google Scholar]
  19. Yin, H.; Wang, Y.; Tang, L.; Ding, X. LocNet: Global Localization in 3D Point Clouds for Mobile Robots. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1355–1360. [Google Scholar]
  20. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.-L. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1–10. [Google Scholar]
  21. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3213–3223. [Google Scholar]
  22. Omama, M.; Inani, P.; Paul, P.; Yellapragada, S.C.; Jatavallabhula, K.M.; Chinchali, S.; Krishna, M. Alt-pilot: Autonomous navigation with language augmented topometric maps. arXiv 2023, arXiv:2310.02324. [Google Scholar]
  23. Antequera, M.L.; Gargallo, P.; Hofinger, M.; Bulo, S.R.; Kuang, Y.; Kontschieder, P. Mapillary planet-scale depth dataset. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 3–4. [Google Scholar]
  24. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  25. Xia, Z.; Booij, O.; Manfredi, M.; Kooij, J.F. Visual cross-view metric localization with dense uncertainty estimates. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 2–13. [Google Scholar]
  26. Shi, Y.; Li, H. Beyond cross-view image retrieval: Highly accurate vehicle localization using satellite image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2–13. [Google Scholar]
  27. Shi, Y.; Yu, X.; Campbell, D.; Li, H. Where am i looking at? Joint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2–7. [Google Scholar]
  28. Zhu, S.; Yang, T.; Chen, C. Vigor: Crossview image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2–7. [Google Scholar]
Figure 1. Example of FHVL. Comparison of feature resolution between FHVL and traditional deep learning. The upper image in the middle part of the figure is the traditional feature map with blurred edges, and the lower one is the Stixel feature map with enhanced depth and edge details. The prediction result (black arrow) closely aligns with the ground truth (red arrow).
Figure 2. Architecture of FHVL, including BEV-Map matching and angle rotation matching modules.
Figure 3. Process of coordinate transformations.
Figure 4. Comparing the performance of different methods on OpenStreetMap and Satellite data, our model shows more robustness in terms of angular recall.
Figure 5. During training, when a single image is input into the FHVL model and fused with the input map, the prediction result (black arrow) closely aligns with the ground truth (red arrow), approaching precise matching. The model accurately captures characteristic patterns of urban objects (e.g., building corners, crosswalks, intersections) via advanced feature extraction and integrates these into location and scene analysis.
Table 1. Comparison of lateral accuracy at different distances.

| Map | Approach | Training Dataset | Lateral@1 m | Lateral@3 m | Lateral@5 m | Runtime (ms) |
|---|---|---|---|---|---|---|
| OpenStreetMap | retrieval [8,25] | MGL | 37.47 | 66.24 | 72.89 | 145 |
| OpenStreetMap | refinement [26] | MGL | 50.83 | 78.10 | 82.22 | 210 |
| OpenStreetMap | OrienterNet [3] | MGL | 53.51 | 88.85 | 94.47 | 125 |
| OpenStreetMap | BEV + Stixel | MGL | 56.00 | 88.20 | 95.10 | 85 |
| OpenStreetMap | BEV + Angle | MGL | 54.80 | 89.50 | 94.70 | 80 |
| OpenStreetMap | Ours | MGL | 57.25 | 87.65 | 95.62 | 105 |
| Satellite | DSM [27] | KITTI | 10.77 | 31.37 | 48.24 | 180 |
| Satellite | VIGOR [28] | KITTI | 17.38 | 48.20 | 70.79 | 155 |
| Satellite | refinement [26] | KITTI | 27.82 | 59.79 | 72.89 | 210 |
| Satellite | OrienterNet [3] | KITTI | 51.26 | 84.77 | 91.81 | 125 |
| Satellite | BEV + Stixel | KITTI | 53.50 | 84.10 | 92.00 | 87 |
| Satellite | BEV + Angle | KITTI | 53.00 | 85.00 | 92.50 | 82 |
| Satellite | Ours | KITTI | 54.84 | 83.56 | 92.35 | 107 |
Table 2. Comparison of longitudinal recall at different distances.

| Map | Approach | Training Dataset | Longitudinal R@1 m | Longitudinal R@3 m | Longitudinal R@5 m | Runtime (ms) |
|---|---|---|---|---|---|---|
| OpenStreetMap | retrieval [8,25] | MGL | 5.94 | 16.88 | 26.97 | 145 |
| OpenStreetMap | refinement [26] | MGL | 17.75 | 40.32 | 52.40 | 210 |
| OpenStreetMap | OrienterNet [3] | MGL | 26.25 | 59.84 | 70.76 | 125 |
| OpenStreetMap | BEV + Stixel | MGL | 27.50 | 62.00 | 71.00 | 85 |
| OpenStreetMap | BEV + Angle | MGL | 26.80 | 61.50 | 70.90 | 80 |
| OpenStreetMap | Ours | MGL | 28.07 | 63.73 | 71.51 | 105 |
| Satellite | DSM [27] | KITTI | 3.87 | 11.73 | 19.50 | 180 |
| Satellite | VIGOR [28] | KITTI | 4.07 | 12.52 | 20.14 | 155 |
| Satellite | refinement [26] | KITTI | 5.75 | 16.36 | 26.48 | 210 |
| Satellite | OrienterNet [3] | KITTI | 22.39 | 46.79 | 57.81 | 125 |
| Satellite | BEV + Stixel | KITTI | 23.00 | 48.50 | 60.50 | 87 |
| Satellite | BEV + Angle | KITTI | 22.80 | 47.50 | 58.50 | 82 |
| Satellite | Ours | KITTI | 23.4 | 49.83 | 61.27 | 107 |
Table 3. Comparison of orientation recall at different angles.

| Map | Approach | Training Dataset | Orientation R@1° | Orientation R@3° | Orientation R@5° | Runtime (ms) |
|---|---|---|---|---|---|---|
| OpenStreetMap | retrieval [8,25] | MGL | 2.97 | 12.32 | 23.27 | 145 |
| OpenStreetMap | refinement [26] | MGL | 31.03 | 66.76 | 76.07 | 210 |
| OpenStreetMap | OrienterNet [3] | MGL | 34.26 | 73.51 | 89.45 | 125 |
| OpenStreetMap | BEV + Angle | MGL | 40.80 | 85.20 | 94.70 | 80 |
| OpenStreetMap | BEV + Stixel | MGL | 36.50 | 75.00 | 89.80 | 85 |
| OpenStreetMap | Ours | MGL | 41.80 | 86.55 | 95.01 | 105 |
| Satellite | DSM [27] | KITTI | 3.53 | 14.09 | 23.95 | 180 |
| Satellite | VIGOR [28] | KITTI | – | – | – | 155 |
| Satellite | refinement [26] | KITTI | 18.42 | 49.72 | 71.00 | 210 |
| Satellite | OrienterNet [3] | KITTI | 20.41 | 52.24 | 73.53 | 125 |
| Satellite | BEV + Angle | KITTI | 24.00 | 60.00 | 81.50 | 82 |
| Satellite | BEV + Stixel | KITTI | 22.00 | 55.00 | 78.00 | 87 |
| Satellite | Ours | KITTI | 24.90 | 61.51 | 82.68 | 107 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

