1. Introduction
Unmanned vehicles, owing to their intelligence, efficiency, and cost-effectiveness, have been widely adopted in fields such as intelligent logistics, smart transportation, precision agriculture, and defense. However, when they operate in complex environments such as indoor spaces, dense urban districts, or vegetation-covered areas, localization is challenged by degraded or entirely unavailable Global Navigation Satellite System (GNSS) signals. In such GNSS-denied environments, pose initialization is a critical prerequisite for unmanned vehicles and directly affects the accuracy and robustness of subsequent positioning and navigation [1, 2]. Incorrect initialization degrades iterative positioning, increasing both the risk of convergence failure and the computational cost. The resulting errors propagate to the planning and control modules and degrade overall system performance. To address localization in GNSS-denied environments, probabilistic filtering-based, feature matching-based, and deep learning-based methods have been widely explored for pose initialization [3, 4, 5]; they differ in how sensor data are processed. In terms of sensing modality, pose initialization methods fall primarily into vision-based and LiDAR-based approaches. Vision-based methods use cameras to capture artificial markers or environmental images, to which feature-matching or deep learning algorithms are applied to identify keypoints in the scene. LiDAR-based methods rely on laser scanning of the surroundings and achieve initialization through point cloud matching or pre-constructed maps.
Vision-based methods can be marker-based, using artificial markers such as Quick Response (QR) codes, or markerless, relying on natural features in the environment. Marker-based methods are accurate and computationally efficient. For instance, a QR-based method [6] combines external QR code information with internal encoder values and refines the pose estimate with an extended Kalman filter. For large-scale or frequently updated environments, a smart artificial marker [7] with sensing and ranging capabilities was developed, enabling active self-reporting and more accurate visual localization. To enhance the robustness of indoor localization, Yu [8] presented a multi-source fusion framework integrating QR codes, Wi-Fi, Bluetooth Low Energy, and sensors. However, marker-based methods are constrained by limited coverage (they function only within marker-deployed areas), poor adaptability (scene modifications require redeployment), and high deployment and maintenance costs. Markerless methods, on the other hand, use natural environmental features such as wall corners, door frames, surface textures, or object edges. Keypoints and their descriptors are extracted from images by feature detection algorithms, and similarity matching is then performed to determine the target’s position in the image. To achieve robust detection across multiple scales, Lowe [9] proposed the Scale-Invariant Feature Transform (SIFT), which constructs descriptors from local gradient information for image matching. To improve localization under illumination variation or occlusion, Patch-NetVLAD [10] divides an image into patches and uses contextual cues. To enhance computational efficiency, a shared local feature strategy [11] was introduced, allowing feature reuse across images to eliminate redundant computation. Likewise, a prioritized matching approach [12] improves efficiency by exploiting visibility information from 3D reconstruction during 2D–3D feature matching. In recent years, deep learning-based visual localization methods have gained increasing attention, commonly employing convolutional neural networks (CNNs), graph neural networks (GNNs), or Transformers for image feature extraction and camera pose estimation. Among CNN-based methods, SuperPoint [13] uses a CNN to jointly compute pixel-level interest points and descriptors, with homographic adaptation improving cross-domain repeatability. Dusmanu [14] proposed D2-Net, in which a single CNN serves simultaneously as a dense feature descriptor and detector, achieving robustness under day-night changes. To improve accuracy, a partially differentiable CNN module [15] generates keypoints at the sub-pixel level through optimization with reprojection losses. To improve real-time localization performance in large environments, Sarlin [16] proposed the Hierarchical Feature Network (HF-Net), which integrates global retrieval and local matching within a coarse-to-fine pipeline. GNN-based methods include SuperGlue [17], which matches local features by jointly estimating correspondences and rejecting non-matchable points; Angle-Annular-GNN [18], which efficiently learns robust geometric structural representations with annular feature extraction; and Sparse Spatial Scene Embedding-GNN [19], which encodes image embeddings into a pose graph for scene matching. In addition to CNNs and GNNs, recent works explore Transformer architectures for visual localization. For instance, Shavit [20] proposed using multi-headed Transformers to perform end-to-end absolute camera pose regression. Wang [21] integrated a hierarchical scene coordinate network with a Transformer to achieve robust scalability in large environments. EffLoc [22] leverages a hierarchical vision Transformer with sequential group attention to enhance computational efficiency.
LiDAR-based localization methods estimate the pose by scanning the environmental geometry and matching LiDAR data with an existing map. They are generally categorized into geometry-based and deep learning-based approaches. Typical geometric matching methods include the Normal Distributions Transform (NDT) and Adaptive Monte Carlo Localization (AMCL). As an alternative to the classical Iterative Closest Point (ICP) algorithm, NDT improves robustness by replacing explicit point-to-point correspondences with probabilistic modeling of local geometric structures. NDT associates a piecewise normal distribution with the reference point cloud and finds the spatial transform that maximizes the probability of the source point cloud under this distribution. Several improvements have been made to enhance NDT performance. Multiscale Iterative NDT (MI-NDT) [23] improves efficiency in large-scale registration by dividing point clouds into multiple resolutions, which also allows tolerance to larger initialization errors. Similarly, another variant [24] increases grid density at complex intersections and lowers it on open roads, balancing computational load and memory consumption in urban scenes. To reduce accumulated errors and drift, NDT-LOAM [25] introduces range weighting and surface features such as curvature to refine covariance estimation within NDT voxels; its local geometric features are further used to refine poses after coarse NDT registration. To address large-scale topological localization, NDT-Transformer [26] compresses dense 3D point clouds into probabilistic NDT cells that describe geometric structures and employs a Transformer to learn global descriptors from these cells for location retrieval.
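The NDT idea described above (fit a normal distribution to each cell of the reference cloud, then score a candidate transform by how probable the transformed source points are) can be illustrated with a minimal 2D sketch. This is only an illustration under stated assumptions, not the implementation used in this paper or in any cited variant: the function names, the grid cell size, the covariance regularization constant, and the restriction to 2D are all hypothetical choices, and the search over transforms (normally done by Newton-type optimization) is omitted.

```python
import math
from collections import defaultdict


def build_ndt_cells(points, cell_size=1.0, min_points=3):
    """Fit one 2D Gaussian (mean, covariance) to each occupied grid cell."""
    buckets = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        buckets[key].append((x, y))
    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        if n < min_points:
            continue  # too few points for a meaningful covariance
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        # Sample covariance; the 1e-3 diagonal term is an assumed regularizer
        # that keeps the matrix invertible for nearly collinear points.
        sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1) + 1e-3
        syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1) + 1e-3
        sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
        cells[key] = (mx, my, sxx, sxy, syy)
    return cells


def transform(points, tx, ty, theta):
    """Apply a 2D rigid transform (rotation theta, then translation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def ndt_score(cells, points, cell_size=1.0):
    """Sum of unnormalized Gaussian likelihoods of points under their cells."""
    score = 0.0
    for x, y in points:
        cell = cells.get((math.floor(x / cell_size), math.floor(y / cell_size)))
        if cell is None:
            continue  # point falls in a cell with no fitted distribution
        mx, my, sxx, sxy, syy = cell
        det = sxx * syy - sxy * sxy
        dx, dy = x - mx, y - my
        # Squared Mahalanobis distance via the closed-form 2x2 inverse.
        m = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        score += math.exp(-0.5 * m)
    return score
```

A registration routine would evaluate `ndt_score(cells, transform(scan, tx, ty, theta))` for candidate poses and keep the pose with the highest score; a well-aligned scan is expected to score higher than a shifted one.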
AMCL applies a particle filter that matches observed point clouds against a pre-built map. Particle weights are iteratively refined until the particles converge to the optimal pose estimate. Several improvements to AMCL have been proposed. Self-adaptive MCL (SAMCL) [27] introduces similar energy regions, in which poses share comparable energy, to guide the distribution of global samples and achieve higher localization performance. To improve localization robustness in changing environments, a modified AMCL algorithm [28] incorporates object recognition and dynamic semantic map updating, while Artificial Landmark Enhanced Localization AMCL (ALEL-AMCL) [29] integrates pre-positioned artificial landmark observations into the particle update process. To address the kidnapped robot problem with an unknown initial pose, one method [30] employs offline feature matching and particle swarm optimization, while another [31] integrates 2D laser and range finder information to achieve more reliable localization. Generally, NDT provides high accuracy and robustness but struggles with dynamic objects and feature-sparse environments, while AMCL is efficient and scalable but highly sensitive to pose initialization. Deep learning-based LiDAR methods, on the other hand, can automatically learn environmental features without manual parameter tuning, offering stronger robustness in dynamic environments. For instance, DOPNet [32] uses a graph convolutional network for feature extraction and a multilayer perceptron to predict the spatial transformation. Furthermore, deformable kernels can be incorporated into graph convolutional networks [33] to enhance feature extraction in irregular and unstructured point clouds. PointLoc [34] leverages a PointNet-style network with self-attention to infer poses from a single LiDAR frame.
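The particle-filter loop behind Monte Carlo localization (move the particles, weight them by how well the observation matches the map, resample) can be sketched in one dimension as follows. This is a simplified illustration, not AMCL itself: the adaptive sample-size step is omitted, the "map" is a single wall at an assumed position, and the function names, noise levels, and particle counts are hypothetical values chosen for the example.

```python
import math
import random


def gaussian(err, sigma):
    """Unnormalized Gaussian likelihood of a measurement error."""
    return math.exp(-0.5 * (err / sigma) ** 2)


def mcl_step(particles, control, measurement, wall, rng,
             motion_sigma=0.05, sensor_sigma=0.2):
    """One predict / weight / resample cycle on 1D positions."""
    # 1) Motion update: apply the control with assumed Gaussian motion noise.
    moved = [p + control + rng.gauss(0.0, motion_sigma) for p in particles]
    # 2) Measurement update: compare the observed range to the wall with the
    #    range each particle would expect to see from the map.
    weights = [gaussian(measurement - (wall - p), sensor_sigma) for p in moved]
    if sum(weights) == 0.0:
        return moved  # all weights underflowed; skip resampling this step
    # 3) Resampling: draw particles in proportion to their weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def localize(true_x=3.0, wall=10.0, steps=20, n=1000, seed=0):
    """Global localization: start from particles spread over the whole map."""
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, wall) for _ in range(n)]
    for _ in range(steps):
        true_x += 0.1  # the simulated robot moves forward
        z = wall - true_x + rng.gauss(0.0, 0.2)  # noisy range reading
        particles = mcl_step(particles, 0.1, z, wall, rng)
    estimate = sum(particles) / len(particles)
    return estimate, true_x
```

Calling `localize()` returns the particle-mean estimate together with the simulated true position. Because the particles start uniformly over the whole map, the same loop also shows why the cost of global localization grows with map size: more area requires more particles to guarantee that some of them start near the true pose.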
When prior pose information is unavailable, AMCL performs global localization by uniformly distributing particles across the entire map and iteratively updating their weights using motion and sensor measurements. However, its computational efficiency and localization reliability degrade as the map scale increases [30]. In large or structurally symmetric environments, the particle filter may suffer from slow convergence or ambiguity-induced degeneration, making global localization challenging [35]. Therefore, a reliable and efficient coarse localization mechanism capable of providing prior pose information is highly desirable for large-scale GNSS-denied environments. Sensor observations at the true location are expected to exhibit high similarity with the corresponding local map data. A Siamese Neural Network, a deep learning architecture with twin branches sharing identical weights, can be employed to evaluate such similarities. It has been widely applied in comparison and matching tasks such as face verification [36], fingerprint-based authentication [37], and signature verification [38]. Inspired by this architecture, point cloud maps can be represented as images, and sensor-acquired point clouds can be compared with map slices through the Siamese network to estimate the robot’s position. Accordingly, this paper introduces an innovative pose initialization algorithm that first employs a Siamese network for coarse localization and then integrates AMCL and NDT for fine pose refinement, enabling robust recovery from kidnapped-robot situations.
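The slice-selection idea described above can be sketched as follows: the same embedding function (standing in for the twin branches with shared weights) is applied to the scan image and to every map slice, and the slice with the highest similarity is taken as the coarse location. This is a toy illustration under strong assumptions. The hand-written `embed` function (row and column occupancy sums) is a placeholder for a trained network, cosine similarity is an assumed choice of metric, and the function names are hypothetical; none of this is taken from the architecture proposed in this paper.

```python
import math


def embed(image):
    """Shared branch: the SAME function is applied to both inputs.

    Placeholder for a learned encoder: concatenates row and column
    occupancy sums of a binary image and L2-normalizes the result.
    """
    rows = [sum(r) for r in image]
    cols = [sum(c) for c in zip(*image)]
    vec = rows + cols
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def similarity(image_a, image_b):
    """Cosine similarity between the two branch embeddings."""
    ea, eb = embed(image_a), embed(image_b)
    return sum(x * y for x, y in zip(ea, eb))


def best_slice(scan_image, map_slices):
    """Return the index of the most similar map slice and all scores."""
    scores = [similarity(scan_image, s) for s in map_slices]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In a coarse-to-fine pipeline of the kind outlined above, the center of the selected slice would serve as the prior pose handed to the fine refinement stage, so that particles no longer need to be spread over the entire map.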
The main contributions of this paper are as follows:
- (1) A novel framework for coarse-to-fine localization is proposed, integrating a Siamese Neural Network for initial pose estimation and subsequent refinement with AMCL and NDT.
- (2) By treating point cloud maps as images, coarse localization is achieved through the Siamese Neural Network that identifies the correct slice by computing similarity scores between sensor-acquired point clouds and the pre-existing map slices.
- (3) The proposed approach achieves decimeter-level positional accuracy and sub-degree orientation accuracy under GNSS-denied conditions. It obtains localization success rates of 99% on both medium- and large-scale maps, with distance RMSEs of 0.175 m and 0.348 m, and orientation RMSEs of 0.149° and 0.437°, respectively.
The remainder of this paper is organized as follows: Section 2 introduces the overall framework of the algorithm, Section 3 presents experiments conducted on self-collected LiDAR point cloud maps and the KITTI dataset and discusses the results, and Section 4 provides the concluding remarks.