Article

Loop-MapNet: A Multi-Modal HDMap Perception Framework with SDMap Dynamic Evolution and Priors

1 College of Automotive Engineering, Wuhan University of Technology, Luoshi Road, Wuhan 430070, China
2 Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Luoshi Road, Wuhan 430070, China
3 Hubei Collaborative Innovation Center for Automotive Components Technology, Wuhan University of Technology, Luoshi Road, Wuhan 430070, China
4 Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan University of Technology, Luoshi Road, Wuhan 430070, China
5 School of Mechanical Engineering, Hubei University of Technology, Nanli Road, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 11160; https://doi.org/10.3390/app152011160
Submission received: 12 September 2025 / Revised: 10 October 2025 / Accepted: 16 October 2025 / Published: 17 October 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

High-definition maps (HDMaps) are critical for safe autonomy on structured roads. Yet traditional production—relying on dedicated mapping fleets and manual quality control—is costly and slow, impeding large-scale, frequent updates. Recently, standard-definition maps (SDMaps) derived from remote sensing have been adopted as priors to support HDMap perception, lowering cost but struggling with subtle urban changes and localization drift. We propose Loop-MapNet, a self-evolving, multimodal, closed-loop mapping framework. Loop-MapNet leverages surround-view images, LiDAR point clouds, and SDMaps; it fuses multi-scale vision via a weighted BiFPN, and couples PointPillars BEV and SDMap topology encoders for cross-modal sensing. A Transformer-based bidirectional adaptive cross-attention aligns SDMap with online perception, enabling robust fusion under heterogeneity. We further introduce a confidence-guided masked autoencoder (CG-MAE) that leverages confidence and probabilistic distillation to both capture implicit SDMap priors and enhance the detailed representation of low-confidence HDMap regions. With spatiotemporal consistency checks, Loop-MapNet incrementally updates SDMaps to form a perception–mapping–update loop, compensating for remote-sensing latency and enabling online map optimization. On nuScenes, within 120 m, Loop-MapNet attains 61.05% mIoU, surpassing the best baseline by 0.77%. Under extreme localization errors, it maintains 60.46% mIoU, improving robustness by 2.77%; CG-MAE pre-training raises accuracy in low-confidence regions by 1.72%. These results demonstrate advantages in fusion and robustness, moving beyond one-way prior injection and enabling HDMap–SDMap co-evolution for closed-loop autonomy and rapid SDMap refresh from remote sensing.

1. Introduction

HDMaps provide precise geometric and semantic constraints for autonomous driving on structured roads, underpinning robust localization and planning [1,2,3]. However, conventional production pipelines, depending on specialized mapping fleets and intensive manual QA [4,5,6,7], are expensive and slow, limiting scalability and refresh frequency. In parallel, remote sensing (airborne/satellite imagery) enables low-marginal-cost, wide-area coverage with good spatiotemporal accessibility, facilitating the construction of standard-definition maps (SDMaps) [8,9,10] that can act as priors for online perception [11,12,13,14]. SDMaps extend visibility to distant or occluded regions via road topology and lane geometry. Yet, remote sensing latency and localization drift yield geometric errors and semantic ambiguities that degrade prior quality and alignment with onboard sensors.
Two directions dominate online mapping. The first constructs HDMaps purely from onboard sensors [15,16], learning BEV representations from cameras and LiDAR to detect and vectorize map elements; performance nevertheless degrades at distance and in low light. The second injects SDMap priors [13,17] to enhance far-field visibility and mitigate sparsity/occlusions. Existing methods typically assume rigid alignment and are sensitive to GNSS/IMU drift, extrinsic errors, and prior geometry bias; moreover, SDMaps are treated as static priors, lacking closed-loop evolution.
We address the challenge of SDMap-prior-guided HDMap perception with Loop-MapNet: surround-view cameras, LiDAR, and SDMaps are aligned in BEV via a bidirectional adaptive cross-attention; a confidence-guided masked autoencoder (CG-MAE) enhances details in low-confidence regions; and a spatiotemporal-consistency-constrained update yields incremental SDMap refresh, closing the loop of perception–mapping–update–reperception. Compared with P-MapNet [13], HDMapNet [16], and MapTRv2 [18], our design targets the real-world challenges of remote sensing and onboard sensors (latency, noise, and cross-view disparities) across three key aspects: (1) adaptive cross-modal alignment to compensate for localization/extrinsic errors and prior geometry bias; (2) confidence-guided prior enhancement focusing on ambiguous, far, and occluded regions without altering the main inference path; (3) closed-loop SDMap updates via weighted fusion of high-confidence HDMap predictions and priors under spatiotemporal checks.
On the nuScenes dataset, we validate far-range sensing, misalignment robustness, and update effectiveness. Crucially, the loop significantly improves revisited segments under limited visibility or stale priors, indicating clear engineering value for SDMap–HDMap co-evolution.

2. Related Work

2.1. Online HDMap Construction

Online HDMap construction spans single- and multimodal approaches [19,20,21]. HDMapNet [16] rasterizes BEV from surround cameras via semantic segmentation but suffers from perspective-induced curvature errors at long range. VectorMapNet [21] switches to vectorized map generation with autoregressive Transformer decoding, improving geometric continuity but still facing topological breaks in weak-texture areas. Multimodal methods such as SuperFusion [22] fuse LiDAR and vision to improve accuracy yet remain sensitive to calibration drift; MapTRv2 [18] leverages temporal cues, but its adaptability remains limited. These prior-free approaches are bounded by sensor physics, motivating SDMap priors to extend their capabilities.

2.2. SDMap Construction

Remote sensing and crowdsourcing enable low-cost, wide-area SDMap generation [8]. Aerial or satellite pipelines (e.g., RoadTracer [23], Sat2Graph [24], SpaceNet [25]) extract road graphs via CNNs and graph models, yet suffer from breaks and geometric distortions in weak-texture regions [26]. Lightweight semantic maps from crowdsourcing (e.g., RoadMap [27]) build cloud-updated priors for onboard localization at low cost. Despite complementary fusion with SLAM for trajectory correction [28,29], timeliness and fine-grained accuracy remain insufficient for HDMaps; nevertheless, SDMaps are valuable priors for far-field and occluded areas to aid online HDMap construction.

2.3. SDMap-Prior-Aided Mapping

The core is robust alignment and effective fusion between onboard perception and low-accuracy SD priors. P-MapNet [13] integrates SDMap and historical HDMap priors via attention to OpenStreetMap [30], improving lane detection beyond 200 m by 18.73% mIoU. BLOS-BEV [17] fuses near-field perception with SD priors to achieve 0–200 m sensing, gaining over 20% mIoU in 50–200 m. Other works also leverage SD priors for online HDMap construction [12,31]. However, static cross-modal alignment and neglect of localization drift [32] can misalign SD features and onboard observations, leading to prior failures.
Xu et al. [33] generate synthetic deviation maps featuring controlled biases in geometry, topology, and semantics to harden systems against prior discrepancies. However, their approach does not achieve co-evolution between online maps and priors, which consequently allows errors to accumulate across cycles.

2.4. Dynamic Map Update

Dynamic updates include trajectory-based inference, incremental updates, and closed-loop optimization. Trajectory-based approaches such as [34] reconstruct road graphs from GPS traces but falter at complex junctions and sparse data. DeepRoad [35] detects changes via onboard cameras, though binary detection does not yield usable geometries. At city scale, LDMapNet-U [36] proposes a perception–update–verification loop to strengthen reliability via multi-source consistency checks. These systems inform our closed-loop SDMap update design.

3. Method

Loop-MapNet aims to construct HDMaps online and update SDMaps in a closed loop through multimodal data fusion and dynamic update mechanisms. As illustrated in Figure 1, Loop-MapNet comprises the following core modules: (1) multimodal feature extraction; (2) cross-modal feature fusion; (3) CG-MAE feature optimization; (4) multi-task prediction and closed-loop update. In this section, we elaborate on the overall architecture and implementation details of each component.

3.1. Feature Extraction

3.1.1. Surround-View Image Feature Extraction

To extract multi-scale features from surround-view cameras and convert them to bird’s-eye view (BEV) representations, this section integrates the core ideas from P-MapNet [13], EfficientDet [37], and Lift, Splat, Shoot [38] to achieve efficient feature extraction and spatial transformation.
Given the surround-view image tensor $I \in \mathbb{R}^{B \times n \times 3 \times h \times w}$, where $B$ is the batch size, $n$ is the number of views ($n \in \{1, 2, \ldots, 6\}$), 3 is the number of RGB channels, and $h$ and $w$ are the image height and width, respectively, we merge the batch and view dimensions so that $I \in \mathbb{R}^{B \cdot n \times 3 \times h \times w}$ and process it through the feature extraction network:
$$F_{cam} = \mathrm{CamEncode}(I)$$
The output features are $F_{cam} \in \mathbb{R}^{B \cdot n \times C_{cam} \times h' \times w'}$, where $C_{cam}$ is the target channel number (256) and $h'$ and $w'$ are the downsampled dimensions. CamEncode employs EfficientNet-B4 [39] as the backbone network, extracting multi-layer features $\{F_1, F_2, F_3, F_4, F_5\}$; the last three layers are selected, their channel numbers unified through $1 \times 1$ convolutions, and the results fused through BiFPN:
$$F_{fused} = \mathrm{BiFPN}(C_3, C_4, C_5)$$
To meet the spatial requirements of the BEV transformation, we apply average-pooling downsampling to the first-level features in $F_{fused}$, then reshape the result to $F_{cam} \in \mathbb{R}^{B \times n \times C_{cam} \times h' \times w'}$, separating the features of each view.
Following the Lift, Splat, Shoot [38] approach, we first perform the "Lift" step: the surround-view features are flattened into $F_{flat}$, each view is processed separately, and 2D features are lifted into 3D space through a fully connected network and projected onto the BEV plane $F_{bev}$. Subsequently, in the "Splat" step, inverse perspective mapping (IPM) [40] projects the features of each view into a unified top view $F_{top}$ in the global coordinate system. Finally, the "Shoot" step refines the BEV features through upsampling to obtain the surround-view BEV representation $F_{img} \in \mathbb{R}^{B \times C_{img} \times H \times W}$, where $B$ is the batch size, $C_{img}$ is the number of feature channels, and $H$ and $W$ are the BEV feature map dimensions.
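To make the camera branch more concrete, the following is a minimal PyTorch sketch of the weighted multi-scale fusion used at a BiFPN node, assuming the three selected backbone levels have already been projected to a common channel count. It shows only a single top-down pass (a full BiFPN also runs a bottom-up pass), and all layer names and widths are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of weighted (fast-normalized) BiFPN fusion over three levels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion of same-shaped feature maps at one BiFPN node."""
    def __init__(self, num_inputs: int, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))          # learnable fusion weights
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        w = F.relu(self.w)
        w = w / (w.sum() + 1e-4)                                # normalize the weights
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(F.silu(fused))

class MiniBiFPN(nn.Module):
    """One top-down pass over three pyramid levels (C3, C4, C5)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse4 = WeightedFusion(2, channels)
        self.fuse3 = WeightedFusion(2, channels)

    def forward(self, c3, c4, c5):
        p5 = c5
        p4 = self.fuse4([c4, F.interpolate(p5, size=c4.shape[-2:], mode="nearest")])
        p3 = self.fuse3([c3, F.interpolate(p4, size=c3.shape[-2:], mode="nearest")])
        return p3, p4, p5
```

The learnable, normalized weights are what distinguish the weighted BiFPN from a plain FPN sum and let the network emphasize the scale most useful for the subsequent BEV projection.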

3.1.2. Point Cloud Feature Extraction

To fully utilize LiDAR's strong capability for modeling 3D space, this section adopts the PointPillars [41] feature extraction method to extract point cloud features and convert them to a BEV representation. Given the input point cloud tensor $P_{car} \in \mathbb{R}^{N \times D}$, where $N$ is the number of points and $D$ is the feature dimension of each point (spatial coordinates and reflection intensity, $D = 4$), we transform the points from the vehicle coordinate system to the global coordinate system $P$ using the vehicle pose, then apply pillarization to the transformed point cloud:
$$P_{pillar} = \mathrm{Voxelization}(P) \in \mathbb{R}^{K \times T \times D}$$
where $K$ is the number of non-empty pillars, $T$ is the maximum number of points per pillar, and $D$ is the feature dimension of each point. The point features within each pillar are processed by a PointNet [42] structure:
$$F_{point} = \mathrm{MaxPool}(\mathrm{MLP}(P_{pillar}))$$
Subsequently, the pillar features are scattered back to a pseudo-image and processed by a 2D CNN backbone to obtain multi-scale features, and the final LiDAR BEV features $F_{lidar} \in \mathbb{R}^{B \times C_{lidar} \times H \times W}$ are generated through upsampling and feature fusion, where $B$ is the batch size, $C_{lidar}$ is the number of feature channels, and $H$ and $W$ are the BEV feature map dimensions.
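As a reference for the pillarization step described above, the sketch below groups points into pillars, applies a shared PointNet-style MLP with max-pooling, and scatters the pillar features into a BEV pseudo-image. Tensor shapes, channel counts, and the single-sample batch handling are simplifying assumptions, not the authors' code.

```python
# Sketch of pillar encoding: per-pillar MLP + max-pool, then scatter to BEV.
import torch
import torch.nn as nn

class PillarEncoder(nn.Module):
    def __init__(self, point_dim: int = 4, channels: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(point_dim, channels), nn.BatchNorm1d(channels), nn.ReLU()
        )
        self.channels = channels

    def forward(self, pillars, coords, bev_hw):
        # pillars: (K, T, D) points grouped into K non-empty pillars of up to T points
        # coords:  (K, 2) integer (row, col) BEV cell index of each pillar
        K, T, D = pillars.shape
        x = self.mlp(pillars.reshape(K * T, D)).reshape(K, T, self.channels)
        pillar_feat = x.max(dim=1).values                 # PointNet-style max-pool, (K, C)
        H, W = bev_hw
        canvas = pillars.new_zeros(self.channels, H, W)   # scatter into a pseudo-image
        canvas[:, coords[:, 0], coords[:, 1]] = pillar_feat.t()
        return canvas.unsqueeze(0)                        # (1, C, H, W), ready for a 2D CNN
```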

3.1.3. SDMap Feature Extraction

The original SDMap serving as prior is provided by the open-source map dataset OpenStreetMap (OSM) [30], containing the road network structure of main roads in the corresponding area. We extract corresponding regional data from OSM based on the vehicle’s pose in the global coordinate system to construct a local SDMap dataset. We represent the original SDMap data in image form and process the SDMap data using the three-layer CNN encoding method from P-MapNet [13].
Representing the SDMap data $M_{sd}$ as a tensor, we employ a three-layer convolutional module for SDMap feature extraction and downsampling, obtaining SDMap features $F_{sd} \in \mathbb{R}^{B \times C_{sd} \times H \times W}$ adapted to the BEV spatial resolution, where $B$ is the batch size, $C_{sd}$ is the number of feature channels, and $H$ and $W$ are the BEV feature map dimensions. In practice, this rasterized interface makes the SDMap encoder source-agnostic: priors from OSM or other providers can be ingested without changing the network. When no prior is available (e.g., new districts or rural roads), the pipeline naturally falls back to pure sensor fusion and accumulates priors over revisits.
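One possible form of the three-layer convolutional SDMap encoder is sketched below; the channel counts and strides are assumptions chosen only to illustrate how a rasterized prior is downsampled to BEV-resolution features.

```python
# Assumed three-layer CNN encoder for a rasterized SDMap prior.
import torch.nn as nn

def make_sd_encoder(in_ch: int = 1, out_ch: int = 64) -> nn.Sequential:
    """Encode a rasterized SDMap into C_sd x H x W BEV-aligned features."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    )
```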

3.2. Cross-Modal Feature Fusion

3.2.1. Sensor Feature Fusion

To effectively integrate the respective advantages of surround-view cameras and LiDAR, this section fuses the BEV features of the two modalities. In the preceding feature extraction stage, the surround-view camera features $F_{img} \in \mathbb{R}^{B \times C_{img} \times H \times W}$ and LiDAR features $F_{lidar} \in \mathbb{R}^{B \times C_{lidar} \times H \times W}$ have already been transformed to a unified global coordinate system and a common spatial resolution through calibration parameters and the IPM transformation.
The fusion process adopts channel-wise concatenation:
$$F_{car} = \mathrm{Concat}(F_{img}, F_{lidar}) \in \mathbb{R}^{B \times C_{car} \times H \times W}$$
where $H$ and $W$ are the feature spatial dimensions corresponding to the perception range, and $C_{car}$ is the total number of channels after concatenation ($C_{img} + C_{lidar}$). This direct concatenation is computationally efficient while preserving the original information from both modalities, relying on the subsequent SDMap bidirectional adaptive cross-attention mechanism to adaptively learn the importance and correlation of the different modal features.

3.2.2. Bidirectional Adaptive Cross-Attention Mechanism

This section proposes an SDMap bidirectional adaptive cross-attention mechanism aimed at solving the alignment and fusion problems of heterogeneous multimodal data. In practice, GNSS positioning jitter and limited SDMap accuracy make the alignment between sensor features and SDMap features imprecise. The mechanism dynamically models the relationships between modalities and adaptively corrects spatial offsets to generate consistent BEV representations, providing a solid foundation for online HDMap construction and dynamic SDMap optimization.
The inputs are the sensor fusion features $F_{car} \in \mathbb{R}^{B \times C_{car} \times H \times W}$ and the SDMap features $F_{sd} \in \mathbb{R}^{B \times C_{sd} \times H \times W}$. The fusion process consists of three key stages: initial feature processing, adaptive cross-modal alignment and fusion, and feature integration and optimization.
(1) Initial Feature Processing
To enhance the attention mechanism's perception of spatial positions, we first apply positional encoding to the vehicle sensor fusion features. We process $F_{car}$ through convolutional dimensionality reduction to obtain $F'_{car}$, then generate a sine positional encoding:
$$P_{pos} = \mathrm{PositionEmbeddingSine}(F'_{car}, \mathrm{mask})$$
where $\mathrm{mask}$ is an all-zero mask and $P_{pos}$ is the 2D positional encoding containing sine-transformed spatial coordinate information.
(2) Adaptive Cross-Modal Alignment and Fusion
This stage designs a two-layer adaptive cross-attention layer structure to achieve bidirectional feature enhancement and adaptive alignment.
First, we predict the inter-modal correlation weights through the feedforward network SimilarityNet:
$$S = \mathrm{SimilarityNet}(\mathrm{Concat}(F'_{car}, F_{sd}))$$
SimilarityNet consists of two linear layers with a ReLU activation in between and a Sigmoid activation at the output, ensuring that the output $S$ represents the inter-modal correlation strength. We then predict the spatial offset matrix through OffsetNet:
$$O = \mathrm{OffsetNet}(F'_{car})$$
OffsetNet likewise consists of two linear layers, with a Tanh activation at the output that produces a normalized offset $O$ representing the spatial correction. We then apply the offset to $F_{sd}$ via grid sampling to obtain the dynamically aligned SDMap features $F'_{sd}$:
$$F'_{sd} = \mathrm{GridSample}(F_{sd}, O)$$
This step overcomes the spatial alignment deviation caused by positioning system jitter and insufficient SDMap accuracy, achieving adaptive spatial correction based on vehicle perception features.
Meanwhile, we enhance the internal consistency of the vehicle sensor features through a self-attention mechanism:
$$\tilde{F}_{car} = \mathrm{SelfAttn}(Q_{car}, K_{car}, V_{car})$$
where $Q_{car} = K_{car} = F'_{car} + P_{pos}$ are the query and key matrices with positional encoding added, and $V_{car} = F'_{car}$ is the value matrix.
Finally, we perform cross-modal cross-attention fusion, using the vehicle sensor features as queries and the corrected SDMap features as keys and values:
$$F_{aligned} = \mathrm{MultiheadAttn}(Q_{cross}, K_{cross}, V_{cross}) \cdot S$$
where $Q_{cross} = \tilde{F}_{car} + P_{pos}$ are the enhanced vehicle sensor features, and $K_{cross} = F'_{sd} + P_{pos}$ and $V_{cross} = F'_{sd}$ are the corrected SDMap features. A multi-head attention mechanism is used together with residual connections and layer normalization, and the output is adaptively weighted by the correlation weights $S$ to obtain the fused feature representation $F_{aligned}$.
After bidirectional attention enhancement and adaptive alignment, the feature representation is further refined by a feedforward network:
$$F'_{aligned} = \mathrm{FFN}(F_{aligned}) + F_{aligned}$$
The feedforward network comprises dimension expansion, an activation function, dropout, and dimension contraction, combined with a residual connection to preserve the original information. Two adaptive cross-attention layers are cascaded for iterative refinement, and the original resolution is finally restored through an upsampling convolution:
$$F_{final} = \mathrm{ConvUp}(F'_{aligned})$$
This mechanism overcomes positioning errors by dynamically predicting and correcting spatial offsets through OffsetNet, while SimilarityNet adaptively evaluates the correlation between different modalities to optimize fusion weights, combining self-attention and cross-attention mechanisms to simultaneously strengthen intra-modal consistency and inter-modal complementarity. This design not only solves the limitations of traditional fixed alignment strategies but also fully exploits the complementary information of multimodal data through adaptive learning, providing high-quality fused feature representations for HDMap construction and SDMap updates in complex scenarios.
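The condensed PyTorch sketch below illustrates one adaptive cross-attention layer as described above: SimilarityNet predicts correlation weights, OffsetNet predicts offsets that warp the SDMap features via grid sampling, and a cross-attention reads the warped prior with sensor features as queries. Layer widths, the per-location offset granularity, and the flattening scheme are assumptions; the paper's exact configuration is not reproduced here.

```python
# Sketch of one bidirectional adaptive cross-attention layer (channels == dim).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.similarity = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                        nn.Linear(dim, 1), nn.Sigmoid())
        self.offset = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 2), nn.Tanh())
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_car, f_sd, pos):
        # f_car, f_sd, pos: (B, C, H, W) with C == dim; pos is the sine encoding.
        B, C, H, W = f_car.shape
        car = f_car.flatten(2).transpose(1, 2)            # (B, HW, C)
        sd = f_sd.flatten(2).transpose(1, 2)
        p = pos.flatten(2).transpose(1, 2)

        s = self.similarity(torch.cat([car, sd], dim=-1)) # (B, HW, 1) correlation weights
        offs = self.offset(car).reshape(B, H, W, 2)       # normalized spatial offsets

        # Warp the SDMap features by the predicted offsets (identity grid + offset).
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                                indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(f_sd).expand(B, H, W, 2)
        sd_al = F.grid_sample(f_sd, base + offs, align_corners=False)
        sd_al = sd_al.flatten(2).transpose(1, 2)

        car, _ = self.self_attn(car + p, car + p, car)    # intra-modal consistency
        fused, _ = self.cross_attn(car + p, sd_al + p, sd_al)
        fused = fused * s                                 # adaptive weighting by S
        return fused.transpose(1, 2).reshape(B, C, H, W)
```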

3.3. Multi-Task Inference Framework

This section utilizes the fused multimodal features $F_{final}$ to perform multi-task parallel inference through a BevEncode decoder [43] built on a ResNet-18 [44] backbone network. The framework not only generates high-precision HDMaps but also constructs an SDMap prediction branch and a confidence evaluation branch, forming a complete map perception system. The SDMap branch is used to update the prior map, while the confidence branch quantifies prediction reliability and supports the pre-trained confidence-guided masked autoencoder (CG-MAE) in further enhancing the detailed expression of HDMaps in low-confidence regions.
The input is the cross-modal fusion feature $F_{final} \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $B$ is the batch size, $C_{in}$ is the number of input channels, and $H$ and $W$ are the feature height and width, respectively. The output comprises four parts: the HDMap semantic segmentation map $M_{HD} \in \mathbb{R}^{B \times K \times H \times W}$ ($K$ is the number of semantic categories), the instance embedding $E_{HD} \in \mathbb{R}^{B \times D_e \times H \times W}$ ($D_e$ is the embedding dimension), the confidence estimate $C_{HD} \in \mathbb{R}^{B \times 1 \times H \times W}$, and the SDMap generation result $M_{SD} \in \mathbb{R}^{B \times 1 \times H \times W}$. The model architecture includes two main stages: feature extraction and multi-task decoding.

3.3.1. Multi-Task Feature Fusion and Encoding

To extract multi-level semantic and spatial information from the input BEV features, we employ a ResNet-18 backbone. $F_{final}$ is first passed through a $7 \times 7$ convolutional layer for initial downsampling to generate the feature $X$; after batch normalization and ReLU activation, it passes sequentially through the three residual blocks of ResNet-18 to obtain features of different depths $X_1, X_2, X_3$.
To achieve efficient decoding while preserving spatial details, we design a multi-task branch architecture based on skip connections. Specifically, we fuse the high-level semantic information of the deep features $X_3$ with the rich spatial details of the shallow features $X_1$. Each branch ensures the independence and synergy of the multi-task outputs through a shared feature fusion scheme and task-specific decoding modules. This yields four task-specific decoding branches (a compact sketch of these heads follows this list):
HDMap Semantic Segmentation $M_{HD}$: We fuse $X_3$ and $X_1$ and upsample by 4× to obtain the intermediate features $X_{sem}$; then, through 2× bilinear interpolation and convolution, we output the semantic segmentation result $M_{HD}$. Meanwhile, we construct the confidence ground truth $C_{gt}$ as a supervision signal:
$$C_{gt} = \max(\mathrm{softmax}(M_{HD}), \mathrm{dim}=1)$$
HDMap Instance Embedding $E_{HD}$: To achieve instance-level map perception, we fuse $X_3$ and $X_1$, upsample by 4× and adjust the channel number to generate the intermediate features $X_{embed}$; through 2× bilinear interpolation and convolution, we output the instance embedding vector $E_{HD}$.
HDMap Confidence Estimation $C_{HD}$: By predicting a confidence score for each pixel, we provide a reliability assessment for HDMap optimization and SDMap updates. We fuse $X_3$ and $X_1$, upsample by 4× and adjust the channel number to generate $X_{conf}$; after 2× bilinear interpolation and convolution, we output the confidence estimate $C_{HD}$ in the range [0, 1] through a Sigmoid activation.
SDMap Perception Result $M_{SD}$: We utilize the real-time perception results to optimize the prior map. We fuse $X_3$ and $X_1$, upsample by 4× and adjust the channel number to obtain $X_{sd}$; after 2× bilinear interpolation and convolution, we output the binary SDMap result $M_{SD}$ through a Sigmoid activation.
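A compact sketch of the four decoding heads is given below. It mirrors the skip-connection pattern described above (fuse $X_3$ with $X_1$, upsample, task-specific head); the channel numbers are placeholders, and the ResNet-18 trunk producing $X_1$ and $X_3$ is omitted.

```python
# Sketch of the four task heads sharing one skip-connection fusion pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskHead(nn.Module):
    def __init__(self, c_deep, c_shallow, c_out, final_act=None):
        super().__init__()
        self.reduce = nn.Conv2d(c_deep + c_shallow, 128, 3, padding=1)
        self.out = nn.Conv2d(128, c_out, 1)
        self.final_act = final_act

    def forward(self, x3, x1):
        x3_up = F.interpolate(x3, size=x1.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.reduce(torch.cat([x3_up, x1], dim=1)))   # skip-connection fusion
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.out(x)
        return self.final_act(x) if self.final_act else x

def build_heads(c_deep=256, c_shallow=64, num_classes=4, embed_dim=16):
    return nn.ModuleDict({
        "semantic":   TaskHead(c_deep, c_shallow, num_classes),      # M_HD
        "embedding":  TaskHead(c_deep, c_shallow, embed_dim),        # E_HD
        "confidence": TaskHead(c_deep, c_shallow, 1, nn.Sigmoid()),  # C_HD
        "sdmap":      TaskHead(c_deep, c_shallow, 1, nn.Sigmoid()),  # M_SD
    })
```

Given the semantic logits from the "semantic" head, the confidence ground truth of the equation above corresponds to torch.softmax(m_hd, dim=1).max(dim=1, keepdim=True).values.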

3.3.2. Multi-Task Optimization

To balance the optimization objectives among multiple tasks, we design a weighted combination loss function:
$$\mathcal{L} = w_{seg}\mathcal{L}_{seg} + w_{embed}\mathcal{L}_{embed} + w_{dist}\mathcal{L}_{dist} + w_{dir}\mathcal{L}_{dir} + w_{conf}\mathcal{L}_{conf} + w_{hd2sd}\mathcal{L}_{hd2sd} + w_{sim}\mathcal{L}_{sim}$$
where the weight coefficients $w$ are used to balance task priorities. The loss function includes the following key components (a minimal sketch of the weighted combination follows this list):
Semantic Segmentation Loss $\mathcal{L}_{seg}$: We adopt weighted binary cross-entropy, alleviating class imbalance by assigning higher weights to positive samples.
Embedding Loss $\mathcal{L}_{embed}$: Based on the discriminative loss framework, we combine variance, distance, and regularization losses to optimize intra-instance feature aggregation and inter-instance feature separation. The specific form is:
$$\mathcal{L}_{embed} = \mathcal{L}_{variance} + \mathcal{L}_{distance} + \mathcal{L}_{regularization}$$
where $\mathcal{L}_{variance}$ controls intra-instance feature aggregation, $\mathcal{L}_{distance}$ controls inter-instance feature separation, and $\mathcal{L}_{regularization}$ prevents excessive feature dispersion.
Distance Loss $\mathcal{L}_{dist}$: As part of the embedding loss, it is used to impose distance constraints between instances.
Direction Loss $\mathcal{L}_{dir}$: Implemented using BCELoss, it learns to predict the direction of road elements, helping the model understand road connection relationships.
Confidence Loss $\mathcal{L}_{conf}$: We use mean squared error to supervise the consistency between the confidence prediction and its ground truth, enhancing the model's perception of uncertain regions.
HDMap-to-SDMap Generation Loss $\mathcal{L}_{hd2sd}$: We adopt binary cross-entropy to optimize the SDMap prediction, supporting the dynamic map update requirements.
Similarity Loss $\mathcal{L}_{sim}$: Through cross-modal feature alignment, we enhance the collaborative expression of multimodal information. The similarity loss measures the consistency between BEV features and SDMap features and utilizes the inter-modal correlation weights $S$ predicted by SimilarityNet in the previous equations:
$$\mathcal{L}_{sim} = \alpha \cdot \|S - S_{gt}\|_2^2 + \beta \cdot \mathrm{CE}(F_{aligned}, F_{car})$$
where $S_{gt}$ is the ground-truth correlation indicator computed from the feature similarity matrix, obtained by normalizing the cosine similarity of spatially corresponding regions between the vehicle perception features $F_{car}$ and the SDMap features $F_{sd}$, and $\alpha$ and $\beta$ are balance weights. The $\alpha$ term supervises the accuracy of the SimilarityNet prediction, while the $\beta$ term ensures through a cross-entropy loss that the aligned features $F_{aligned}$ retain key information from the vehicle perception features $F_{car}$. This design enables the model to adaptively learn the spatial correspondence between modalities while promoting effective information fusion, providing a clear optimization objective for the SDMap bidirectional adaptive cross-attention mechanism described above.
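The weighted combination itself reduces to a few lines. In the sketch below the individual loss terms are passed in as callables (e.g., BCE for segmentation and direction, MSE for confidence), and the example weights mirror the configuration reported in Section 4.1; everything else is an illustrative assumption.

```python
# Sketch of the weighted multi-task loss combination.
import torch

def total_loss(preds: dict, targets: dict, losses: dict, weights: dict) -> torch.Tensor:
    """losses/weights: dicts keyed by task name ('seg', 'embed', 'dist', ...)."""
    return sum(weights[k] * losses[k](preds[k], targets[k]) for k in weights)

# Example weight configuration matching Section 4.1.
weights = dict(seg=1.0, embed=0.1, dist=0.1, dir=0.1, conf=0.1, hd2sd=0.5, sim=0.1)
```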

3.4. Confidence-Guided Masked Autoencoder (CG-MAE)

To address the poor quality of low-confidence regions during HDMap generation, this paper proposes a confidence-guided masked autoencoder (CG-MAE). The method enhances the model's ability to reconstruct details in low-confidence regions by explicitly modeling the spatial distribution characteristics of HDMaps, thereby improving generation robustness in complex scenarios. Compared with traditional masked autoencoders (MAE) [13] that adopt random or fixed masking strategies, CG-MAE better handles local uncertainty in HDMap generation.
CG-MAE is built on a Vision Transformer architecture and combines feature embedding with multi-task decoding modules, comprising three stages: feature embedding and mask generation, Transformer encoding and feature reconstruction, and multi-task decoding and optimization. From a system-level perspective, CG-MAE functions as a plug-in refinement pathway that is tightly coupled with the main HDMap branch yet does not alter its inference stability. It first derives dynamic masks from the confidence map $C_{HD}$ using a threshold $\varepsilon$ and complements them with a global masking ratio $\theta$ to emphasize uncertain spatial regions while preserving well-determined areas. The masked map is then partitioned and embedded by ConvEmbed, and a ViT-based Transformer encodes long-range structural dependencies of the road topology under positional embeddings. A lightweight demasking decoder (MAEHead) reconstructs fine-grained details specifically in the masked (low-confidence) regions, producing $O_{opt}$, which is subsequently fused with the primary prediction $M_{HD}$ through confidence gating to yield $O_{fuse}$.
The core idea is shown in Figure 2. By introducing a confidence-guided dynamic masking generation mechanism that prioritizes low-confidence regions, the masked features are fed into the Transformer encoder through positional embedding to capture inter-structural dependencies, and finally, the complete HDMap is reconstructed and detailed expression is enhanced through the decoding module, thereby solving the problem of insufficient details in low-confidence regions during HDMap generation.
This workflow preserves the original multimodal fusion benefits of Loop-MapNet, while explicitly routing representation capacity to ambiguous, far, or occluded areas, thereby enhancing detail fidelity without sacrificing global consistency or introducing optimization instability.

3.4.1. Feature Embedding and Confidence-Guided Masking Generation

The input masked map $M_{masked}$ is generated during dataset preprocessing from the HDMap confidence estimate $C_{HD}$ (as shown in Figure 3) and a random masking operation.
The mask generation process combines the confidence distribution $C_{HD}$ with a random masking strategy in a two-stage mechanism: first, regions with confidence below the threshold $\varepsilon$ are masked with priority under the global masking ratio $\theta$; then random masking is applied over the rest of the image to ensure a balanced mask distribution, yielding $M_{masked}$. The feature embedding module divides $M_{masked}$ into $s \times s$ patches and embeds them into a high-dimensional space:
$$X_{embed} = \mathrm{ConvEmbed}(M_{masked})$$
where ConvEmbed employs $s \times s$ convolutional kernels with stride $s$, outputting $X_{embed} \in \mathbb{R}^{B \times D \times H \times W}$.
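The two-stage mask generation can be sketched as follows: cells below the confidence threshold $\varepsilon$ are masked first, then random cells are added until the global ratio $\theta$ is met. The sampling order and tie-breaking are assumptions.

```python
# Sketch of confidence-guided two-stage mask generation (True = masked).
import torch

def confidence_guided_mask(conf: torch.Tensor, eps: float = 0.6, theta: float = 0.5):
    # conf: (H, W) confidence map in [0, 1]
    H, W = conf.shape
    budget = int(theta * H * W)
    mask = torch.zeros(H, W, dtype=torch.bool)

    low_conf = (conf < eps).flatten().nonzero(as_tuple=True)[0]
    take = low_conf[torch.randperm(len(low_conf))][:budget]   # prioritize uncertain cells
    mask.view(-1)[take] = True

    remaining = budget - int(mask.sum())
    if remaining > 0:                                         # top up with random cells
        free = (~mask).flatten().nonzero(as_tuple=True)[0]
        extra = free[torch.randperm(len(free))][:remaining]
        mask.view(-1)[extra] = True
    return mask
```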

3.4.2. Transformer Encoding and Feature Reconstruction

Following the idea of the Vision Transformer (ViT) [45], the masked feature sequence is processed with positional embedding and a multi-layer Transformer:
$$X_{enc} = \mathrm{Transformer}(X_{embed} + P_{pos})$$
where $P_{pos}$ is the positional encoding consistent with Section 3.2.2, and the Transformer contains 12 attention layers, outputting $X_{enc} \in \mathbb{R}^{B \times N \times D}$, where $N = H \times W$ is the total number of patches. The decoding stage restores the spatial resolution through two deconvolution layers with 2× upsampling and bilinear interpolation:
$$X_{recon} = \mathrm{UpSample}(X_{enc})$$
obtaining $X_{recon} \in \mathbb{R}^{B \times D \times H \times W}$ for reconstructing the original features of the masked regions.

3.4.3. Decoding, Optimization and Training Strategy

The reconstructed features $X_{recon}$ are processed by the decoding module (MAEHead, containing only the demasking decoder and segmentation head, not the Transformer encoder described above) to generate the optimized HDMap output $O_{opt}$:
$$O_{opt} = \mathrm{MAEHead}(X_{recon})$$
CG-MAE training follows three stages:
(1) Backbone Warm-up: Train the Loop-MapNet backbone network (Sections 3.1–3.3) without attaching CG-MAE to obtain stable $M_{HD}$ and $C_{HD}$; $C_{HD}$ is used to generate confidence-guided masking samples offline;
(2) CG-MAE Pre-training: Freeze the backbone and train CG-MAE on these samples so that it prioritizes reconstruction of low-confidence regions;
(3) Integration and Fine-tuning: Attach the pre-trained CG-MAE to the end of the main model to generate $O_{opt}$, and fuse it with $M_{HD}$ through confidence gating to obtain $O_{fuse}$ (one possible gating form is sketched after this list). During joint fine-tuning, the backbone keeps its original multi-task loss, and the CG-MAE encoder is frozen by default, with only the MAEHead and upsampling layers fine-tuned.
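The text does not spell out the gating function itself; one plausible form, shown below purely as an assumption, is a convex blend that trusts $M_{HD}$ where confidence is high and the CG-MAE reconstruction $O_{opt}$ where it is low.

```python
# Assumed confidence gating for fusing the CG-MAE output with the main prediction.
import torch

def confidence_gate(m_hd: torch.Tensor, o_opt: torch.Tensor, c_hd: torch.Tensor):
    # m_hd, o_opt: (B, K, H, W) predictions; c_hd: (B, 1, H, W) confidence in [0, 1]
    return c_hd * m_hd + (1.0 - c_hd) * o_opt   # O_fuse
```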

3.5. Dynamic SDMap Update Strategy

Reviewing the design motivation of this module, Loop-MapNet aims to enhance HDMap real-time perception using low-cost SDMap priors. To address the problem of HDMap accuracy degradation caused by SDMap static errors and lack of global consistency in online inference, we implement consistency enhancement and iterative optimization through a dynamic update strategy that fuses HDMap high-confidence regions with SDMap inference differences.
This strategy processes the model outputs $C_{HD}$ and $M_{SD}$ from the previous stage, together with the prior SDMap data $M_{sd\_ori}$, through Sigmoid activation to obtain probability maps, and achieves dynamic SDMap updates through confidence weighting, refreshing the SDMap database while gradually improving SDMap accuracy and its enhancement effect on HDMap generation.
First, we extract a high-confidence region mask from the confidence map $C_{HD}$ using the confidence threshold $\theta$ to screen reliable regions:
$$M_{conf} = \begin{cases} 1, & \text{if } C_{HD} > \theta \\ 0, & \text{otherwise} \end{cases}$$
Then, we compute the difference between the inferred SDMap $M_{SD}$ and the original SDMap $M_{sd\_ori}$:
$$\Delta M = (M_{SD} - M_{sd\_ori}) \in [-1, 1]$$
According to the confidence $C_{HD}$ and the mask $M_{conf}$, we weight and fuse the difference so that the update focuses on credible regions:
$$M_{sd\_new} = M_{sd\_ori} + \alpha \cdot C_{HD} \cdot M_{conf} \cdot \Delta M$$
where $\alpha \in (0, 1)$ is the update step size, and the factor $C_{HD} \cdot M_{conf}$ ensures that high-confidence regions dominate the update.
Subsequently, $M_{sd\_new}$ is constrained to $[0, 1]$ through a Sigmoid:
$$M_{sd\_new} \leftarrow \mathrm{Sigmoid}(M_{sd\_new})$$
Finally, we use the updated $M_{sd\_new}$ to replace $M_{sd\_ori}$ as the prior for the next round of inference, forming a closed-loop optimization that improves the robustness and accuracy of HDMap generation.
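Because the update rule is fully specified by the equations above, it can be transcribed almost directly; only the default values of $\theta$ and $\alpha$ in the sketch below are placeholders.

```python
# Transcription of the dynamic SDMap update rule of Section 3.5.
import torch

def update_sdmap(m_sd_prior, m_sd_pred, c_hd, theta: float = 0.6, alpha: float = 0.3):
    # All inputs are (B, 1, H, W) probability/confidence maps in [0, 1].
    m_conf = (c_hd > theta).float()                     # high-confidence region mask
    delta = (m_sd_pred - m_sd_prior).clamp(-1.0, 1.0)   # inference vs. prior difference
    m_sd_new = m_sd_prior + alpha * c_hd * m_conf * delta
    return torch.sigmoid(m_sd_new)                      # constrain the update to [0, 1]
```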

4. Experiments and Discussion

4.1. Implementation Details

Dataset. Our experiments utilize the nuScenes dataset, comprising 1000 scenes with each scene lasting 20 s at a sampling frequency of 2 Hz, covering diverse urban environments in Boston and Singapore. The dataset provides six surround-view camera images (1600 × 900 resolution), 32-beam LiDAR point clouds, and high-precision HDMap annotations. We follow the official training set (700 scenes) and validation set (150 scenes) division, with the test set specifically selected to include typical scenarios such as crossroads, T-junctions, and curves. OpenStreetMap (OSM) provides SDMap priors, which are preprocessed and aligned with the nuScenes coordinate system to simulate low-cost map information.
Evaluation Metrics. We employ mean Intersection over Union (mIoU) for semantic segmentation, evaluating the accuracy of three object categories: Drivable areas (Div), Pedestrian areas (Ped), and road Boundaries (Bound). Additionally, we monitor model inference speed (FPS) to assess real-time performance.
Experimental Settings. The network is trained using the Adam optimizer with an initial learning rate of 0.0005, employing a step decay schedule (adjusted every 10 epochs). The batch size is set to 8, training for 30 epochs. The multi-task loss weights are configured as $w_{seg} = 1.0$, $w_{embed} = 0.1$, $w_{dist} = 0.1$, $w_{dir} = 0.1$, $w_{conf} = 0.1$, $w_{hd2sd} = 0.5$, and $w_{sim} = 0.1$. The balance weights of the similarity loss $\mathcal{L}_{sim}$ are $\alpha = 0.6$ and $\beta = 0.4$. In CG-MAE pre-training, the confidence threshold $\varepsilon$ is set to 0.6 and the global masking ratio $\theta$ to 0.5. The image input size is 128 × 352, and the point cloud range is configured with three perception scales according to the experimental requirements: 60 × 30 m, 120 × 60 m, and 240 × 60 m, to evaluate model performance under different scenarios.
Experimental Environment. Implementation is based on PyTorch 1.10, using 3×NVIDIA Tesla A100 80G for parallel training, with a single NVIDIA GeForce RTX 3090 for inference performance evaluation.

4.2. Far-Range Perception Experiments

To systematically evaluate the far-range perception performance of Loop-MapNet’s multimodal perception enhancement architecture, we configured three different scales of perception ranges: 60 × 30 m, 120 × 60 m, and 240 × 60 m, conducting comparative analysis with HDMapNet [16] and P-MapNet [13].
As shown in Table 1, within the 60 × 30 m range, our method achieves an average accuracy of 52.79%, improving by 8.81% and 0.62% compared to HDMapNet [16] and P-MapNet [13], respectively. When the perception range extends to 120 × 60 m, our method reaches 61.05% average accuracy, improving by 9.66% and 0.63% compared to HDMapNet and P-MapNet, respectively. When the perception range extends to 240 × 60 m, environmental details degrade at long distances in the sensor data and HDMapNet's accuracy declines significantly, while Loop-MapNet maintains 49.92% accuracy, validating the effectiveness of multimodal fusion for far-range perception.
As illustrated in Figure 4, within the maximum test range of 240 × 60 m, the actual field of view approaches the physical perception limits of sensors. Figure 4 demonstrates our method’s clear advantages over baseline algorithms in far-range intersection details and lane line position detail expression. Figure 5 shows that under night scenarios, HDMapNet [16] exhibits mapping inaccuracy in far-range areas, while Loop-MapNet can more accurately reconstruct intersection structures and lane lines, providing sufficient prior information for vehicle advance planning.
Although our method’s inference speed is slightly lower than baseline algorithms, in HDMap mapping tasks, mapping accuracy and quality are far more important than real-time requirements, making this performance trade-off acceptable in practical applications.

4.3. Robustness to SDMap Prior Alignment Errors

To evaluate the system’s robustness to alignment errors between SDMap priors and sensor data, we simulate localization uncertainty under urban canyon effects by adding random perturbations to the ego vehicle position and heading in the nuScenes dataset. The experiments apply maximum random perturbations of 5 m and 20 m to the ego vehicle position coordinates, while simultaneously applying 1° and 5° perturbations to heading angles, selecting intersection scenarios for comparative analysis.
As shown in Table 2, under conditions without localization perturbations, our method outperforms P-MapNet [13] across all metrics, with mIoU improving by 0.77%. When introducing minor perturbations (5 m/1°), P-MapNet’s mIoU drops by 0.54%, while our method drops by only 0.07%, demonstrating excellent error adaptation capability. Under extreme perturbations (20 m/5°), P-MapNet’s mIoU drops by 3.36%, while our method drops by only 0.59%, fully demonstrating the proposed method’s robustness to significant localization deviations.
As shown in Figure 6, under extreme localization error conditions (20 m/5°), P-MapNet experiences severe SDMap prior misalignment leading to delayed mapping of forward intersections, while Loop-MapNet effectively compensates for localization deviations through the bidirectional adaptive cross-attention mechanism, maintaining accurate mapping performance. Particularly in boundary regions (Bound.), our method drops by only 1.2% under maximum perturbations, while the baseline method drops by 4.58%, validating the effectiveness of the adaptive feature alignment mechanism in maintaining road topology structure.

4.4. Ablation Study: CG-MAE Pre-Training Models

To verify the optimization effect of the confidence-guided masked autoencoder (CG-MAE) on HDMap mapping details, we designed comparative experiments for pre-training models: (1) no-pre-training baseline; (2) traditional MAE pre-training [13,46]; (3) the proposed CG-MAE pre-training. All experiments use the same training cycle (30 epochs) and perception range (120 × 60 m).
As shown in Table 3, compared to the no-pre-training baseline, MAE pre-training improves mIoU by 0.83%, while CG-MAE further improves it to 61.05%, achieving a 1.72% improvement. Particularly in the pedestrian area (Ped.) and road boundary (Bound.) metrics, CG-MAE improves by 1.64% and 1.95%, respectively, compared to the no-pre-training baseline.
As shown in Figure 7, under rainy low-visibility scenarios, the no-pre-training model can perceive distant intersections but exhibits blurred road edges and inaccurate topological connections; MAE pre-training improves intersection structure expression but still contains errors in junction connection relationships; CG-MAE pre-training significantly enhances complex intersection topological details, maintaining clear edge features even under low-visibility conditions.
CG-MAE, through its confidence-guided dynamic masking strategy, enables the model to focus more on data uncertain regions during pre-training, learning more discriminative feature representations. Experiments demonstrate that this strategy, while inheriting the advantages of self-supervised learning, significantly enhances HDMap detail perception capability in low-confidence regions, providing more reliable high-precision map support for autonomous driving systems.

4.5. Impact of HDMap Closed-Loop Updates on SDMap Perception Performance Under Road Network Structure Changes

To evaluate the performance improvement of Loop-MapNet’s closed-loop update mechanism under SDMap update lag scenarios, we simulate SDMap update lag through manual road network structure perturbations, focusing on verifying the enhancement effect of the closed-loop data on HDMap perception accuracy. The experiments specifically select visibility-limited scenarios, such as rainy, foggy weather and nighttime, for testing.
Considering that local road network changes have a limited impact on global IoU metrics, this experiment primarily employs qualitative analysis methods, focusing on examining the improvement effect of closed-loop update mechanisms on SDMap prior quality.
Experimental results indicate that under rainy, foggy weather conditions (Figure 8), the original SDMap prior fails to model distant intersections due to visibility limitations. After Loop-MapNet closed-loop updates, the system achieves accurate intersection reconstruction under the same visibility conditions, with intersection connection relationships and lane edge geometric expressions significantly improved.
Night scenario testing (Figure 9) shows that the original SDMap is limited by camera visibility in narrow road areas, primarily relying on immediate perception from onboard sensors. Through closed-loop updates, when the system revisits the same section, it can accurately reconstruct road topology structures even under equivalent visibility conditions, validating the compensatory effect of closed-loop data on sensor limitations.
By feeding high-confidence HDMap predictions back to SDMap updates, Loop-MapNet effectively improves reconstruction quality at changed intersections, particularly excelling in visibility-limited environments. This closed-loop data mechanism ensures continuous optimization of SDMap priors, providing a more reliable foundation for HDMap construction and offering an effective solution for long-term maintenance of high-precision maps.

5. Conclusions

The Loop-MapNet framework proposed in this paper innovatively realizes a “perception-mapping-update-reperception” closed-loop system, breaking through the traditional one-way prior injection paradigm and achieving collaborative evolution between HDMaps and SDMaps for the first time. The main innovations include: (1) bidirectional adaptive cross-attention mechanism for dynamic alignment of heterogeneous features; (2) confidence-guided masked autoencoder for enhancing detail expression in low-confidence regions; (3) dynamic SDMap update strategy based on high-confidence predictions, forming a closed-loop ecosystem of map data.
Experiments validate the framework’s superior performance across multiple challenging scenarios: the multimodal fusion architecture significantly outperforms single-modal methods in far-range perception; the adaptive alignment mechanism maintains stable performance under extreme localization errors; CG-MAE pre-training significantly improves mapping accuracy in low-visibility regions; the closed-loop update mechanism effectively improves perception capability for changing road networks.
Future work will focus on the following: (1) optimizing computational efficiency for real-time deployment (e.g., pruning/quantization/distillation and TensorRT/edge acceleration toward >15 FPS without accuracy loss); (2) expanding multi-vehicle collaborative updates with confidence-weighted fusion, conflict resolution, and privacy-aware communication to build a collective intelligent map ecosystem; (3) enhancing cross-city generalization and urban-scale scalability via hierarchical map representations and cloud–edge collaboration for efficient storage, retrieval, and incremental updates; (4) establishing a raster-to-vector pipeline and cloud-side global fusion to translate local raster updates into city-scale vector maps.

Author Contributions

Y.T., J.H. and D.Z. conceived and designed the study; Y.T., F.Z. and X.C. performed the experiments; Y.T. and F.Z. analyzed the data; Y.T. and W.X. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Major Program (JD) of Hubei Province (No. 2023BAA017), in part by the Innovative Group Project of the Natural Science Foundation of Hubei Province (No. 2023AFA037), and in part by the Digital Twin System for Factory Fishery Farming System project (No. 2023010402010589).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Acknowledgments

We sincerely thank Zhou Jiang et al., the authors of P-MapNet, for openly sharing their research and datasets, which provided a solid foundation for the efficient execution of this work.

Conflicts of Interest

The authors declare that they have no competing interests.

Abbreviations

HDMap: A high-definition map with centimeter-level geometry and rich semantics supporting localization and planning.
SDMap: A low-cost prior providing road topology and coarse geometry to assist online perception.
OpenStreetMap (OSM): A community-curated SD map source used as a prior for topology and geometry.
BEV: A top-down representation on the ground plane aligned with the map frame.
Bidirectional Adaptive Cross-Attention: A cross-modal alignment module predicting inter-modal correlation and spatial offsets.
SimilarityNet: A network predicting inter-modal correlation weights for fusion.
OffsetNet: A network predicting spatial offsets to align SDMap features.
GridSample: A differentiable sampling operator applying learned offsets for alignment.
CG-MAE: A confidence-guided masked autoencoder that reconstructs low-confidence regions to enhance details.
Confidence Gating: A fusion scheme combining CG-MAE output with the main prediction based on confidence.
Spatiotemporal Consistency: A constraint enforcing temporal and spatial coherence during SDMap updates.
Similarity Loss: A cross-modal alignment loss enforcing consistency between BEV and SDMap features.

References

  1. Wang, C.; Aouf, N. Explainable deep adversarial reinforcement learning approach for robust autonomous driving. IEEE Trans. Intell. Veh. 2024, 10, 2551–2563. [Google Scholar] [CrossRef]
  2. Reda, M.; Onsy, A.; Haikal, A.Y.; Ghanbari, A. Path planning algorithms in the autonomous driving system: A comprehensive review. Robot. Auton. Syst. 2024, 174, 104630. [Google Scholar] [CrossRef]
  3. Xiao, Z.; Yang, D.; Wen, T.; Jiang, K.; Yan, R. Monocular localization with vector HD map (MLVHM): A low-cost method for commercial IVs. Sensors 2020, 20, 1870. [Google Scholar] [CrossRef]
  4. Bao, Z.; Hossain, S.; Lang, H.; Lin, X. A review of high-definition map creation methods for autonomous driving. Eng. Appl. Artif. Intell. 2023, 122, 106125. [Google Scholar] [CrossRef]
  5. Liu, R.; Wang, J.; Zhang, B. High definition map for automated driving: Overview and analysis. J. Navig. 2020, 73, 324–341. [Google Scholar] [CrossRef]
  6. Tang, K.; Cao, X.; Cao, Z.; Zhou, T.; Li, E.; Liu, A.; Zou, S.; Liu, C.; Mei, S.; Sizikova, E.; et al. THMA: Tencent HD map AI system for creating HD map annotations. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 15585–15593. [Google Scholar]
  7. Chen, S.; Zhang, Y.; Liao, B.; Xie, J.; Cheng, T.; Sui, W.; Zhang, Q.; Huang, C.; Liu, W.; Wang, X. VMA: Divide-and-conquer vectorized map annotation system for large-scale driving scene. arXiv 2023, arXiv:2304.09807. [Google Scholar]
  8. Chen, Z.; Deng, L.; Luo, Y.; Li, D.; Junior, J.M.; Gonçalves, W.N.; Nurunnabi, A.A.M.; Li, J.; Wang, C.; Li, D. Road extraction in remote sensing data: A survey. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102833. [Google Scholar] [CrossRef]
  9. Alshehhi, R.; Marpu, P.R. Hierarchical graph-based segmentation for extracting road networks from high-resolution satellite images. ISPRS J. Photogramm. Remote Sens. 2017, 126, 245–260. [Google Scholar] [CrossRef]
  10. Zheng, S.; Wang, J.; Rizos, C.; Ding, W.; El-Mowafy, A. SLAM for autonomous driving: Concept and analysis. Remote Sens. 2023, 15, 1156. [Google Scholar] [CrossRef]
  11. Batra, A.; Singh, S.; Pang, G.; Basu, S.; Jawahar, C.; Paluri, M. Improved road connectivity by joint learning of orientation and segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  12. Li, J.; Jia, P.; Chen, J.; Liu, J.; He, L.; Li, K. Local Map Construction with SDMap: A Comprehensive Survey. arXiv 2024, arXiv:2409.02415. [Google Scholar]
  13. Jiang, Z.; Zhu, Z.; Li, P.; Gao, H.-A.; Yuan, T.; Shi, Y.; Zhao, H.; Zhao, H. P-MapNet: Far-seeing map generator enhanced by both SDMap and HDMap priors. IEEE Robot. Autom. Lett. 2024, arXiv:2403.10521. [Google Scholar] [CrossRef]
  14. Xiang, C.; Feng, C.; Xie, X.; Shi, B.; Lu, H.; Lv, Y.; Yang, M.; Niu, Z. Multi-sensor fusion and cooperative perception for autonomous driving: A review. IEEE Intell. Transp. Syst. Mag. 2023, 15, 36–58. [Google Scholar] [CrossRef]
  15. Liao, B.; Chen, S.; Wang, X.; Cheng, T.; Zhang, Q.; Liu, W.; Huang, C. MapTR: Structured modeling and learning for online vectorized HD map construction. arXiv 2022, arXiv:2208.14437. [Google Scholar]
  16. Li, Q.; Wang, Y.; Chen, L. HDMapNet: An Online HD Map Construction and Evaluation Framework. IEEE T-ITS 2022, 23, 2105–2118. [Google Scholar]
  17. Wu, H.; Zhang, Z.; Lin, S.; Qin, T.; Pan, J.; Zhao, Q.; Xu, C.; Yang, M. BLOS-BEV: Navigation map enhanced lane segmentation network, beyond line of sight. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024. [Google Scholar]
  18. Liao, B.; Chen, S.; Zhang, Y.; Jiang, B.; Zhang, Q.; Liu, W.; Huang, C.; Wang, X. MapTRv2: An end-to-end framework for online vectorized HD map construction. Int. J. Comput. Vis. 2025, 133, 1352–1374. [Google Scholar] [CrossRef]
  19. Tang, X.; Jiang, K.; Yang, M.; Liu, Z.; Jia, P.; Wijaya, B.; Wen, T.; Cui, L.; Yang, D. High-definition maps construction based on visual sensor: A comprehensive survey. IEEE Trans. Intell. Veh. 2024, 9, 5973–5994. [Google Scholar] [CrossRef]
  20. Zhu, X.; Cao, X.; Dong, Z.; Zhou, C.; Liu, Q.; Li, W.; Wang, Y. NeMo: Neural map growing system for spatiotemporal fusion in bird’s-eye view and BDD-Map benchmark. arXiv 2023, arXiv:2306.04540. [Google Scholar]
  21. Liu, Y.; Yuan, T.; Wang, Y.; Wang, Y.; Zhao, H. VectorMapNet: End-to-end vectorized HD map learning. In Proceedings of the 2023 International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 22352–22369. [Google Scholar]
  22. Tang, L.; Deng, Y.; Ma, Y.; Huang, J.; Ma, J. SuperFusion: A versatile image registration and fusion network with semantic awareness. IEEE/CAA J. Autom. Sin. 2022, 9, 2121–2137. [Google Scholar] [CrossRef]
  23. Bastani, F.; He, S.; Abbar, S.; Alizadeh, M.; Balakrishnan, H.; Chawla, S.; Madden, S.; DeWitt, D. RoadTracer: Automatic extraction of road networks from aerial images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4720–4728. [Google Scholar]
  24. He, S.; Bastani, F.; Jagwani, S.; Alizadeh, M.; Balakrishnan, H.; Chawla, S.; Elshrif, M.M.; Madden, S.; Sadeghi, A. Sat2Graph: Road graph extraction through graph-tensor encoding. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 51–67. [Google Scholar]
  25. Van Etten, A.; Lindenbaum, D.; Bacastow, T.M. SpaceNet: A remote sensing dataset and challenge series. arXiv 2018, arXiv:1807.01232. [Google Scholar]
  26. Mai, G.; Huang, W.; Sun, J.; Song, S.; Mishra, D.; Liu, N.; Gao, S.; Liu, T.; Cong, G.; Hu, Y.; et al. Opportunities and challenges of foundation models for GeoAI. ACM T-SAS 2024, 10, 1–46. [Google Scholar]
  27. Qin, T.; Zheng, Y.; Chen, T.; Chen, Y.; Su, Q. A light-weight semantic map for visual localization towards autonomous driving. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11248–11254. [Google Scholar]
  28. Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual SLAM: From tradition to semantic. Remote Sens. 2022, 14, 3010. [Google Scholar] [CrossRef]
  29. Alsadik, B.; Karam, S. The SLAM: An overview. Surv. Geospat. Eng. J. 2021, 1, 1–12. [Google Scholar]
  30. Mooney, P.; Minghini, M. A review of OpenStreetMap data. In Mapping and the Citizen Sensor; Ubiquity Press: London, UK, 2017; pp. 37–59. [Google Scholar]
  31. Zhang, H.; Paz, D.; Guo, Y.; Das, A.; Huang, X.; Haug, K.; Christensen, H.I.; Ren, L. Enhancing online road network perception and reasoning with standard definition maps. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 1086–1093. [Google Scholar]
  32. Plachetka, C.; Maier, N.; Fricke, J.; Termohlen, J.-A.; Fingscheidt, T. Terminology and analysis of map deviations in urban domains. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 63–70. [Google Scholar]
  33. Xu, H.; Xiao, Y.; Li, W.; Hu, Y. Generating Synthetic Deviation Maps for Prior-Enhanced Vectorized HD Map Construction. In Proceedings of the 2025 IEEE Intelligent Vehicles Symposium (IV), Cluj-Napoca, Romania, 22–25 June 2025; pp. 419–426. [Google Scholar]
  34. Biagioni, J.; Eriksson, J. Map inference in the face of noise and disparity. In Proceedings of the SIGSPATIAL ’12: Proceedings of the 20th International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, USA, 6–9 November 2012; pp. 79–88. [Google Scholar]
  35. Zhang, M.; Zhang, Y.; Zhang, L.; Liu, C.; Khurshid, S. DeepRoad: GAN-based metamorphic testing and input validation for ADS. In Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), Montpellier, France, 3–7 September 2018. [Google Scholar]
  36. Xia, D.; Zhang, W.; Liu, X.; Zhang, W.; Gong, C.; Tan, X.; Huang, J.; Yang, M.; Yang, D. LDMapNet-U: An End-to-End System for City-Scale Lane-Level Map Updating. arXiv 2025, arXiv:2501.02763. [Google Scholar]
  37. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  38. Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding images by implicitly unprojecting to 3D. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  39. Koonce, B. EfficientNet. In CNNs with Swift for TensorFlow; Apress: Berkeley, CA, USA, 2021; pp. 109–123. [Google Scholar]
  40. Bertozzi, M.; Broggi, A.; Fascioli, A. Stereo inverse perspective mapping: Theory and applications. Image Vis. Comput. 1998, 16, 585–590. [Google Scholar] [CrossRef]
  41. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  42. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  43. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance multi-camera 3D object detection in BEV. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  44. Targ, S.; Almeida, D.; Lyman, K. ResNet in ResNet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  46. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Figure 1. Loop-MapNet overview. (a) Surround-view image features: EfficientNet-B4 + BiFPN for multi-scale fusion, then BEV via view transformation. (b) LiDAR features: PointPillars voxel encoding and CNN backbone to BEV. (c) SDMap features and update: local SDMap extraction and CNN encoding with an update pipeline. (d) BEV fusion: channel-wise concatenation for camera+LiDAR. (e) Cross-modal alignment: bidirectional adaptive cross-attention aligns three modalities in BEV with dynamic positional encoding. (f) CG-MAE: confidence-guided masked autoencoder enhances BEV details and produces confidence maps. (g) Multi-task decoding and closed-loop SDMap update via spatiotemporal consistency.
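For readers who prefer code to diagrams, the following minimal PyTorch-style sketch illustrates the fusion stages in panels (d) and (e): camera and LiDAR BEV features are concatenated channel-wise, and the fused perception features then exchange information with the SDMap BEV features through bidirectional cross-attention. All module names, channel sizes, and the residual merge are illustrative assumptions rather than the released implementation, and the dynamic positional encoding mentioned in the caption is omitted for brevity.

```python
# Illustrative sketch only; layer names, channel sizes, and attention details are assumptions.
import torch
import torch.nn as nn

class BidirectionalBEVAlignment(nn.Module):
    """Fuse camera + LiDAR BEV features and exchange information with SDMap BEV features."""

    def __init__(self, cam_ch=128, lidar_ch=128, sd_ch=64, d_model=256, n_heads=8):
        super().__init__()
        # (d) channel-wise concatenation of camera and LiDAR BEV maps, then a 1x1 reduction
        self.reduce = nn.Conv2d(cam_ch + lidar_ch, d_model, kernel_size=1)
        self.sd_proj = nn.Conv2d(sd_ch, d_model, kernel_size=1)
        # (e) bidirectional cross-attention: perception queries the SDMap prior, and vice versa
        self.percep_to_sd = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sd_to_percep = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, cam_bev, lidar_bev, sd_bev):
        # cam_bev / lidar_bev / sd_bev: (B, C, H, W) rasters on a shared BEV grid
        fused = self.reduce(torch.cat([cam_bev, lidar_bev], dim=1))  # (B, D, H, W)
        sd = self.sd_proj(sd_bev)                                    # (B, D, H, W)
        B, D, H, W = fused.shape
        f_seq = fused.flatten(2).transpose(1, 2)                     # (B, H*W, D)
        s_seq = sd.flatten(2).transpose(1, 2)                        # (B, H*W, D)
        # Perception features attend to SDMap priors, and SDMap features attend back
        f_aligned, _ = self.percep_to_sd(query=f_seq, key=s_seq, value=s_seq)
        s_aligned, _ = self.sd_to_percep(query=s_seq, key=f_seq, value=f_seq)
        # Residual merge back onto the BEV grid (positional encoding omitted here)
        out = (f_seq + f_aligned + s_aligned).transpose(1, 2).reshape(B, D, H, W)
        return out

# Example usage with dummy BEV rasters on an assumed 50 x 100 grid
cam, lidar, sd = torch.randn(1, 128, 50, 100), torch.randn(1, 128, 50, 100), torch.randn(1, 64, 50, 100)
aligned = BidirectionalBEVAlignment()(cam, lidar, sd)  # -> (1, 256, 50, 100)
```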
Figure 2. CG-MAE pipeline: a confidence-guided mask derived from the confidence map C_HD and the global masking ratio θ is applied to the BEV features, which are embedded and encoded by ConvEmbed + Transformer; a demasking decoder reconstructs low-confidence details, and the output is fused with the main prediction through confidence gating to strengthen uncertain regions while preserving global topology.
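As a rough sketch of the confidence-guided masking step described above, the snippet below selects a global fraction θ of BEV patches for masking, biased toward the lowest values of the confidence map C_HD so that the decoder is forced to reconstruct the most uncertain regions. The patch size, tensor shapes, and the hard lowest-first selection rule are assumptions made for illustration; the actual sampling strategy may differ.

```python
# Illustrative sketch of confidence-guided masking; shapes and the selection rule are assumptions.
import torch

def confidence_guided_mask(c_hd: torch.Tensor, theta: float = 0.5, patch: int = 16) -> torch.Tensor:
    """Return a boolean patch mask (True = masked) over the BEV grid.

    c_hd : (H, W) confidence map in [0, 1]; low values mark uncertain regions.
    theta: global masking ratio, i.e. the fraction of patches to mask.
    Assumes H and W are divisible by `patch`.
    """
    H, W = c_hd.shape
    # Average confidence per non-overlapping patch
    patches = c_hd.unfold(0, patch, patch).unfold(1, patch, patch)  # (H/p, W/p, p, p)
    patch_conf = patches.mean(dim=(-1, -2)).flatten()               # (N,)
    n_mask = int(theta * patch_conf.numel())
    # Mask the lowest-confidence patches first so reconstruction focuses on uncertain areas
    order = torch.argsort(patch_conf)                               # ascending: least confident first
    mask = torch.zeros_like(patch_conf, dtype=torch.bool)
    mask[order[:n_mask]] = True
    return mask.reshape(H // patch, W // patch)

# Example: mask half of the patches of a dummy 128 x 128 confidence map
m = confidence_guided_mask(torch.rand(128, 128), theta=0.5, patch=16)
print(m.shape, m.float().mean())  # torch.Size([8, 8]), ~0.5 of patches masked
```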
Figure 3. Visualization of C_HD; darker regions indicate lower confidence, typically at far range, under occlusion, or where sensor coverage is sparse.
Figure 4. Daytime crossroads scenario at the 240 × 60 m perception range. (a) Surround-view images; (b) LiDAR point clouds with the prior SDMap overlaid; (c) SDMap ground truth; (d) HDMapNet inference results; (e) P-MapNet inference results; (f) Loop-MapNet inference results.
Figure 5. Night scenario at the 240 × 60 m perception range. (a) Surround-view images; (b) LiDAR point clouds with the prior SDMap overlaid; (c) SDMap ground truth; (d) HDMapNet inference results; (e) P-MapNet inference results; (f) Loop-MapNet inference results.
Figure 6. HDMap construction quality with SDMap priors under different localization errors at the 120 × 60 m perception range. (a) Surround-view camera images; (b) SDMap prior position in the LiDAR coordinate system; (c) HDMap ground truth; (d) under a maximum error of 5 m and 1°, both P-MapNet and Loop-MapNet accurately perceive road intersections at longer distances; (e) under a maximum error of 20 m and 5°, the large SDMap prior error causes P-MapNet to model the forward intersection later than Loop-MapNet.
Figure 7. HDMap inference results under different pre-training models at the 120 × 60 m perception range. (a) Surround-view camera images; (b) HDMap ground truth; (c) HDMap inference without pre-training, showing poor detail in far-range intersection construction; (d) HDMap inference with the MAE pre-trained model; (e) HDMap inference with the CG-MAE pre-trained model.
Figure 8. Effect of the SDMap prior on HDMap inference under road-network changes in a rainy, foggy scenario at the 120 × 60 m perception range. (a) Surround-view camera images. (b) HDMap ground truth. (c) HDMap inferred using the original SDMap prior. (d) HDMap inferred using the updated SDMap prior. (e) Original SDMap prior. (f) Updated SDMap after traversal.
Figure 9. Effect of the SDMap prior on HDMap inference under road-network changes in a night scenario at the 120 × 60 m perception range. (a) Surround-view camera images. (b) HDMap ground truth. (c) HDMap inferred using the original SDMap prior. (d) HDMap inferred using the updated SDMap prior. (e) Original SDMap prior. (f) Updated SDMap after traversal.
Table 1. Quantitative results across different perception ranges. Best results in bold.
Range (m)  | Method        | Epochs | Modality   | Div. (%) | Ped. (%) | Bound. (%) | mIoU (%) | FPS
60 × 30    | HDMapNet [16] | 30     | C + L      | 45.13    | 30.76    | 56.05      | 43.98    | 21.62
60 × 30    | P-MapNet [13] | 30     | C + L + SD | 53.35    | 39.81    | 63.36      | 52.17    | 9.65
60 × 30    | Ours          | 30     | C + L + SD | 54.26    | 40.13    | 63.97      | 52.79    | 8.10
120 × 60   | HDMapNet [16] | 30     | C + L      | 54.18    | 38.03    | 57.92      | 50.04    | 21.55
120 × 60   | P-MapNet [13] | 30     | C + L + SD | 63.64    | 50.25    | 66.95      | 60.28    | 9.92
120 × 60   | Ours          | 30     | C + L + SD | 64.26    | 51.32    | 67.58      | 61.05    | 7.53
240 × 60   | HDMapNet [16] | 30     | C + L      | 39.69    | 26.42    | 43.53      | 36.55    | 13.58
240 × 60   | P-MapNet [13] | 30     | C + L + SD | 52.46    | 41.83    | 53.56      | 49.28    | 6.92
240 × 60   | Ours          | 30     | C + L + SD | 53.26    | 42.14    | 54.37      | 49.92    | 6.50
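For reference, the mIoU column is consistent with the unweighted mean of the three per-class IoUs (divider, pedestrian crossing, boundary); the short check below reproduces the reported values for the 120 × 60 m rows.

```python
# Sanity check: mIoU in Table 1 equals the mean of the three class IoUs (Div., Ped., Bound.).
rows_120x60 = {
    "HDMapNet":            (54.18, 38.03, 57.92),  # reported mIoU 50.04
    "P-MapNet":            (63.64, 50.25, 66.95),  # reported mIoU 60.28
    "Loop-MapNet (Ours)":  (64.26, 51.32, 67.58),  # reported mIoU 61.05
}
for name, ious in rows_120x60.items():
    print(f"{name}: mIoU = {sum(ious) / len(ious):.2f}")
```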
Table 2. Quantitative results under different localization errors at 120 × 60 m perception range. Best results in bold.
Max Position Error | Max Yaw Error | Method        | Div. (%) | Ped. (%) | Bound. (%) | mIoU (%)
0 m                | 0°            | P-MapNet [13] | 63.64    | 50.25    | 66.95      | 60.28
0 m                | 0°            | Ours          | 64.26    | 51.32    | 67.58      | 61.05
5 m                | 1°            | P-MapNet [13] | 63.31    | 49.87    | 66.03      | 59.74
5 m                | 1°            | Ours          | 64.07    | 51.43    | 67.45      | 60.98
20 m               | 5°            | P-MapNet [13] | 60.75    | 47.63    | 62.37      | 56.92
20 m               | 5°            | Ours          | 63.96    | 51.05    | 66.38      | 60.46
Table 3. HDMap inference metrics under different pre-training models at the 120 × 60 m perception range. Best results in bold.
Pre-Training   | Div. (%) | Ped. (%) | Bound. (%) | mIoU (%) | FPS
None           | 62.67    | 49.68    | 65.63      | 59.33    | 13.80
MAE [13,46]    | 63.14    | 50.26    | 67.08      | 60.16    | 7.36
CG-MAE (Ours)  | 64.26    | 51.32    | 67.58      | 61.05    | 6.53