Article

KPMapNet: Keypoint Representation Learning for Online Vectorized High-Definition Map Construction

Bicheng Jin, Wenyu Hao, Wenzhao Qiu and Shanmin Pang
1 School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China
2 China Academy of Electronics and Information Technology, Beijing 100041, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(6), 1897; https://doi.org/10.3390/s25061897
Submission received: 28 December 2024 / Revised: 7 March 2025 / Accepted: 17 March 2025 / Published: 18 March 2025
(This article belongs to the Special Issue Computer Vision and Sensor Fusion for Autonomous Vehicles)

Abstract

Vectorized high-definition (HD) map construction is a critical task in the autonomous driving domain. The existing methods typically represent vectorized map elements with a fixed number of points, establishing robust baselines for this task. However, the inherent shape priors introduce additional shape errors, which in turn lead to error accumulation in the downstream tasks. Moreover, the subtle and sparse nature of the annotations limits detection-based frameworks in accurately extracting the relevant features, often resulting in the loss of fine structural details in the predictions. To address these challenges, this work presents KPMapNet, an end-to-end framework that redefines the ground truth training representation of vectorized map elements to achieve precise topological representations. Specifically, the conventional equidistant sampling method is modified to better preserve the geometric features of the original instances while maintaining a fixed number of points. In addition, a map mask fusion module and an enhanced hybrid attention module are incorporated to mitigate the issues introduced by the new representation. Moreover, a novel point-line matching loss function is introduced to further refine the training process. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that KPMapNet achieves state-of-the-art performance, with 75.1 mAP on nuScenes and 74.2 mAP on Argoverse2. The visualization results further corroborate the enhanced accuracy of the map generation outcomes.

1. Introduction

High-definition (HD) maps are critical components in autonomous driving systems, as they provide essential environmental information for tasks such as trajectory forecasting [1,2,3], path planning [4,5,6], and other downstream applications [7,8]. Traditionally, the creation of these maps has depended on the manual annotation of LiDAR point clouds, a process that is labor-intensive, time-consuming, and costly. Moreover, manual annotation procedures are not readily updated to reflect dynamic changes in road conditions. In response, recent studies [9,10,11,12,13] have explored leveraging onboard sensor data to generate HD maps for the surrounding environment in real time.
Several works [9,11,12,14,15] have formulated real-time HD map generation as a semantic segmentation problem, aiming to learn and produce pixel-level rasterized maps from a bird’s-eye view (BEV) perspective. However, rasterized maps inherently contain redundant pixel-level semantic information and are not directly compatible with downstream tasks. Consequently, additional post-processing is required to vectorize these maps, a procedure that not only increases computational overhead but may also introduce cumulative errors. To overcome these limitations, VectorMapNet [16] represents each map element as a sequence of points within a two-stage framework that progressively refines the predictions from coarse to fine levels. Subsequently, MapTR [13] employs DETR to directly regress the point coordinates of map elements before grouping the predicted points to form complete road element instances. MapTRv2 [17] further improves model accuracy and training efficiency by refining the attention mechanism and incorporating auxiliary supervision.
Despite these promising advancements, the current map vectorization frameworks still exhibit several limitations. Representing map elements as point sets leads to varying numbers of points depending on their complexity, complicating model training. To address this variability, some studies [13,17] employ equidistant sampling on raw map data, converting each element into a fixed number of points regardless of shape complexity. However, as demonstrated in Figure 1b, this approach results in a loss of precision, particularly in regions characterized by right angles and curves, and introduces inherent errors that are not adequately captured by evaluation metrics, potentially impacting downstream tasks.
To overcome these issues and achieve a more precise map representation, the conventional equidistant sampling method has been refined by introducing a novel keypoint fine-tuning strategy. This approach selectively samples points to preserve the original shape of map elements as accurately as possible, thereby minimizing distortion. However, the resulting point sets no longer exhibit equidistant spacing, which introduces new challenges for network training. In response, KPMapNet is proposed as a novel framework designed to accurately model map elements through a redesigned loss function and an integrated prediction architecture that incorporates map masks. Specifically, the training ground truth is partitioned based on positional features into key points which delineate the shape and collinear points that supplement the representation. Based on this partitioning, the conventional point-to-point matching loss is reformulated into a point-to-line hybrid matching loss. Furthermore, a BEV mask fusion module is designed to enhance the localization and representational capabilities of BEV features by integrating the learned BEV mask into the feature map. Additionally, a hybrid query module is introduced within the decoder to improve query initialization and facilitate more effective detection.
Extensive experiments conducted on the NuScenes [18] and Argoverse2 [19] datasets demonstrate that KPMapNet achieves a state-of-the-art performance in terms of accuracy. Moreover, the visualization results indicate that KPMapNet yields more accurate shape recognition for map elements. The main contributions of this work can be summarized as follows:
  • A keypoint fine-tuning strategy is proposed that optimizes the preservation of original ground truth shape features, establishing a novel method for ground truth representation in this field.
  • A novel framework, KPMapNet, is introduced that integrates map mask feature fusion with an innovative point-to-line hybrid matching loss, thereby enabling the precise modeling of map elements.
  • KPMapNet achieves 75.1 mAP on the NuScenes dataset and 74.2 mAP on the Argoverse2 dataset. Moreover, the visualization results demonstrate the framework’s improved predictive performance for instances exhibiting complex shapes.

2. Related Works

2.1. Online HD Map Construction

Traditional offline HD map generation typically relies on manual or semi-automated annotation [9,20]. With the emergence of methods that transition from perspective view (PV) to BEV representations [21,22], recent studies [1,23,24,25] have investigated the feasibility of generating HD maps online directly from perspective view sensors. For example, HDMapNet [12] aims to directly generate vectorized map elements rather than rasterized maps, while VectorMapNet [16] employs a two-stage framework with an auto-regressive decoder to iteratively predict vector vertices. Furthermore, MapTR [13] and its enhanced version, MapTRv2 [17], introduce novel representations for map elements, enabling the simultaneous regression of both category and positional information. The proposed approach further extends these methods by incorporating generated map mask features into the model to better capture the specific vector shapes of map elements. In addition, KPMapNet introduces supplementary instance queries with shared decoder parameters during training to enhance overall model performance.

2.2. Map Instance Modeling

HD maps consist of diverse instances with varying geometric characteristics, such as lane dividers, pedestrian crossings, and road boundaries. In certain datasets [18], these map elements are typically stored and represented as polylines. Unlike the fixed-format bounding boxes used in detection tasks, the number of points in a polyline instance can vary, which poses significant challenges for network training. The conventional approaches [13,17,26] typically sample a fixed number of equidistant points from the ground truth, a strategy that may lead to a loss of shape information. In contrast, BeMapNet [24] represents ground truth using Bézier curves, while PivotNet [27] dynamically models and matches instances by employing pivotal points on a per-instance basis. In this paper, a keypoint fine-tuning strategy is proposed that replaces equidistant sampling with a more precise representation closely aligned with the original instances, thereby preserving detailed map features while maintaining model efficiency.

2.3. Map Mask for Segmentations

Image masks are widely employed in segmentation tasks to enhance the quality of both instance-level and semantic features. Previous studies [28] have examined the interaction between learned mask features and instance activation maps to effectively capture object-specific characteristics. Other works [29,30] have integrated mask features into transformer-based architectures, leveraging attention mechanisms for efficient feature extraction. In the context of HD map detection, KPMapNet utilizes high-quality map mask features to emphasize and enhance regions with detailed map annotations, thereby improving the model’s ability to capture fine-grained information.

3. The Proposed Method

3.1. Model Pipeline

Figure 2 illustrates the overall architecture of KPMapNet. Building upon MapTRv2 [17], KPMapNet first extracts features from the input surrounding PV images and projects them into the BEV space. To enhance the representational capacity of the BEV features, a BEV mask fusion module is introduced, which reintegrates the generated rasterized map into the BEV features. These enriched BEV features are subsequently fed into a carefully designed hybrid attention decoder that generates the corresponding map instances. In addition, the processing of the ground truth data is redesigned to obtain more accurate training targets, and the loss function is redefined accordingly.
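To make the data flow concrete, the following is a minimal sketch of the pipeline described above. The class and module names are illustrative placeholders rather than the authors' implementation; only the ordering of the stages follows the text.

```python
# Minimal sketch of the KPMapNet forward pass; module names are placeholders.
import torch.nn as nn

class KPMapNetSketch(nn.Module):
    def __init__(self, backbone, bev_encoder, mask_fusion, decoder):
        super().__init__()
        self.backbone = backbone        # PV feature extractor (e.g., ResNet50 + FPN)
        self.bev_encoder = bev_encoder  # PV-to-BEV view transformation (e.g., LSS)
        self.mask_fusion = mask_fusion  # BEV mask fusion module (Section 3.3)
        self.decoder = decoder          # hybrid attention decoder (Section 3.3)

    def forward(self, images):
        pv_feats = self.backbone(images)               # multi-view image features
        bev_feats = self.bev_encoder(pv_feats)         # unified BEV representation
        fused, bev_mask = self.mask_fusion(bev_feats)  # rasterized map re-fused into BEV
        classes, points = self.decoder(fused)          # per-instance class and point set
        return classes, points, bev_mask               # mask kept for auxiliary supervision
```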
The remainder of this section is organized as follows: Section 3.2 details the proposed keypoint fine-tuning sampling method, Section 3.3 describes the architecture of KPMapNet, and Section 3.4 introduces the overall training loss function.

3.2. Ground Truth Representation

Each map instance is defined by its class label $c$ and an ordered sequence of points $P = \{(x_i, y_i)\}_{i=1}^{N}$ that describe its shape, where $N$ specifies the number of points representing each instance. The connectivity between the points is implicitly encoded in their ordering. As illustrated in Figure 1, the number of points $N$ in the original representation varies with the complexity of the map element, complicating the training process. To standardize this representation, equidistant sampling is applied, resulting in a fixed number of points with equal spacing. While this approach simplifies the learning task, it can introduce geometric errors.
To mitigate this issue, this research decomposes the original point set $P$ into the following two subsets: a set of key points $P_{key}$, which capture the essential shape features, and a set of redundant collinear points $P_{col}$, which serve as fillers. The keypoint set $P_{key}$ is an ordered series of points that delineate the overall shape, typically marking changes in direction, whereas $P_{col}$ does not contribute additional shape information.
Let $P_{equ}$ denote the point set obtained via equidistant sampling. $P_{equ}$ is then adjusted using $P_{key}$ to preserve the original shape. The keypoint fine-tuning procedure is as follows: first, $P$ is simplified using a method such as Douglas–Peucker [31] or Visvalingam–Whyatt [32] to derive $P_{key}$; then, equidistant sampling is applied to $P$ to obtain $P_{equ}$; finally, a custom matching algorithm aligns the two sets to obtain the fine-tuned configuration.
Consider matching the keypoint sequence $P_{key} = \{(x_i, y_i)\}_{i=1}^{K}$ with the equidistant sequence $P_{equ} = \{(x_i, y_i)\}_{i=1}^{M}$, where $K$ is the number of key points and $M$ is a predefined fixed number. While $M$ is fixed, $K$ varies with the map element's shape complexity. Let $\mathcal{V}$ denote the set of all admissible matchings under the ordering constraints; there are $C_{M-2}^{K-2}$ possible combinations. For the $i$-th key point $(x_i, y_i)$, let $(x_{\nu(i)}, y_{\nu(i)})$ be its corresponding point in $P_{equ}$. For a given matching $\nu \in \mathcal{V}$, the matching cost is defined as follows:
$$L_{match}(P_{equ}, P_{key}, \nu) = \sum_{i=1}^{K} \left\| (x_i, y_i) - (x_{\nu(i)}, y_{\nu(i)}) \right\|_2 ,$$
where $\|\cdot\|_2$ denotes the L2 distance. The optimal matching $\nu_b$ is then obtained by minimizing the matching cost, as in the following equation:
$$\nu_b = \operatorname*{argmin}_{\nu \in \mathcal{V}} L_{match}(P_{equ}, P_{key}, \nu)$$
Due to the ordered nature of the point sets, a fixed correspondence is assumed for the endpoints, i.e., $\nu(1) = 1$ and $\nu(K) = M$. This matching problem is addressed using dynamic programming. The algorithm uses an array $dp$ to store the matching costs; let $dp[i][j]$ denote the minimum matching cost for the first $i$ key points and the first $j$ equidistant points. The state transition can be expressed as follows:
$$dp[i][j] = \min\left( dp[i][j-1],\; dp[i-1][j-1] + \left\| (x_i, y_i) - (x_j, y_j) \right\|_2 \right)$$
The initial condition is defined as follows:
$$dp[1][j] = \min_{1 \le k \le j} \left\| (x_1, y_1) - (x_k, y_k) \right\|_2 , \quad 1 \le j \le M - K$$
Finally, adjusting $P_{equ}$ according to the optimal matching $\nu_b$, i.e., replacing each matched equidistant point with its corresponding key point, yields the fine-tuned point set $P_{fin} = \{(x_i, y_i)\}_{i=1}^{M}$, which maintains a consistent number of points while preserving a more accurate representation of the original shapes.
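For concreteness, the following is a minimal NumPy/Shapely sketch of the keypoint fine-tuning step. The use of Shapely for Douglas–Peucker simplification and arc-length sampling, the tolerance value, and the prefix-minimum form of the dynamic program are illustrative assumptions; only the recurrence and the pinned endpoints $\nu(1)=1$, $\nu(K)=M$ follow the text.

```python
import numpy as np
from shapely.geometry import LineString

def keypoint_finetune(polyline, M=20, tol=0.1):
    """polyline: (N, 2) array of ordered instance points; returns an (M, 2) array."""
    line = LineString(polyline)
    # P_key: shape-defining vertices obtained by Douglas-Peucker simplification
    p_key = np.asarray(line.simplify(tol).coords)
    # P_equ: M equidistant points sampled along the original polyline
    p_equ = np.asarray([line.interpolate(d).coords[0]
                        for d in np.linspace(0.0, line.length, M)])
    K = len(p_key)
    assert 2 <= K <= M, "more key points than sampled points is not handled here"

    # DP over ordered matchings with pinned endpoints: nu(1) = 1 and nu(K) = M
    cost = np.linalg.norm(p_key[:, None, :] - p_equ[None, :, :], axis=-1)  # (K, M)
    best = np.full((K, M), np.inf)
    parent = np.zeros((K, M), dtype=int)
    best[0, 0] = cost[0, 0]
    for i in range(1, K):
        run_min, run_arg = np.inf, 0
        for j in range(i, M - (K - 1 - i)):       # leave room for remaining key points
            if best[i - 1, j - 1] < run_min:       # running minimum of dp[i-1][.]
                run_min, run_arg = best[i - 1, j - 1], j - 1
            best[i, j] = cost[i, j] + run_min
            parent[i, j] = run_arg

    # Backtrack the optimal assignment and snap the matched slots onto the key points
    idx = [M - 1]
    for i in range(K - 1, 0, -1):
        idx.append(parent[i, idx[-1]])
    p_fin = p_equ.copy()
    p_fin[idx[::-1]] = p_key
    return p_fin
```

Because the endpoints are pinned and the assignment must be strictly increasing, the inner loop only needs a running minimum over $dp[i-1][\cdot]$, keeping the matching at $O(KM)$ time.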

3.3. Architecture Detail

BEV Feature Extractor. Given the input surround-view images $I = \{I_1, \dots, I_n\}$, KPMapNet employs a conventional CNN backbone [33,34,35], along with an FPN [36], to extract the multi-view PV features $F = \{F_1, \dots, F_n\}$. These PV features are then integrated into a unified BEV representation using strategies such as LSS [23,25,37,38,39], GKT [40], CVT [11], and deformable attention [13,41,42]. Following the settings in MapTRv2 [17], KPMapNet adopts LSS [38] as the default view transformation to leverage depth information during supervised learning. The resulting BEV feature is denoted as $F_{bev} \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ represents the spatial dimensions and $C$ denotes the feature depth.
BEV Mask Fusion. Before passing the BEV features to the decoder, KPMapNet introduces an optimization module that refines the BEV features at a fine-grained, point-level resolution by leveraging a rasterized BEV feature mask. As shown in Figure 2, a convolutional operation is applied to $F_{bev}$ to generate the BEV mask $F_{mask} \in \mathbb{R}^{1 \times H \times W}$. During training, $F_{mask}$ is supervised using the ground truth mask $\mathrm{GT}_{mask}$, with the loss defined as the cross-entropy loss, as in the following equation:
$$L_{mask} = L_{CE}(F_{mask}, \mathrm{GT}_{mask}).$$
Subsequently, a CNN $\phi_{up}(\cdot)$ upsamples $F_{mask}$ to a channel dimensionality of 32. The upsampled mask is concatenated with the BEV feature $F_{bev}$ and a two-dimensional normalized positional encoding $F_{pose} \in \mathbb{R}^{2 \times H \times W}$ that encodes spatial location. Finally, these features are fused via a convolutional operation, as follows:
$$F_{fus} = \mathrm{Conv}(\mathrm{Concat}(F_{bev}, \phi_{up}(F_{mask}), F_{pose})).$$
The fused feature $F_{fus} \in \mathbb{R}^{H \times W \times C}$ emphasizes the salient positional and semantic information, thereby enhancing the model's ability to accurately predict detailed map shapes by effectively distinguishing instance features from background noise.
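A minimal PyTorch sketch of this fusion step is shown below. The kernel sizes, the 1-to-32-channel upsampling layer, and the normalized coordinate grid used for $F_{pose}$ are assumptions for illustration; only the overall flow (mask prediction, channel expansion, concatenation with $F_{bev}$ and $F_{pose}$, and a final fusing convolution) follows the text.

```python
import torch
import torch.nn as nn

class BEVMaskFusion(nn.Module):
    """Illustrative BEV mask fusion module; layer sizes are assumptions."""
    def __init__(self, bev_channels=256, mask_channels=32):
        super().__init__()
        self.mask_head = nn.Conv2d(bev_channels, 1, kernel_size=1)            # F_mask: 1 x H x W
        self.phi_up = nn.Conv2d(1, mask_channels, kernel_size=3, padding=1)   # phi_up(.)
        self.fuse = nn.Conv2d(bev_channels + mask_channels + 2, bev_channels,
                              kernel_size=3, padding=1)

    def forward(self, f_bev):                       # f_bev: (B, C, H, W)
        b, _, h, w = f_bev.shape
        f_mask = self.mask_head(f_bev)              # supervised with L_mask = CE(F_mask, GT_mask)
        ys = torch.linspace(0, 1, h, device=f_bev.device)
        xs = torch.linspace(0, 1, w, device=f_bev.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        f_pose = torch.stack([xx, yy]).expand(b, -1, -1, -1)   # normalized 2D positions
        f_fus = self.fuse(torch.cat([f_bev, self.phi_up(f_mask), f_pose], dim=1))
        return f_fus, f_mask                        # F_mask is kept for the auxiliary loss
```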
Hybrid Attention Decoder. For each map instance, the transformer decoder employs a set of queries to predict both the category and the geometric details through regression. As depicted in Figure 2, KPMapNet adopts a hybrid approach that combines instance queries $Q_{ins}$ with point queries $Q_{pts}$, while leveraging BEV features to refine the instance queries for improved initialization. The instance queries $Q_{ins} \in \mathbb{R}^{D \times C}$, with $D$ denoting the number of instance queries, are dynamically updated alongside their random initialization to incorporate instance-specific features. Specifically, a convolutional layer followed by a sigmoid activation predicts an instance mask, which is then fused with the randomly initialized instance queries, as expressed in the following equation:
$$Q_{ins} = \mathrm{Sigmoid}(\mathrm{Conv}(F_{fus})) \times F_{fus} + \mathrm{Init}(Q_{ins}).$$
The point queries $Q_{pts} \in \mathbb{R}^{M \times C}$ are obtained via random initialization. The instance and point queries are then combined to form the hybrid queries $Q \in \mathbb{R}^{D \times M \times C}$. Finally, $Q$ and $F_{fus}$ are fed into an $L$-layer decoder to generate the predicted category $\hat{c}$ and point set $\hat{P} = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{M}$. The iterative residual design of the decoder further enhances the network's capacity to learn robust regression features, thereby improving both representation and prediction accuracy.
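The query construction can be sketched as follows. How the masked BEV feature is reduced to one vector per instance query (the mask-weighted pooling below) and the broadcast addition used to combine instance and point queries are assumptions; only the form $Q_{ins} = \mathrm{Sigmoid}(\mathrm{Conv}(F_{fus})) \times F_{fus} + \mathrm{Init}(Q_{ins})$ and the hybrid query shape $D \times M \times C$ follow the text.

```python
import torch
import torch.nn as nn

class HybridQueryInit(nn.Module):
    """Illustrative hybrid query construction; the pooling step is an assumption."""
    def __init__(self, num_ins=50, num_pts=20, channels=256):
        super().__init__()
        self.ins_embed = nn.Embedding(num_ins, channels)   # Init(Q_ins): random initialization
        self.pts_embed = nn.Embedding(num_pts, channels)   # Q_pts: random initialization
        self.mask_conv = nn.Conv2d(channels, num_ins, kernel_size=1)

    def forward(self, f_fus):                              # f_fus: (B, C, H, W)
        b, c, h, w = f_fus.shape
        attn = torch.sigmoid(self.mask_conv(f_fus))        # (B, D, H, W) per-instance masks
        # Mask-weighted pooling of F_fus gives one dynamic feature per instance query
        feat = torch.einsum("bdhw,bchw->bdc", attn, f_fus) / (h * w)
        q_ins = feat + self.ins_embed.weight               # dynamic update + random init
        # Broadcast addition combines instance and point queries into Q: (B, D, M, C)
        q = q_ins[:, :, None, :] + self.pts_embed.weight[None, None, :, :]
        return q
```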

3.4. Training Loss

Point–line Matching Loss. The fine-tuned point set $P_{fin}$ obtained via the keypoint fine-tuning strategy does not enforce equal spacing between points, which poses new challenges for training. To accommodate the characteristics of this new ground truth, the original point-to-point matching loss is modified. Specifically, $P_{fin}$ can be decomposed into key points $P_{key}$ and collinear filler points $P_{col}$. For predicted points corresponding to key points, the L1 distance to the ground truth is computed directly. For predicted points corresponding to collinear points, the L1 distance is computed to their perpendicular projection onto the ground truth line on which the collinear points lie, as illustrated in Figure 3. The final point-to-line matching loss is defined as follows:
$$L_{p2l} = L_1(P_{key}, \hat{P}_{key}) + L_1(\mathrm{Line}(P_{col}), \hat{P}_{col}),$$
where $\mathrm{Line}(\cdot)$ denotes the straight line on which the collinear points reside.
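A hedged sketch of $L_{p2l}$ for a single matched instance is given below. The per-point loop and the choice of the segment between the two neighbouring key points as $\mathrm{Line}(P_{col})$ are illustrative; only the split into key and collinear points and the projection-based L1 term follow the text.

```python
import torch

def point_line_loss(pred, gt, key_mask):
    """pred, gt: (M, 2) ordered, matched point sets for one instance;
    key_mask: (M,) bool tensor, True where the ground truth point is a key point.
    The first and last points are assumed to be key points."""
    # Key points: plain L1 distance to their ground truth positions
    loss = (pred[key_mask] - gt[key_mask]).abs().sum()

    key_idx = torch.nonzero(key_mask, as_tuple=False).squeeze(1)
    for i in torch.nonzero(~key_mask, as_tuple=False).squeeze(1):
        a = gt[key_idx[key_idx < i].max()]        # previous key point on the GT line
        b = gt[key_idx[key_idx > i].min()]        # next key point on the GT line
        d = b - a
        t = ((pred[i] - a) @ d) / (d @ d + 1e-8)  # scalar projection onto the line
        proj = a + t * d                          # perpendicular foot on Line(P_col)
        loss = loss + (pred[i] - proj).abs().sum()  # L1 to the projected point
    return loss / gt.shape[0]
```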
Overall Loss. Following the approach in MapTRv2 [17], KPMapNet also incorporates a classification loss $L_{cls}$ and an edge direction loss $L_{dir}$. The total loss function is defined as follows:
$$L = \beta_1 L_{mask} + \beta_2 L_{p2l} + \beta_3 L_{cls} + \beta_4 L_{dir},$$
where $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are the corresponding loss weights.

4. Experiments

4.1. Datasets

Model performance is evaluated through comprehensive experiments on two widely used autonomous driving datasets, nuScenes [18] and Argoverse2 [19]. The nuScenes dataset comprises 1000 driving scenarios collected in Singapore and Boston, USA, under diverse weather conditions and times of day; 700 scenes are allocated for training, 150 for validation, and 150 for testing. Each scenario spans approximately 20 s and provides 40 keyframes sampled at 2 Hz, including 360-degree RGB images from six cameras, LiDAR point clouds, and precisely annotated 2D map instances. Argoverse2 consists of 1000 scenarios with 3D map annotations collected from six U.S. cities, featuring data from LiDAR, stereo cameras, and ring cameras. In line with previous studies [16,17,26], our experiments focus on the following three static categories of map elements: pedestrian crossing (ped.), lane divider (div.), and road boundary (bou.).

4.2. Evaluation Metrics

To ensure fair comparisons, the quality of map instance predictions is evaluated using average precision (AP) based on the average Chamfer distance. The Chamfer distance quantifies the alignment between predictions and ground truth by computing the mean Euclidean distance between corresponding points. Points are first sampled from both the predicted and ground truth instances; the forward Chamfer distance is the mean of the shortest distances from predicted points to the ground truth, and the reverse distance is computed analogously from ground truth points to the prediction. The average Chamfer distance is obtained by averaging these two measures. Predictions with distances below a specified threshold are considered true positives (TPs). Following the previous studies [17,27], this research employs two threshold sets, {0.5, 1.0, 1.5} m and {0.2, 0.5, 1.0} m, and reports the mAP as the average precision across these thresholds. Consistent with MapTRv2 [17], 100 points are sampled for Chamfer distance calculations.
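The following sketch illustrates the Chamfer-distance criterion used for true-positive assignment, assuming a simple arc-length resampling helper. The 100-point resampling and the averaging of the forward and reverse directions follow the text; the helper itself is an illustrative implementation.

```python
import numpy as np

def resample(polyline, n=100):
    """Uniformly resample an (N, 2) polyline to n points by arc length."""
    seg = np.linalg.norm(np.diff(polyline, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, s[-1], n)
    return np.stack([np.interp(t, s, polyline[:, 0]),
                     np.interp(t, s, polyline[:, 1])], axis=1)

def chamfer_distance(pred, gt, n=100):
    """Average of the forward and reverse mean nearest-neighbour distances."""
    p, g = resample(pred, n), resample(gt, n)
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)  # pairwise distances
    forward = d.min(axis=1).mean()    # prediction -> ground truth
    reverse = d.min(axis=0).mean()    # ground truth -> prediction
    return 0.5 * (forward + reverse)

# A prediction counts as a true positive when chamfer_distance(pred, gt) falls below
# the threshold (e.g., 0.5 m, 1.0 m, or 1.5 m); AP is then computed per class.
```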

4.3. Implementation Details

Architecture. KPMapNet employs ResNet50 [33], EfficientNet-B0 [34], and VoVNetV2-99 [35] as its backbone networks. For the NuScenes dataset, the original image resolution is 1600 × 900 . These images are resized by a factor of 0.5 and padded to 800 × 480 . In the Argoverse2 dataset, the seven camera images have differing resolutions (with the front view at 1550 × 2048 and the others at 2048 × 1550 ). The front view images are cropped to 1550 × 1550 and padded to match the other views ( 2048 × 1550 ), and then all seven images are resized to 704 × 544 . The preprocessed images are subsequently passed through the backbone and FPN for feature extraction, followed by an encoder to generate BEV features. In the experiments, the BEV perception range is set to 30 m both ahead and behind the vehicle, as well as 15 m to both the left and right. With a feature resolution of 0.15 × 0.15 m, this results in a grid of 200 × 100 BEV queries, each with a feature dimension of 256, processed by six decoder layers. The default number of instance queries is 50, and the number of point queries is 20. All settings are adopted from the previous studies [13,17,27] to ensure a fair comparison.
Training and Inference. KPMapNet is trained on a single NVIDIA A100 Tensor Core GPU (Santa Clara, CA, USA) with a batch size of 16, using the AdamW [43] optimizer with a weight decay of 0.01. Training is conducted for 24 epochs by default, starting with an initial learning rate of $4 \times 10^{-4}$ and applying a decay factor of 0.1 to the backbone learning rate. The learning rate follows a cosine annealing schedule with a linear warm-up phase. For the point-line matching loss, the loss weight is set to 6, while all other hyperparameters follow the configurations used in MapTRv2 [17]. Inference is executed on a single NVIDIA GeForce RTX 3090 GPU, and the inference time is recorded.
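A sketch of the optimizer and schedule setup is given below. The parameter-group split by name prefix, the warm-up length, and the start factor are illustrative assumptions; only AdamW with 0.01 weight decay, a $4 \times 10^{-4}$ base learning rate, a 0.1 backbone multiplier, and cosine annealing with linear warm-up follow the text.

```python
import torch

def build_optimizer(model, base_lr=4e-4, backbone_lr_mult=0.1, weight_decay=0.01,
                    total_steps=10000, warmup_steps=500):
    # Backbone parameters get a reduced learning rate (assumed "backbone" name prefix)
    backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]
    optimizer = torch.optim.AdamW(
        [{"params": backbone_params, "lr": base_lr * backbone_lr_mult},
         {"params": other_params, "lr": base_lr}],
        lr=base_lr, weight_decay=weight_decay)
    # Linear warm-up followed by cosine annealing
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3,
                                               total_iters=warmup_steps)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=total_steps - warmup_steps)
    scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                      milestones=[warmup_steps])
    return optimizer, scheduler
```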

4.4. Main Results

Results on nuScenes. KPMapNet is trained using various epoch schedules and backbone networks on the nuScenes dataset. Table 1 presents a performance comparison of KPMapNet, using only RGB image inputs, with the previous methods. KPMapNet achieves the best performance under both threshold configurations (75.1 and 53.2 mAP, respectively). The metrics for the comparison methods [12,13,16,17,27,44] are reproduced by executing publicly available source code and model checkpoints. The results demonstrate that KPMapNet outperforms the previous approaches, yielding significant improvements for complex map elements, particularly ped. and bou. Under the easy threshold {0.5, 1.0, 1.5} m, KPMapNet-ResNet50 improves upon the previous SOTA MapTRv2 by 3.9 AP for ped. and 3.7 AP for bou. Under the strict threshold {0.2, 0.5, 1.0} m, KPMapNet also outperforms PivotNet by 3.6 AP for ped. and 3.7 AP for bou. With the VoVNetV2-99 backbone, KPMapNet achieves 75.1 mAP, 1.7 mAP higher than MapTRv2. Moreover, the visualizations in Figure 4 illustrate the improved shape representation achieved by our model.
Results on Argoverse2. As shown in Table 2, KPMapNet consistently exceeds the previous SOTA methods on the Argoverse2 dataset under both evaluation settings. In particular, KPMapNet reaches 74.2 mAP, outperforming MapTRv2 by 6.8 mAP, which further validates the effectiveness of the proposed approach.

4.5. Ablation Study

This section presents ablation experiments that evaluate the contributions of the proposed modules and design choices. Unless otherwise specified, the experiments were conducted using ResNet50 [33] as the backbone with nuScenes camera images as input, training for 24 epochs, and evaluation under the easy threshold configuration.
Ablation on different modules. The ablation results in Table 3 validate the contribution of each design component to overall performance. Initially, the configurations of MapTRv2 were adopted as baselines for comparison. Although the introduction of keypoint fine-tuning initially increased the network’s learning complexity and caused a slight performance decline, the subsequent integration of additional components resolved this issue and enhanced overall performance. Specifically, the new ground truth representation significantly improved the network’s precision in detecting complex vectors. The point-line matching loss accelerated convergence while maintaining performance comparable to the baseline. Moreover, incorporating mask fusion and hybrid attention resulted in performance gains of 2.9 mAP and 1.5 mAP, respectively. Together, these components achieved a cumulative improvement of 3.8 mAP, which was the highest observed enhancement.
Point–line loss weight. This research explores various weight configurations for the point-line matching loss to determine the optimal setting. As shown in Table 4, performance was relatively insensitive to specific weight values; however, excessively high or low weights adversely affected network convergence, leading to performance degradation. Based on these observations, a weight of 6 was identified as the optimal choice.
Auxiliary mask loss. Table 5 demonstrates the effectiveness of BEV mask supervision. By enforcing constraints on BEV features to ensure they capture meaningful spatial information, the auxiliary mask loss L m a s k yielded a 1.7 mAP improvement, thereby enhancing overall performance.
Discussion of different thresholds. Smaller threshold settings impose stricter performance requirements on the model. Given that HD maps for autonomous driving require error control at the centimeter level, performance improvements under these stricter thresholds are particularly valuable for practical applications. As shown in Table 6, this model consistently achieves superior results compared to the previous methods across all thresholds. The improvements are especially pronounced at the 0.2 m and 0.5 m thresholds, underscoring the model’s effectiveness under more rigorous evaluation criteria.

5. Conclusions

This paper addresses the challenges of accurately processing vector representations for map elements by introducing a novel keypoint fine-tuning strategy and an online HD vectorized map detection framework, KPMapNet. Our approach offers a unified method for refining map elements, effectively preserving intricate shape details that are typically compromised during traditional ground truth processing. By integrating innovative components—including a BEV mask fusion module, a hybrid attention mechanism, and a point-line matching loss function—KPMapNet mitigates the loss of fine structural information and enhances the precision of HD map construction. Experimental results on public datasets demonstrate that KPMapNet achieves state-of-the-art performance, establishing a new benchmark in the domain. Notably, our framework significantly improves the detection accuracy of complex map elements, such as pedestrian crossings and road boundaries, underscoring its potential for practical autonomous driving applications.
Future research will explore the integration of multimodal sensor data and temporal information to construct more holistic and robust representations of the driving environment. These advancements are expected to further enhance the performance of vectorized map detection systems and contribute to the development of more reliable autonomous driving solutions.

Author Contributions

Conceptualization, B.J., W.H. and W.Q.; methodology, S.P.; software, B.J.; validation, B.J., W.H., W.Q. and S.P.; formal analysis, B.J.; investigation, W.H. and W.Q.; resources, S.P.; data curation, B.J.; writing—original draft preparation, B.J., W.H. and W.Q.; writing—review and editing, S.P.; visualization, B.J.; supervision, S.P.; project administration, S.P.; funding acquisition, S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China under Grant 2022ZD0117903.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liang, M.; Yang, B.; Hu, R.; Chen, Y.; Liao, R.; Feng, S.; Urtasun, R. Learning lane graph representations for motion forecasting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 541–556. [Google Scholar]
  2. Song, H.; Luan, D.; Ding, W.; Wang, M.Y.; Chen, Q. Learning to predict vehicle trajectories with model-based planning. In Proceedings of the Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1035–1045. [Google Scholar]
  3. Deo, N.; Wolff, E.; Beijbom, O. Multimodal trajectory prediction conditioned on lane-graph traversals. In Proceedings of the Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 203–212. [Google Scholar]
  4. Da, F.; Zhang, Y. Path-aware graph attention for hd maps in motion prediction. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 6430–6436. [Google Scholar]
  5. Espinoza, J.L.V.; Liniger, A.; Schwarting, W.; Rus, D.; Van Gool, L. Deep interactive motion prediction and planning: Playing games with motion prediction models. In Proceedings of the Learning for Dynamics and Control Conference. PMLR, Stanford, CA, USA, 23–24 June 2022; pp. 1006–1019. [Google Scholar]
  6. Wu, L.; Huang, X.; Cui, J.; Liu, C.; Xiao, W. Modified adaptive ant colony optimization algorithm and its application for solving path planning of mobile robot. Expert Syst. Appl. 2023, 215, 119410. [Google Scholar] [CrossRef]
  7. Levinson, J.; Montemerlo, M.; Thrun, S. Map-based precision vehicle localization in urban environments. In Proceedings of the Robotics: Science and Systems, Atlanta, GA, USA, 27–30 June 2007; Volume 4, pp. 121–128. [Google Scholar]
  8. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17853–17862. [Google Scholar]
  9. Jiao, J. Machine learning assisted high-definition map creation. In Proceedings of the 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo, Japan, 23–27 July 2018; Volume 1, pp. 367–373. [Google Scholar]
  10. Lu, C.; Van De Molengraft, M.J.G.; Dubbelman, G. Monocular semantic occupancy grid mapping with convolutional variational encoder–decoder networks. IEEE Robot. Autom. Lett. 2019, 4, 445–452. [Google Scholar] [CrossRef]
  11. Pan, B.; Sun, J.; Leung, H.Y.T.; Andonian, A.; Zhou, B. Cross-view semantic segmentation for sensing surroundings. IEEE Robot. Autom. Lett. 2020, 5, 4867–4873. [Google Scholar] [CrossRef]
  12. Li, Q.; Wang, Y.; Wang, Y.; Zhao, H. Hdmapnet: An online hd map construction and evaluation framework. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 4628–4634. [Google Scholar]
  13. Liao, B.; Chen, S.; Wang, X.; Cheng, T.; Zhang, Q.; Liu, W.; Huang, C. MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  14. Roddick, T.; Cipolla, R. Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11138–11147. [Google Scholar]
  15. Zhang, Y.; Zhu, Z.; Zheng, W.; Huang, J.; Huang, G.; Zhou, J.; Lu, J. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv 2022, arXiv:2205.09743. [Google Scholar]
  16. Liu, Y.; Yuan, T.; Wang, Y.; Wang, Y.; Zhao, H. Vectormapnet: End-to-end vectorized hd map learning. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 22352–22369. [Google Scholar]
  17. Liao, B.; Chen, S.; Zhang, Y.; Jiang, B.; Zhang, Q.; Liu, W.; Huang, C.; Wang, X. Maptrv2: An end-to-end framework for online vectorized hd map construction. Int. J. Comput. Vis. 2025, 133, 1352–1374. [Google Scholar] [CrossRef]
  18. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. arXiv 2019, arXiv:1903.11027. [Google Scholar]
  19. Wilson, B.; Qi, W.; Agarwal, T.; Lambert, J.; Singh, J.; Khandelwal, S.; Pan, B.; Kumar, R.; Hartnett, A.; Pontes, J.K.; et al. Argoverse 2: Next Generation Datasets for Self-driving Perception and Forecasting. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021), Virtual, 4–6 December 2021. [Google Scholar]
  20. Shan, T.; Englot, B. Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4758–4765. [Google Scholar]
  21. Ma, Y.; Wang, T.; Bai, X.; Yang, H.; Hou, Y.; Wang, Y.; Qiao, Y.; Yang, R.; Manocha, D.; Zhu, X. Vision-centric bev perception: A survey. arXiv 2022, arXiv:2208.02797. [Google Scholar] [CrossRef] [PubMed]
  22. Li, H.; Sima, C.; Dai, J.; Wang, W.; Lu, L.; Wang, H.; Zeng, J.; Li, Z.; Yang, J.; Deng, H.; et al. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2151–2170. [Google Scholar] [CrossRef] [PubMed]
  23. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
  24. Qiao, L.; Ding, W.; Qiu, X.; Zhang, C. End-to-end vectorized hd-map construction with piecewise bezier curve. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13218–13228. [Google Scholar]
  25. Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 194–210. [Google Scholar]
  26. Yuan, T.; Liu, Y.; Wang, Y.; Wang, Y.; Zhao, H. StreamMapNet: Streaming Mapping Network for Vectorized Online HD Map Construction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 7356–7365. [Google Scholar]
  27. Ding, W.; Qiao, L.; Qiu, X.; Zhang, C. Pivotnet: Vectorized pivot learning for end-to-end hd map construction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3672–3682. [Google Scholar]
  28. Cheng, T.; Wang, X.; Chen, S.; Zhang, W.; Zhang, Q.; Huang, C.; Zhang, Z.; Liu, W. Sparse instance activation for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 4433–4442. [Google Scholar]
  29. Liang, J.; Homayounfar, N.; Ma, W.C.; Xiong, Y.; Hu, R.; Urtasun, R. Polytransform: Deep polygon transformer for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9131–9140. [Google Scholar]
  30. Ding, J.; Xue, N.; Xia, G.S.; Schiele, B.; Dai, D. Hgformer: Hierarchical grouping transformer for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15413–15423. [Google Scholar]
  31. Douglas, D.H.; Peucker, T.K. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartogr. Int. J. Geogr. Inf. Geovisualization 1973, 10, 112–122. [Google Scholar] [CrossRef]
  32. Visvalingam, M.; Whyatt, J. Line generalisation by repeated elimination of points. Cartogr. J. 1993, 30, 46–51. [Google Scholar] [CrossRef]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  34. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  35. Lee, Y.; Hwang, J.w.; Lee, S.; Bae, Y.; Park, J. An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  36. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  37. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
  38. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Montréal, QC, Canada, 8–10 August 2023; Volume 37, pp. 1477–1485. [Google Scholar]
  39. Huang, J.; Huang, G. Bevpoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv 2022, arXiv:2211.17111. [Google Scholar]
  40. Chen, S.; Cheng, T.; Wang, X.; Meng, W.; Zhang, Q.; Liu, W. Efficient and robust 2d-to-bev representation learning via geometry-guided kernel transformer. arXiv 2022, arXiv:2206.04584. [Google Scholar]
  41. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  42. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–18. [Google Scholar]
  43. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2018, arXiv:1711.05101. [Google Scholar]
  44. Zhang, G.; Lin, J.; Wu, S.; Luo, Z.; Xue, Y.; Lu, S.; Wang, Z. Online map vectorization for autonomous driving: A rasterization perspective. Adv. Neural Inf. Process. Syst. 2024, 36, 31865–31877. [Google Scholar]
Figure 1. Different approaches to ground truth processing. (a) Ground truth provided by the dataset; (b) ground truth processed using equidistant sampling; (c) ground truth processed using keypoint fine-tuning. As highlighted in the circled region of (b), conventional ground truth processing methods often introduce additional shape distortions. In contrast, the proposed approach effectively preserves shape consistency, thereby mitigating these distortions.
Figure 2. Overview of the KPMapNet framework. The top row illustrates the overall model pipeline, where multi-view images serve as input, and the model generates vectorized map elements in an end-to-end manner. The BEV mask fusion module extracts map masks and then refines BEV features through a reverse fusion process, enhancing spatial and semantic consistency. The hybrid decoder dynamically interacts to extract both point-level and element-level information from the map elements, continuously constructing and updating queries for improved representation learning.
Figure 3. Schematic of point-to-line matching. For clarity, the gap between the predicted points and the ground truth is exaggerated.
Figure 4. Comparison of visualization results on the NuScenes dataset.
Table 1. Performance comparison on NuScenes dataset.
Method | Backbone | Epoch | APdiv. | APped. | APbou. | mAP | APdiv. | APped. | APbou. | mAP | FPS
(columns 4-7: {0.5, 1.0, 1.5} m thresholds; columns 8-11: {0.2, 0.5, 1.0} m thresholds)
HDMapNet [12] | EB0 | 30 | 23.6 | 24.1 | 43.5 | 31.4 | 17.7 | 13.6 | 32.7 | 21.3 | 0.8
KpMapNet (Ours) | EB0 | 30 | 26.2 | 33.5 | 45.3 | 35.0 | 14.5 | 15.8 | 25.0 | 18.4 | 14.1
VectorMapNet [16] | R50 | 110 | 47.3 | 36.1 | 39.3 | 40.9 | 27.2 | 18.2 | 18.4 | 21.3 | 2.2
MapTR [13] | R50 | 24 | 51.5 | 46.3 | 53.1 | 50.3 | 30.7 | 23.2 | 28.2 | 27.3 | 15.1
MapVR [44] | R50 | 24 | 54.4 | 47.7 | 51.4 | 51.2 | - | - | - | - | 15.1
PivotNet [27] | R50 | 24 | 56.5 | 56.2 | 60.1 | 57.6 | 41.4 | 34.3 | 39.8 | 38.5 | 10.4
MapTRv2 [17] | R50 | 24 | 62.4 | 59.8 | 62.4 | 61.5 | 40.0 | 35.4 | 36.3 | 37.2 | 14.1
MapTRv2 † [17] | R50 | 24 | 56.0 | 57.3 | 62.0 | 58.4 | 37.2 | 31.9 | 37.0 | 35.4 | 14.1
KpMapNet (Ours) | R50 | 24 | 62.4 | 63.7 | 66.1 | 64.1 | 42.7 | 37.9 | 43.5 | 41.4 | 13.9
MapTRv2 [17] | V2-99 | 24 | 67.1 | 63.6 | 69.2 | 66.6 | - | - | - | - | 9.9
MapTRv2 [17] | V2-99 | 110 | 73.7 | 71.4 | 75.0 | 73.4 | - | - | - | - | 9.9
KpMapNet (Ours) | V2-99 | 24 | 65.9 | 63.7 | 71.2 | 66.9 | 45.9 | 37.2 | 48.5 | 43.9 | 9.3
KpMapNet (Ours) | V2-99 | 110 | 74.2 | 73.4 | 77.6 | 75.1 | 56.3 | 47.1 | 56.1 | 53.2 | 9.3
† indicates the results obtained by retraining with the keypoint fine-tuning GT. “-” means that the corresponding results are not available. “EB0”, “R50”, and “V2-99”, respectively, correspond to EfficientNet-B0 [34], ResNet50 [33], and VoVNetV2-99 [35]. Best results are highlighted in bold.
Table 2. Performance comparison on Argoverse2 dataset.
Method | Backbone | Epoch | APdiv. | APped. | APbou. | mAP | APdiv. | APped. | APbou. | mAP
(columns 4-7: {0.5, 1.0, 1.5} m thresholds; columns 8-11: {0.2, 0.5, 1.0} m thresholds)
VectorMapNet [16] | R50 | - | 36.1 | 38.3 | 39.2 | 37.9 | - | - | - | -
MapTR [13] | R50 | 6 | 58.1 | 54.7 | 56.7 | 56.5 | 42.2 | 28.3 | 33.7 | 34.8
MapVR [44] | R50 | - | 60.0 | 54.6 | 58.0 | 57.5 | - | - | - | -
PivotNet [27] | R50 | 6 | - | - | - | - | 47.5 | 31.3 | 43.4 | 40.7
MapTRv2 [17] | R50 | 6 | 72.1 | 62.9 | 67.1 | 67.4 | 52.5 | 34.8 | 40.6 | 42.6
KpMapNet (Ours) | R50 | 6 | 69.4 | 74.7 | 78.5 | 74.2 | 53.3 | 45.7 | 56.1 | 51.7
“-” means that the corresponding results are not available. Best results are highlighted in bold.
Table 3. Effectiveness of different modules in KPMapNet.
Fine-Tuning | Point-Line | Fusion | Hybrid | APdiv. | APped. | APbou. | mAP
- | - | - | - | 61.2 | 58.9 | 61.5 | 60.5
✓ | - | - | - | 55.7 | 57.5 | 61.5 | 58.2
✓ | ✓ | - | - | 57.6 | 59.9 | 63.3 | 60.3
✓ | ✓ | ✓ | - | 61.2 | 63.1 | 65.2 | 63.2
✓ | ✓ | - | ✓ | 59.3 | 60.5 | 64.5 | 61.8
✓ | ✓ | ✓ | ✓ | 62.4 | 63.7 | 66.1 | 64.1
Best results are highlighted in bold.
Table 4. Effectiveness of point-line loss weight.
Loss Weight | APdiv. | APped. | APbou. | mAP
4 | 61.0 | 63.0 | 66.6 | 63.5
5 | 61.3 | 64.5 | 65.9 | 63.9
6 | 62.4 | 63.7 | 66.1 | 64.1
7 | 61.3 | 61.6 | 65.7 | 62.9
Best results are highlighted in bold.
Table 5. Effectiveness of auxiliary loss.
$L_{mask}$ | APdiv. | APped. | APbou. | mAP
- | 59.3 | 63.0 | 64.9 | 62.4
✓ | 62.4 | 63.7 | 66.1 | 64.1
Best results are highlighted in bold.
Table 6. Comparison of performance under different thresholds.
Method | mAP (0.2 m) | mAP (0.5 m) | mAP (1.0 m) | mAP (1.5 m)
VectorMapNet [16] | 1.1 | 16.6 | 46.2 | 64.0
MapTR [13] | 2.2 | 24.7 | 55.1 | 70.1
MapTRv2 [17] | 5.7 | 38.6 | 67.3 | 72.9
KpMapNet (Ours) | 11.3 | 43.4 | 69.5 | 79.4
Best results are highlighted in bold.

