A Lightweight High Definition Mapping Method Based on Multi-Source Data Fusion Perception

Abstract: In this paper, a lightweight, high-definition mapping method is proposed for autonomous driving to address the drawbacks of traditional mapping methods, such as high cost, low efficiency, and slow update frequency. The proposed method is based on multi-source data fusion perception and involves generating local semantic maps (LSMs) using multi-sensor fusion on a vehicle and uploading multiple LSMs of the same road section, obtained through crowdsourcing, to a cloud server. An improved, two-stage semantic alignment algorithm, based on the semantic generalized iterative closest point (GICP), was then used to optimize the multi-trajectory poses on the cloud. Finally, an improved density clustering algorithm was proposed to instantiate the aligned semantic elements and generate vector semantic maps to improve mapping efficiency. Experimental results demonstrated the accuracy of the proposed method, with a horizontal error within 20 cm, a vertical error within 50 cm, and an average map size of 40 KB/km. The proposed method meets the requirements of being high definition, low cost, lightweight, robust, and up-to-date for autonomous driving applications.

Crowdsourced mapping methods, based on low-cost sensors, are increasingly being explored to reduce mapping costs and promote the development of autonomous driving technology [14,15]. However, due to the large measurement and system errors of low-cost sensors, a semantic alignment fusion algorithm is necessary to optimize the data collected from multiple trajectories of the same road section to improve mapping accuracy [16][17][18].
Common registration algorithms include the iterative closest point (ICP) algorithm, the generalized iterative closest point (GICP) algorithm, and the semantic iterative closest point (SICP) algorithm [16][17][18]. While the ICP algorithm corrects the relative error between trajectories by constructing loop-closure pose constraints between multiple trajectories in a pairwise mode, it is heavily dependent on initial values and can easily be trapped in a local optimal solution, resulting in low registration accuracy. On the other hand, the GICP algorithm combines the ICP algorithm and the point-to-plane ICP into the probability frame model, eliminating the role of some bad corresponding points in the solution process.
Finally, the SICP algorithm improves on the ICP algorithm by adding semantic prior information, which results in higher alignment accuracy and faster convergence speed.
After the semantic alignment of multiple trajectories, a lightweight vector semantic map can be obtained by merging maps on the cloud server, a process known as semantic aggregation. Two different semantic aggregation methods have been proposed. Qin et al. [14] used rich, high-precision sensors to build local semantic maps (LSMs) on the vehicle, so that only semantic element aggregation was needed on the cloud. Their semantic aggregation method generates a grid map on the cloud and divides the LSM into 0.1 m × 0.1 m × 0.1 m small grids, each containing the location, the semantic element type, and the count of each semantic element type. The global map is merged in the cloud according to grids of the same resolution, and the semantic element type with the highest score in each grid is used as the semantic type after aggregation. This method is more suitable for local maps generated by sensors with high accuracy and low noise. In contrast, Herb et al. [15] used crowdsourced data from commercial sensors and fused multi-period vehicle terminal maps in the cloud. The core of their alignment algorithm is to extract the landmarks observed in the keyframes of the sub-map common view, match the features, calculate the relative pose as a closed-loop constraint, and finally, construct the vector map through semantic map reconstruction. However, this method is computationally expensive and inefficient, and it requires a large storage bandwidth.
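Grid-based voting of the kind Qin et al. describe can be sketched in a few lines. The following toy Python function (the function name and the (x, y, z, label) record format are our own illustration, not the authors' implementation) votes each observed semantic point into a 0.1 m grid cell and keeps the majority semantic type per cell:

```python
from collections import defaultdict

def grid_vote(points, cell=0.1):
    """Toy grid-based semantic aggregation: each observed semantic point
    votes in its 0.1 m grid cell; the cell keeps the majority label.
    `points` is an iterable of (x, y, z, label) tuples."""
    votes = defaultdict(lambda: defaultdict(int))
    for x, y, z, label in points:
        key = (int(x // cell), int(y // cell), int(z // cell))
        votes[key][label] += 1
    # For every occupied cell, keep the semantic type with the most votes.
    return {k: max(v, key=v.get) for k, v in votes.items()}
```

In the real pipeline the cell would also store location and per-type counts; here only the winning label per cell is kept to show the voting step.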
In summary, this paper proposes an improved semantic alignment algorithm and semantic clustering algorithm for lightweight mapping by merging LSMs on the cloud server. This low-cost, crowdsourced semantic mapping solution has the potential to reduce mapping costs and promote the development and application of autonomous driving technology.

Mapping System Overview
The HD mapping system comprises two components: real-time local mapping on the vehicle and global mapping on the cloud server, as depicted in Figure 1. Low-cost sensors, such as a monocular camera, an inertial measurement unit (IMU)/global positioning system (GPS), and a wheel encoder, are used to collect data on board the vehicle. Semantic elements, including lane lines, ground arrows, road stop lines, traffic signs, electrical poles, etc., are extracted from monocular vision. The ego-motion, estimated by the IMU and wheel encoder, is projected onto the 3D world coordinate system to form a local semantic map (LSM) using semantic splicing. The global absolute pose of the vehicle can be estimated through GPS/IMU/wheel encoder fusion processing and then uploaded to the cloud server.
Limited by the precision of low-cost sensors and the computing power of the vehicle-end hardware platform, the LSMs generated through single-vehicle acquisition at one time cannot meet the requirements of semantic element precision and completeness for autonomous driving. To solve this problem, we can repeat LSM mapping for the same road section many times and then upload the LSMs to the cloud for overall fusion mapping, which can offset the observation error of single mapping, to some extent, and greatly improve the mapping precision and semantic completeness. First, the LSMs collected on the same road section are unified to the global world coordinate system. Then, semantic alignment is performed, different types of semantic elements are instantiated, and finally, the global semantic map is generated using vectorization.
Due to the large measurement and system errors of low-cost sensors, we propose a fusion alignment algorithm, GBA_SGICP, based on crowdsourced data, for semantic alignment. We also present an improved density clustering algorithm for semantic aggregation in this paper.

Semantic Segmentation and Detection
To improve the accuracy and efficiency of image perception, semantic segmentation and object detection methods were used to extract semantic features from images [19][20][21][22][23][24][25]. To further advance this approach, a neural network model, called OneNet, was proposed. OneNet integrates the detection and segmentation of all semantic elements and is composed of an encoding-decoding main network, target detection head, semantic segmentation head, and instance segmentation head. Figure 2 shows the architecture of OneNet. In OneNet, ResNet18 is adopted as the backbone network to extract features from input images [26]. The target detection head is used to extract air elements, such as traffic lights, traffic signs, and other elements, and the semantic segmentation head is responsible for extracting ground elements, such as arrows, dotted line boxes, lane stop lines, etc. [27]. The instance segmentation head is used to extract lane line semantic elements [28]. The effectiveness of this approach is demonstrated in Figure 3, which shows the segmentation detection results [29].

Semantic Element Ranging
To obtain the local map under 3D world coordinates, the semantic elements extracted from the OneNet model are projected into the vehicle body coordinate system. Since monocular vision is used for local mapping, distance measurement is necessary to estimate the depth of semantic pixels. For the ground elements, the pixel depth is estimated primarily based on the assumption of a near-ground plane. To improve ranging accuracy, a ground equation calibration algorithm, based on the vanishing points of lane lines, is used [30]. For aerial elements, BA optimization is performed through optical flow tracking and attitude, provided by ego-motion, to estimate the image depth [31].

Near-Ground Plane Assumption
It is assumed that the ground can be approximated as a plane within a certain range of camera observations. This allows for a fixed-projection relationship between 2D points on the camera image plane and 3D points on the ground, which can be directly backprojected to solve the problem of monocular camera ranging for ground elements [32]. By estimating the camera pose and using the ground equation calibration algorithm based on the vanishing points of lane lines, the ranging accuracy of the ground elements can be improved.

Ground Element Ranging
To estimate the depth of ground elements, the near-ground plane assumption is commonly used. The derivation formula is as follows:

P_t = \frac{h}{\mathbf{n}^{\top} K^{-1} X_t} \, K^{-1} X_t

where X_t represents the feature point on a 2D image captured by the monocular camera at time t, which corresponds to pavement features, such as lane lines and ground arrows; K is the camera's intrinsic parameter matrix, a 3 × 3 linear projection matrix, so K^{-1} X_t gives the corresponding point on the normalized projection plane; n is the normal vector of the ground plane equation; and h is the camera height above the ground. Both n and h can be obtained through extrinsic parameter calibration. Using these parameters, we can recover the depth information of the 3D points corresponding to the ground features under the near-ground plane assumption.
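As a minimal sketch of this back-projection (the function name and matrix layout are illustrative; the paper's system is implemented in C++, and Python/NumPy is used here only for brevity):

```python
import numpy as np

def ground_backproject(x_pix, K, n, h):
    """Recover the 3D camera-frame point of a ground pixel under the
    near-ground-plane assumption.

    x_pix : (u, v) pixel of a ground feature (e.g. a lane-line point)
    K     : 3x3 camera intrinsic matrix
    n     : unit normal of the ground plane in the camera frame
    h     : camera height above the ground (from extrinsic calibration)
    """
    # Point on the normalized projection plane: K^-1 [u, v, 1]^T
    x_norm = np.linalg.solve(K, np.array([x_pix[0], x_pix[1], 1.0]))
    # Ground plane in the camera frame satisfies n . P = h,
    # so the depth scale is lambda = h / (n . x_norm).
    lam = h / (n @ x_norm)
    return lam * x_norm  # 3D point lying on the ground plane
```

For example, with the normal pointing toward the ground, the recovered point's height component equals the calibrated camera height h, confirming it lies on the assumed plane.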

Aerial Element Ranging
The ranging of aerial features is accomplished through target tracking and ego-motion triangulation. Target tracking is based on perceptual semantic detection instances, meaning that only aerial semantic elements detected through perception are tracked. In monocular vision, triangulation introduces scale uncertainty. To address this, the algorithm leverages ego-motion to obtain the relative pose between two tracking frames, which is equivalent to obtaining the baseline between the two observations (as in binocular vision). This approach resolves the scale problem of monocular ranging.
The derivation formula is:

p_i^{w*} = \arg\min_{p_i^w} \sum_{j=1}^{M} \left\| z_{ij} - \pi\!\left( R_w^{c_j} p_i^w + t_w^{c_j} \right) \right\|^2

where \{R_w^{c_j}, t_w^{c_j}\}, j = 1, ..., M, is the pose of the monocular camera at different times, which can be obtained from the ego-motion calculation, and p_i^w denotes the spatial position of the 3D feature points of semantic instance i in the world coordinate system. Examples include, but are not limited to, traffic signs, lights, and poles. Semantic segmentation and detection are implemented on the same network. z_{ij} denotes the tracked 2D feature points; the optical flow method is used to track and determine the multi-frame 2D semantic observations belonging to the same aerial landmark, where j indexes the multi-frame observations belonging to the same semantic instance i. π(·) is the reprojection function, and the Gauss-Newton method or the LM method is usually used to minimize the reprojection error and find the optimal p_i^w. After extracting the 3D semantic features from consecutive multi-frame images, they are transformed into the world coordinate system, with the first frame as the origin, using the ego-motion information. Then, the transformed multi-frame semantic elements are merged to generate a local semantic map [33]. An example of the local semantic map is illustrated in Figure 4.
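The optimal p_i^w is found by iterative minimization (Gauss-Newton or LM); as a sketch, the linear DLT triangulation that commonly initializes such a solver can be written as follows (illustrative names, not the paper's implementation):

```python
import numpy as np

def triangulate_dlt(poses, z_obs, K):
    """Linear (DLT) triangulation of one aerial landmark from M tracked views.

    poses : list of (R_cw, t_cw) with p_c = R_cw @ p_w + t_cw (from ego-motion)
    z_obs : list of (u, v) pixel observations of the same semantic instance
    K     : 3x3 camera intrinsic matrix
    Returns the landmark position p_w in the world frame.
    """
    A = []
    for (R, t), (u, v) in zip(poses, z_obs):
        P = K @ np.hstack([R, t.reshape(3, 1)])  # 3x4 projection matrix
        # Each observation contributes two linear constraints on X_homogeneous.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                  # null-space vector = homogeneous solution
    return X[:3] / X[3]
```

With noise-free observations and a non-degenerate baseline, this recovers the landmark exactly; with noise, it serves as the initial value for the reprojection-error minimization above.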

Fusion Mapping on Cloud Server
In this section, we will discuss the steps and improved algorithm of fusion mapping on-cloud. The following are the details of the method: (1) First, all local semantic maps (LSMs) generated by the vehicle are transferred from the local coordinate system to the global coordinate system through global pose graph optimization, which ensures that all LSMs are aligned to the same global reference frame. (2) Next, semantic alignment or trajectory alignment is performed to improve the position accuracy of semantic elements in the global coordinate system by optimizing the poses of multiple trajectories. This step is crucial for determining the quality of the resulting map. (3) Finally, semantic elements are aggregated to instantiate aligned semantic elements and generate vector semantic maps.
The structure of on-cloud fusion mapping is illustrated in Figure 5.

Global Pose Graph Optimization
In order to build a globally consistent semantic map, it is necessary to unify the local trajectories and semantic elements of the LSMs from different vehicles and time periods with the global coordinate system (ENU coordinate system), since they are relative to the vehicle body coordinate system at a certain time. This is achieved by aligning the odometry poses of the LSMs with the global coordinate system. To accomplish this, we construct a global pose graph optimization by fusing the LSM-associated local trajectory and the GPS trajectory, as shown in Figure 6a. The optimization problem can be expressed as:

\chi^* = \arg\min_{\chi} \sum_{t} \left\| e_l(x_{t-1}, x_t) \right\|^2_{\Omega_t^l} + \sum_{t} \left\| e_g(x_t) \right\|^2_{\Omega_t^g}

where Ω_t^l and Ω_t^g denote the covariance of the odometry error and the GPS measurement error, respectively, (p_{t−1}^l, q_{t−1}^l) represents the local pose at time t−1, and e_l and e_g represent the relative pose constraint and the global absolute position constraint, respectively. The optimization goal is to find the trajectory that best satisfies both the relative odometry constraints and the absolute GPS constraints.
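A toy one-dimensional analogue of this odometry/GPS fusion illustrates the structure of the least-squares problem (scalar weights stand in for the inverse covariance matrices; all names are illustrative):

```python
import numpy as np

def fuse_odom_gps(odom_rel, gps_abs, w_odom=100.0, w_gps=1.0):
    """Toy 1-D pose-graph optimization: positions x_t are estimated from
    relative odometry constraints e_l = (x_t - x_{t-1}) - d_t and absolute
    GPS constraints e_g = x_t - g_t, each row weighted by sqrt(weight)
    so the normal equations match the covariance-weighted cost."""
    n = len(gps_abs)
    A, b = [], []
    for t, d in enumerate(odom_rel):           # relative constraints
        row = np.zeros(n)
        row[t], row[t + 1] = -1.0, 1.0
        A.append(np.sqrt(w_odom) * row)
        b.append(np.sqrt(w_odom) * d)
    for t, g in enumerate(gps_abs):            # absolute GPS constraints
        row = np.zeros(n)
        row[t] = 1.0
        A.append(np.sqrt(w_gps) * row)
        b.append(np.sqrt(w_gps) * g)
    x, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return x
```

The real system solves the SE(3) version of this problem; the structure (a smooth odometry chain anchored by sparse absolute measurements) is the same.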

Semantic Alignment Algorithm
Due to the low measurement accuracy of low-cost sensors, even though the LSM data constructed from different trajectories are converted to the unified global coordinate system in the cloud, a certain pose deviation remains, as shown in Figure 5c-c1. Therefore, it is necessary to use a semantic alignment fusion algorithm to correct the relative relationship between trajectories. The semantic alignment algorithm is the core of on-cloud fusion mapping using crowdsourced data and directly determines the accuracy of fusion mapping. To address this issue, we propose a two-stage, multi-trajectory pose optimization algorithm, called global bundle adjustment based on semantic GICP (GBA_SGICP), which builds on and improves SICP and GICP.
In the first stage, we add semantic prior information to GICP, use different loss functions for different element types, such as ground arrows and traffic signs, and build a loss function based on the point-to-plane distance to reduce the point-cloud matching error. To make the registration position converge to a more accurate value, we use a semantic pyramid hierarchy, where the upper semantic registration result serves as the initial value of the lower registration. We refer to this improved algorithm as the semantic GICP (SGICP).
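A sketch of the two ingredients just described, the semantically weighted point-to-plane cost and the pyramid hand-off of initial values, might look as follows (the weight table and function signatures are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Illustrative per-class confidence weights (not the paper's values).
SEMANTIC_WEIGHTS = {"lane_line": 1.0, "ground_arrow": 0.8,
                    "traffic_sign": 0.6, "pole": 0.6}

def point_to_plane_cost(T, src, tgt, tgt_normals, weights):
    """Weighted point-to-plane cost over one set of semantic correspondences.
    T: 4x4 rigid transform; src/tgt: Nx3 matched points; tgt_normals: Nx3
    unit normals at the target points; weights: per-pair confidence."""
    src_T = (T[:3, :3] @ src.T).T + T[:3, 3]
    residual = np.einsum("ij,ij->i", tgt_normals, src_T - tgt)
    return np.sum(weights * residual ** 2)

def sgicp_pyramid(levels, register):
    """Semantic pyramid: the registration result of the upper semantic
    level initializes the next level. `levels` is an ordered list of
    (src, tgt, normals, weights); `register` is any single-level
    GICP-style solver taking an initial transform T_init."""
    T = np.eye(4)
    for src, tgt, normals, weights in levels:
        T = register(src, tgt, normals, weights, T_init=T)
    return T
```

The cost is zero for perfectly aligned correspondences and grows with the normal-direction offset, which is what makes point-to-plane matching tolerant of sliding along planar elements such as lane surfaces.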
Next, we use the SGICP algorithm to calculate the inter-relative pose constraint (InterRPC) between trajectories using the pairwise method. We then add the relative constraints in each trajectory to construct a graph optimization to obtain a more accurate pose, which improves the robustness of the algorithm in different scenarios and enhances the convergence speed of the algorithm [34,35]. Figure 3 shows that the semantic alignment fusion algorithm corrects the relative relationship between trajectories, which is essential for achieving accurate fusion mapping.
The SGICP algorithm solves the problem as follows. Let P_s be the source point cloud (the point cloud to be registered), P_t the target point cloud, P_s^k and P_t^k the source and target point clouds of the k-th layer of the pyramid, respectively, and T ∈ SE(3) the required rotation-translation matrix. Then

T^* = \arg\min_{T} \sum_{i \in A(l)} w_l \left( n_i^{\top} \left( T p_{s,i}^k - p_{t,i}^k \right) \right)^2

where A(l) is the matching relationship between P_s^k and P_t^k, usually obtained with the K-nearest-neighbor algorithm, i indexes the semantic point-cloud pairs matched by A(l), w_l denotes the confidence weight of the different semantic elements, and n_i is the unit normal at the matched target point.

The multi-trajectory pose optimization can be described using a probabilistic model as

X^* = \arg\max_{X} p(X \mid Z) (8)

We assume that the measurement model follows a Gaussian distribution. The relative pose constraint probability model within a trajectory is given by

p(z_t^l \mid x_{t-1}, x_t) \propto \exp\left( -\tfrac{1}{2} e_l^{\top} \Omega_l^{-1} e_l \right) (9)

and the relative pose constraint probability model between different trajectories is given by

p(z_{ij}^s \mid x_i, x_j) \propto \exp\left( -\tfrac{1}{2} e_s^{\top} \Omega_s^{-1} e_s \right) (10)

X^* under the maximum a posteriori probability is the optimal solution we obtain. Taking the negative logarithm of Formula (8) and combining Formulas (9) and (10) converts it into a nonlinear least-squares problem

X^* = \arg\min_{X} \sum_{t} \| e_l \|^2_{\Omega_l} + \sum_{i,j} \| e_s \|^2_{\Omega_s} (11)

Considering the robustness of the algorithm, we add a filter function ρ(·) to the loop-closure pose edges in Formula (11). Pairwise Consistency Measurement (PCM) [30], based on maximum clique, is used to filter out abnormal or low-precision loop-closure pose edges.
The optimized pose χ = (R, t) can be obtained by solving the aforementioned nonlinear least-squares problem. The position error of semantic elements caused by trajectory errors can then be corrected using the formula P′ = RP + t, thus achieving alignment of the semantic elements observed by multiple trajectories. To perform the iterative optimization, we use the Levenberg-Marquardt algorithm [36]. The algorithm is implemented using the third-party C++ toolkit GTSAM [37].
In the second stage, we use the pose calculated in the first stage as the initial value and add the spatial constraints of semantic elements to construct a global bundle adjustment optimization. This method, called GBA_SGICP (global bundle adjustment based on semantic GICP), optimizes both the relative error between multiple trajectories and the overall absolute error. This approach further improves the semantic alignment accuracy, as shown in Figure 6b. Unlike traditional 2D-3D bundle adjustment [38], we use direct bundle adjustment [39], with a point-to-point residual

e_{pt}(x_i, x_j) = \sum_{k} \left\| T(x_i) p_s^k - T(x_j) p_t^k \right\|^2 (12)

and a point-to-plane residual

e_{pl}(x_i, x_j) = \sum_{k \in N_P} \left( n_p^{k\top} \left( T(x_i) p_s^k - T(x_j) p_t^k \right) \right)^2 (13)

where N_P is the set of all semantic elements belonging to the point-plane matching relationship and n_p^k denotes the unit normal vector of the plane in which the position p_t^k lies.
Let χ = {x_1, x_2, ..., x_n} be the set of all pose nodes of the multiple trajectories, with x_s, x_t ∈ χ. All matching pairs and initial pose values in (12) and (13) are derived from the first-stage optimization. The trajectory pose optimization equation of the second stage is then obtained from (12) and (13):

\chi^* = \arg\min_{\chi} \sum_{m,n \in R} \sum_{i \in m,\, j \in n} \rho\big(e_{pt}(x_i, x_j)\big) + \rho\big(e_{pl}(x_i, x_j)\big) (14)

where R denotes the set of trajectories participating in fusion optimization, m, n ∈ R with m ≠ n index the m-th and n-th trajectories, x_i, x_j denote the i-th pose node of the m-th pass and the j-th pose node of the n-th pass, respectively, and e_{pt}, e_{pl} are the point-to-point and point-to-plane residuals of (12) and (13). The Huber loss function ρ can be defined as

\rho(e) = \begin{cases} \tfrac{1}{2} e^2, & |e| \le \delta \\ \delta\left( |e| - \tfrac{1}{2}\delta \right), & |e| > \delta \end{cases}

By employing the aforementioned optimization, we obtain more precise trajectory poses, resulting in the improved alignment of semantic elements observed by multiple trajectories, as depicted in Figure 5c-c2.
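The robust Huber kernel mentioned above can be sketched directly (delta is an illustrative threshold; its value would be tuned to the expected residual scale):

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber kernel: quadratic for small residuals, linear for large ones,
    so a bad loop-closure edge cannot dominate the optimization."""
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * e ** 2, delta * (a - 0.5 * delta))
```

For an inlier residual of 0.5 the kernel behaves like a least-squares cost (0.125); for an outlier residual of 3.0 it grows only linearly (2.5 instead of 4.5), which is exactly the down-weighting effect exploited in (14).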

Semantic Aggregation
After performing semantic alignment, the semantic elements observed by multiple tracks achieve good consistency in the global coordinate system, as shown in Figure 4A. However, this is still a composition of many semantic elements. To obtain a lightweight vector semantic map, it is necessary to aggregate the fusion map after multiple semantic alignments. The process of semantic aggregation includes instantiation and vectorization.

Instantiation
In this section, we propose a lane instantiation algorithm that is based on clustering and uses logical constraints to address the complexity of lane instantiation. The algorithm consists of a two-stage DBSCAN (density-based spatial clustering of applications with noise) [39] method.
In the first stage, we expand each LSM lane element laterally, to the left and right of the vehicle, by a fixed buffer (20 cm) to form buffered line objects, which we denote as B. The input dataset is denoted as D = {b_1, b_2, ..., b_n}, where each b_i is a sample. We use the overlap area between two samples of D as the distance measurement of DBSCAN, expressed as distance(b_i, b_j) = overlap_area(b_i, b_j). The resulting clusters are marked as C = {c_1, c_2, ..., c_n}, as shown in Figure 6b.
The LSM lane lines with noise can be detected and eliminated in the first stage of clustering. Each cluster generated in the first stage forms a MultiLine object. In the second stage of clustering, we use the cluster set formed in the first stage as input and define the distance measurement of clustering as the Euclidean distance between two MultiLine objects. Considering the logical relationship between lane lines in the real world, a lane line can be connected to at most two subsequent lane lines, as when a new lane instance is generated at a bifurcation.
We then adjust the definition of the core object in the DBSCAN algorithm: if the number of samples in the ϵ-neighborhood of a sample satisfies MinPts ≤ |N_ϵ(c_j)| ≤ MaxPts (with MinPts = 1 and MaxPts = 2), we add the sample c_j to the core object set. The clusters generated by the second stage of clustering, C′ = {c′_1, c′_2, ..., c′_n}, are the final instantiation objects of the lanes, as shown in Figure 4C. This method can handle any road scene, including branching roads, and is more robust than other clustering methods.
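The modified core-object condition can be sketched as follows (illustrative names; any pairwise distance, such as the Euclidean distance between MultiLine objects, can be passed in):

```python
def core_objects(samples, dist, eps, min_pts=1, max_pts=2):
    """Stage-two core-object test with the modified DBSCAN condition
    MinPts <= |N_eps(c_j)| <= MaxPts, encoding the lane-logic constraint
    that a lane line connects to at most two successor lines (e.g. at
    a bifurcation). `samples` holds the stage-one cluster representatives;
    `dist` is any pairwise distance function. Returns core-object indices."""
    cores = []
    for j, cj in enumerate(samples):
        # Count neighbors within eps, excluding the sample itself.
        neighbors = [i for i, ci in enumerate(samples)
                     if i != j and dist(ci, cj) <= eps]
        if min_pts <= len(neighbors) <= max_pts:
            cores.append(j)
    return cores
```

The upper bound MaxPts is the departure from standard DBSCAN: a sample with more than two close neighbors is not treated as a core object, which prevents unrelated parallel lane lines from being merged into one instance.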

Vectorization
Vectorization is the process of converting the same instance formed after clustering into a unique vector semantic element. For the lane line, we use third-order Bézier curve fitting to vectorize, which generates a smooth curve that closely approximates the lane line. The third-order Bézier curve can be expressed as:

B(t) = (1 - t)^3 P_0 + 3(1 - t)^2 t P_1 + 3(1 - t) t^2 P_2 + t^3 P_3, \quad t \in [0, 1]

where P_0, P_1, P_2, P_3 are the control points that define the shape of the curve and t is a parameter that varies between 0 and 1. We select four points on the lane line instance as the control points and use the least-squares method to optimize their positions to fit the Bézier curve. The resulting curve is then used to represent the lane line in the vector semantic map. The vectorization result is shown in Figure 6d.
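A least-squares cubic Bézier fit of this kind can be sketched as follows (chord-length parameterization of t is our assumption; the paper does not specify how t is assigned to the sample points):

```python
import numpy as np

def fit_cubic_bezier(points):
    """Least-squares fit of a cubic Bezier curve to ordered lane-line points.
    Parameters t are assigned by normalized chord length; the four control
    points P0..P3 are solved linearly from the Bernstein basis."""
    pts = np.asarray(points, dtype=float)
    d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))]
    t = d / d[-1]
    B = np.stack([(1 - t) ** 3, 3 * (1 - t) ** 2 * t,
                  3 * (1 - t) * t ** 2, t ** 3], axis=1)
    ctrl, *_ = np.linalg.lstsq(B, pts, rcond=None)
    return ctrl  # rows: P0, P1, P2, P3

def bezier_eval(ctrl, t):
    """Evaluate the fitted curve B(t) at parameter values t."""
    t = np.atleast_1d(t)[:, None]
    B = np.hstack([(1 - t) ** 3, 3 * (1 - t) ** 2 * t,
                   3 * (1 - t) * t ** 2, t ** 3])
    return B @ ctrl
```

For noisy lane points the linear solve gives the least-squares-optimal control points in one step; an exactly representable input (e.g. collinear points) is reproduced without error.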
To instantiate discrete semantic elements, such as ground arrows, poles, and traffic signs, we define the distance measurement of density clustering based on their spatial characteristics. For ground arrows, we use the intersection area between two samples, i.e., the overlapping area between two rectangular boxes. After clustering, the weighted-average method is used for vectorization. All spatial relationships in the clustering processes are calculated with the C++ Boost.Geometry library.

Testing and Result Analysis
We conducted experiments, using crowdsourced data collected in real-world scenarios, to evaluate the effectiveness of our proposed GBA_SGICP algorithm and the capacity of fusion mapping on a cloud server. For this experiment, we used three vehicles equipped with the same sensors, including GPS, IMU, wheel speed encoder, and monocular camera, to collect data. The test scenario was a typical complex overpass scenario, as shown in Figure 7. To evaluate the quality of our automated crowdsourced mapping approach, we used the map produced by high-precision equipment acquisition vehicles and manually verified it as the ground truth. We used the following evaluation metrics: relative precision, semantic completeness, and semantic redundancy.

a. Relative precision measures the accuracy of the relative position relationship between map elements, such as the lane width or the distance between two adjacent poles. It is an important indicator of whether the map accuracy meets the requirements of autonomous driving. In this paper, we define relative precision as a comprehensive index of horizontal and vertical precision.
b. Semantic completeness measures how fully the semantic elements of the real world are captured in the map. For discrete-type elements (such as signboards), it is calculated as Sum_Num_obs / N; for continuous-type elements (such as lane lines), it is calculated as Sum_Len_obs / Σ_{i=1}^{N} Len_truth_i, where N is the number of instances of semantic elements of the same type that actually exist in the real world and Sum_Num_obs (Sum_Len_obs) is the matched number (length) of such elements observed in the map.
c. Semantic redundancy measures the elements that do not exist in the real world but exist in the constructed map.
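Under this reading of the completeness and redundancy metrics, a minimal sketch (function names are ours, not the paper's):

```python
def completeness_discrete(num_matched, num_truth):
    """Completeness for discrete elements (signs, poles): matched mapped
    instances over ground-truth instances (Sum_Num_obs / N)."""
    return num_matched / num_truth

def completeness_continuous(len_matched, truth_lengths):
    """Completeness for continuous elements (lane lines): matched mapped
    length over total ground-truth length (Sum_Len_obs / sum Len_truth_i)."""
    return len_matched / sum(truth_lengths)

def redundancy(num_spurious, num_mapped):
    """Redundancy (our assumed normalization): share of mapped elements
    with no real-world counterpart."""
    return num_spurious / num_mapped
```

For example, 9 matched signs against 10 real ones gives 90% completeness, and 1 spurious sign among 10 mapped ones gives 10% redundancy.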
We conducted a comparison between our proposed GBA_SGICP algorithm and different alignment algorithms on the same dataset, with each road covered 50 times in the data. We evaluated the mapping accuracy of four alignment algorithms on four semantic elements, including two ground and two air elements, as shown in Table 1. The results indicate that our proposed GBA_SGICP algorithm achieves the highest relative accuracy compared to the other alignment algorithms.
Table 2 presents the impact of the four alignment algorithms on the mapping efficiency using the average efficiency over various scenes, including high-speed and urban scenes. The results show that the GBA_SGICP algorithm can achieve a mapping efficiency of nearly 18 km/h, which is about three times that of the ICP alignment algorithm. In addition, we compare the low-cost, lightweight crowdsourcing mapping method proposed in this paper with two representative crowdsourced cloud mapping methods from previous studies, namely those of Qin et al. [14] and Herb et al. [15]. The comparison of the accuracy, completeness, and overall map construction efficiency of the three fusion mapping algorithms for different semantic elements is shown in Table 3. It shows that our method has the lowest accuracy error, a higher element completeness rate, and relatively high efficiency.
We also analyzed the completeness rate and redundancy rate of different semantic elements and found that ground elements, such as LaneMarking (lane lines and ground arrows) and StopLine, were slightly better than air elements, such as TrafficSign and Pole, as shown in Figure 8. This is because the quality of ground elements in the LSM generated on-vehicle is better than that of air elements.
The comparative analysis of the influence of different numbers of observations on fusion mapping on-cloud is shown in Figure 9. The results indicate that the accuracy of map elements improves as the number of observations increases. For instance, the relative error of fusing 50 observations is over three times lower than that of fusing 10, and the element completeness rate is more than double. When 100 fusions are reached, the relative accuracy of the lane line is close to 10 cm, and that of other elements is within 50 cm. Additionally, the completeness rate of elements is 95% on average, which reaches the mapping level of traditional high-precision sensors and can meet the needs of high-level autonomous driving. Therefore, increasing the number of observations can significantly improve the accuracy and completeness of fusion mapping.
We also analyzed crowdsourced fusion mapping on a section of an urban road in an actual scene, and the results are presented in Figure 10. Due to the accuracy limitations of low-cost sensors, even though the LSM data have been transferred to the global coordinate system through pose graph optimization, large errors still exist among the multiple observations of the same road section, as shown in Figure 10a. After semantic alignment optimization, however, the location consistency of repeatedly observed semantic elements is greatly improved, as shown in Figure 10b. Furthermore, after semantic aggregation, some abnormal or low-precision observations are filtered out, yielding a high-precision, lightweight semantic map, as shown in Figure 10c. In this experiment, we utilized our crowdsourced semantic mapping method to automatically build a semantic map covering approximately 25 km in a complex circular overpass scenario, as shown in Figure 11.
The semantic elements include various types of lanes (real lane lines, virtual lane lines, road edges, etc.), ground arrows, traffic signs, etc. To verify the actual usability of this map, we loaded the map onto a vehicle and conducted semantic positioning tests. The results show an average lateral error of 15 cm, a longitudinal error of 35 cm, and a heading angle error of 0.5 degrees, as shown in Figure 12. These results confirm the reliability and practicality of the crowdsourced mapping method proposed in this paper.

Conclusions
This paper proposes a low-cost and lightweight semantic HD mapping method for autonomous driving applications that is based on multi-source data fusion perception. The mapping system framework builds upon previous work that utilizes crowdsourced data, but the paper introduces an improved semantic alignment algorithm and an improved semantic aggregation algorithm, both evaluated in a practical scene. For fusion mapping on the cloud, we focus on the optimization of the poses of multiple trajectories and propose a semantic-GICP-based, two-stage semantic alignment algorithm, which significantly improves the semantic alignment accuracy, as well as a semantic aggregation algorithm based on lane-line logical constraints for lightweight vector map construction, which effectively suppresses the impact of abnormal data on semantic element instantiation. Compared with the state-of-the-art (Qin et al. and Herb et al.), our method reduces the relative accuracy error of fusion mapping by 30%, increases the element completeness rate by about 15%, and improves the mapping efficiency by about 10%. The experimental results demonstrate that our method is highly accurate, lightweight, and robust and can meet the application accuracy requirements of high-precision mapping. Our method also makes it convenient for users to segment and load maps.