High-Definition Map Representation Techniques for Automated Vehicles

Abstract: Many studies in the field of robot navigation have focused on environment representation and localization. The goal of map representation is to summarize spatial information in topological and geometrical abstracts. By providing strong priors, maps improve the performance and reliability of automated robots. Due to the transition to fully automated driving in recent years, there has been a constant effort to design methods and technologies that improve the precision of information about road participants and the environment. Among these efforts is the high-definition (HD) map concept. Making HD maps requires accuracy, completeness, verifiability, and extensibility. Because of the complexity of HD mapping, it is currently expensive and difficult to implement, particularly in an urban environment. In an urban traffic system, the road model is at least a map with sets of roads, lanes, and lane markers. While more research is being dedicated to mapping and localization, a comprehensive review of the various types of map representation is still required. This paper presents a brief overview of map representation, followed by a detailed literature review of HD maps for automated vehicles. The current state of autonomous vehicle (AV) mapping is encouraging; the field has matured to a point where detailed maps of complex environments are built in real time and have proven useful. Many existing techniques are robust to noise and can cope with a large range of environments. Nevertheless, there are still open problems for future research. AV mapping will continue to be a highly active research area essential to the goal of achieving full autonomy.


Introduction
The problem of mobile robot navigation has traditionally been approached by breaking it down into three parts: environment mapping, localization, and trajectory planning. For autonomous vehicles (AVs), accurate and reliable self-localization is critical [1]. In order to operate safely, AVs must precisely predict the future actions and/or trajectories of other road participants such as connected and nonconnected vehicles and pedestrians [2][3][4][5]. For instance, the ability to accurately predict pedestrian behavior is crucial to ensure safe autonomous driving solutions. However, this task is challenging because pedestrians' trajectories can, in general, change rapidly and lack temporal smoothness [6]. Accessing the environment information in the form of a prebuilt map can help with such challenging tasks. Furthermore, when combined with a prebuilt map, a high-precision self-localization solution can transform the difficult problem of perception and scene interpretation into a less complex positioning problem [7,8]. The criteria for achieving accurate self-localization on the map have been discussed in [9].
AVs aim to offer a safe and comfortable ride using the output of sensory units, a map, and a high-level route [10][11][12][13]. Meanwhile, in safety-critical applications such as self-driving cars, creating interpretable intermediate representations that explain why the car performed a given maneuver is critical for decision-making [14,15]. If the map is updated regularly and reliably, an AV can partially handle the autonomy problem offline by using map information for decision-making during maneuvers [16]. Furthermore, map data can be shared and updated by multiple AVs, allowing for real-time map updates and improving confidence in the accuracy of the map. Higher levels of autonomy require maps that are more detailed and that meet stricter quality standards. In this context, the solution for high-precision localization is to provide a unified representation that combines the agent dynamics, collected by perception and tracking systems, with the scene context, commonly provided as prior knowledge in the form of high-definition (HD) maps [17][18][19][20].
In contrast to the unified representation, other solutions use an end-to-end approach that creates an internally learned map representation of the world [21][22][23][24][25][26][27]. End-to-end approaches that learn such an internal mapping could help scale self-driving solutions that generalize and find optimal map representations for the driving task [28,29]. Towards this goal, the work in [30] is one of the earliest end-to-end systems and pioneered this field by using a neural network to directly control the AV. Current end-to-end solutions use simultaneous perception and prediction to provide outputs such as object tracking and predicted trajectories, or learn an intermediate semantic mapping that is used to control the AV, enabling end-to-end learning of the full autonomy system [25,28]. Other end-to-end solutions propose controlling the steering angle directly from raw camera images [27]. In [31], the authors proposed an end-to-end trainable long short-term memory (LSTM) [32] network for that purpose. The authors in [33] used a similar strategy and proposed a 3D convolutional neural network (CNN) model with LSTM layers and residual connections. Other studies proposed other combinations of LSTM, CNN, and 3D CNN models for end-to-end driving solutions [34,35]. These approaches focus on imitating human drivers and learning a hidden representation but are not interpretable.
An end-to-end learnable neural network can perform joint perception, prediction, and motion planning for AVs while producing interpretable intermediate representations. The interpretable representations are used by the planner and help to explain the AV decisions [21,23]. In [21], the authors presented an end-to-end approach for predicting intermediate representations in the form of an online map as well as agents' dynamics and their current and future states. The solution produced probabilistic intermediate representations that were interpretable and ready to use for the motion planner. Although directly outputting driving commands is a general solution, it may have stability and robustness issues, and a combination of an HD map and internal latent representations (feature map) can be advantageous [22] and can also be learned end-to-end from human demonstrations [23]. This is accomplished through the use of a novel differentiable semantic occupancy representation, which is explicitly used as a cost in the motion-planning process.
It is also common to rasterize maps into a top-down view or bird's-eye view (BEV) map, which can be referred to as a 3D top-view map that respects the nature of the data, making the learning process easier as it can leverage priors about objects' geometry [23,24][36][37][38][39][40]. Because height localization information is less valuable for AVs, the relevant information that an AV requires for decision-making could be suitably encoded using a BEV map representation. Using a BEV as an output of the perception module will result in an interpretable and easy-to-use representation for prediction and motion-planning modules [24,37,40].
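To make the BEV idea concrete, the following is a minimal sketch of rasterizing 3D points into a top-down grid by discarding the height coordinate. The function name `rasterize_bev`, the vehicle-frame axis convention, and the default ranges and cell size are assumptions chosen for illustration and are not taken from any of the cited works.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, assumed vehicle frame


def rasterize_bev(
    points: List[Point],
    x_range: Tuple[float, float] = (0.0, 4.0),
    y_range: Tuple[float, float] = (-2.0, 2.0),
    cell: float = 1.0,
) -> List[List[int]]:
    """Count points per ground-plane cell; the height z is ignored."""
    n_rows = int((x_range[1] - x_range[0]) / cell)
    n_cols = int((y_range[1] - y_range[0]) / cell)
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:
        # Skip points that fall outside the rasterized region.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        row = int((x - x_range[0]) / cell)
        col = int((y - y_range[0]) / cell)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    cloud = [(0.5, -1.5, 0.2), (0.7, -1.2, 1.8), (3.9, 1.9, 0.0), (5.0, 0.0, 0.0)]
    for row in rasterize_bev(cloud):
        print(row)
```

Practical BEV encoders usually store richer per-cell channels (for example, semantic class or maximum height) rather than a simple point count; the count is used here only to keep the sketch short.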
Despite significant progress in this area, it still presents significant challenges due to the nature of sensor noise and practical constraints during map creation. Existing mapping algorithms are often surprisingly complex, both from a mathematical and from an implementation point of view. As a result, novel map representations are required for the full adoption of AVs. This paper provides a general review of the most common map representation approaches, with a focus on AV mapping. The contributions of the paper are as follows:
• This paper describes and compares different map representation approaches and their applications, such as highly/moderately simplified map representations, which are primarily used in the robotics domain.
• We provide a detailed literature review of HD maps for automated vehicles, as well as the structure of their various layers and the information contained within them, based on different companies' definitions of an HD map.
• We discuss the current limitations and challenges of the HD map, such as data storage and map update routines, as well as future research directions.

Real-Time (Online) Mapping
Robotics applications frequently necessitate real-time processing. This means that the input data must be processed at a rate that is faster than or equal to the input data rate to avoid frame dropping [41]. Real-time mapping allows the robot to map out unknown environments and perform localization in that map at the same time. However, as the action of driving gradually transfers from humans to machines, the role and scope of maps extend beyond navigation. As a result, offline map-based approaches have received more attention in most AV applications during the last decade. The computational problem of constructing or updating a map of an unknown environment while simultaneously tracking an agent's location within it is known as simultaneous localization and mapping (SLAM).

Simultaneous Localization and Mapping (SLAM)
In many applications, such as indoor robot navigation, offline maps are not available [42,43]. The agent can utilize SLAM to construct a map on the fly from raw sensory data (mapping), while also using that constructed map to keep track of its location (localization) [44][45][46][47][48][49][50]. SLAM techniques perform well over short distances, but because each pose estimate depends on the previous ones, they accumulate error over longer distances; loop-closure modules in SLAM systems (together with pose graph optimization) compensate for these errors and correct the accumulated drift. The map representations employed in SLAM techniques can vary widely, but the most important distinction is whether they are 2D- or 3D-oriented, or a combination of both. It is reasonable to suppose that SLAM will perform better when paired with a combination of sensors (GPS, IMU, LiDAR, and radar). Figure 1 shows a result of SLAM using the Cartographer package [51,52] and a Hokuyo 2D LiDAR. The details of the SLAM method are discussed in [52].
ORB-SLAM is among the most well-known mapping and localization systems that operate in real time while keeping localization and tracking accuracy at a desirable level [53][54][55][56]. In [57], the authors showed that a multilayer perceptron (MLP) could be used as the only scene representation in a real-time SLAM system using a hand-held RGB-D camera.

Highly/Moderately Simplified Map Representations
This category of maps is mainly utilized in the robotics domain and can be classified into three subcategories: topological maps, metric maps, and geometric maps.

Topological Maps
Topological maps are mainly graph-based representations that deal exclusively with places and their interactions [80][81][82] and describe the environment as a collection of nodes (locations) connected by edges [83]. An edge between two nodes is labeled with a probability distribution over the relative locations of the two poses, conditioned on their mutual measurements [84]. A world representation based on this simplification makes the map extension easier and provides the required information for path planning and motion prediction [85][86][87][88][89]. Because of this reduction of the world model, however, topological representations lose the sense of proximity and lack explicit information regarding the space's occupancy. Many authors have approached this problem by storing additional data or combining it with metric maps [83,90].
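As an illustration of how a node-and-edge representation supports path planning, the sketch below stores a small topological map as an adjacency dictionary and finds the cheapest route with Dijkstra's algorithm. The place names and edge costs are hypothetical, and the scalar edge weights are a simplification: the text above describes edges labeled with probability distributions over relative poses, which this sketch does not model.

```python
import heapq
from typing import Dict, List, Tuple

# Adjacency dictionary: node -> {neighbour: traversal cost}
Graph = Dict[str, Dict[str, float]]


def shortest_route(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm; returns (total cost, list of visited places)."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []  # goal not reachable from start


# Hypothetical indoor topological map (costs in metres).
topo_map: Graph = {
    "lobby": {"corridor": 4.0},
    "corridor": {"lobby": 4.0, "lab": 3.0, "kitchen": 6.0},
    "lab": {"corridor": 3.0, "kitchen": 2.0},
    "kitchen": {"corridor": 6.0, "lab": 2.0},
}

if __name__ == "__main__":
    print(shortest_route(topo_map, "lobby", "kitchen"))
```

Note that the graph says nothing about free space between places, which is the occupancy limitation mentioned above.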

Metric Maps
Contrary to topological maps, in metric maps, the objects are represented with precise coordinates. Such maps contain all the required information for a mapping or navigation algorithm to function [91]. In these methods, the map size is directly proportionate to the region of interest's area. Therefore, mapping vast areas, especially in a 3D representation, is computationally expensive. Landmark-based maps, occupancy grid maps, and geometric maps are the most popular metric mapping methods.

Landmark-Based Maps
Landmark-based representations, also known as feature-based representations, are used to identify and maintain the poses of specific distinguishing landmarks [92][93][94][95][96]. A prerequisite of these representations is that the landmarks be unique and identifiable by the robot perception system. Landmarks can be defined as sophisticated descriptors, rather than raw sensor data. Points, lines, and corners can be used to create a minimalist description of the landscape. Some methods, such as that in [97], applied topological mapping to landmark-based maps.

Occupancy Grid Maps
Occupancy grid maps [98] divide the environment into so-called grid cells. Each cell contains data about the area it covers [58]. Figure 2 shows an example of a simple grid map. It is typical to save a single value in each cell that represents the likelihood of an obstacle being there. Traditional probability-based techniques, such as particle or Kalman filters, are most typically used to combine input from several sensors and to localize against a known prior map [59,60,64,69][99][100][101][102][103].
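The per-cell likelihood described above can be maintained incrementally. The sketch below uses the widely taught log-odds form of the Bayesian occupancy update, which is swapped in here for illustration and is not a method the text attributes to any cited work. The class name `OccupancyGrid` and the sensor-model probabilities `p_hit` and `p_miss` are assumptions.

```python
import math
from typing import List


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def prob(log_odds: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))


class OccupancyGrid:
    """2D grid storing one occupancy log-odds value per cell (prior 0.5)."""

    def __init__(self, rows: int, cols: int, p_hit: float = 0.7, p_miss: float = 0.4):
        self.log_odds: List[List[float]] = [[0.0] * cols for _ in range(rows)]
        self.l_hit = logit(p_hit)    # evidence added when a cell is observed occupied
        self.l_miss = logit(p_miss)  # evidence added when a cell is observed free

    def update(self, row: int, col: int, hit: bool) -> None:
        # Bayesian update in log-odds form is a simple addition.
        self.log_odds[row][col] += self.l_hit if hit else self.l_miss

    def probability(self, row: int, col: int) -> float:
        return prob(self.log_odds[row][col])


if __name__ == "__main__":
    grid = OccupancyGrid(2, 2)
    grid.update(0, 0, hit=True)
    grid.update(0, 0, hit=True)
    grid.update(1, 1, hit=False)
    print(round(grid.probability(0, 0), 3), round(grid.probability(1, 1), 3))
```

Storing log-odds rather than probabilities keeps each update to a single addition and avoids repeated multiplication of values close to 0 or 1.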
Occupancy grid maps can be either 2D or 3D [104]. A version known as 2.5D contains height information in an extended 2D grid cell map rather than being a pure 3D grid map [105]. Regular grids or sparse grids can be used to create grid maps. Regular grids discretize continuous space into cells with the same dimensions for the entire region, whereas sparse grids extend the concept of the regular grid by grouping regions with the same values in a tree-like fashion. This map can be used to predict multipedestrian movements [106] as well as obstacle crossing. In general, the occupancy grid maps can be categorized as follows:
• Octree: The octree encoding [107] is a 3D hierarchical octal tree structure capable of representing objects with any morphology at any resolution. Because the memory required for representation and manipulation is on the order of the area of the object, it is commonly employed in systems that require 3D data storage due to its great efficiency [71][72][73][75][76][77][108][109][110].
• Costmap: The costmap represents the difficulty of traversing different areas of the map. The cost is calculated by integrating the static map, local obstacle information, and the inflation layer, and it takes the shape of an occupancy grid with abstract values that do not represent any measurement of the environment. It is mostly utilized in path planning [111][112][113][114].
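The inflation step of a costmap can be sketched as follows: every occupied cell receives a lethal cost, and neighbouring cells receive a cost that decreases with distance. This is a simplified illustration only; the function name `inflate`, the constant `LETHAL`, the linear decay, and the use of Chebyshev distance are assumptions, and the merging of a static map with local obstacle information is not shown.

```python
from typing import List

LETHAL = 100  # assumed abstract cost assigned to occupied cells


def inflate(occupied: List[List[bool]], radius: int) -> List[List[int]]:
    """Spread a linearly decaying cost around every occupied cell."""
    rows, cols = len(occupied), len(occupied[0])
    cost = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not occupied[r][c]:
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        dist = max(abs(dr), abs(dc))  # Chebyshev distance in cells
                        value = LETHAL * (radius + 1 - dist) // (radius + 1)
                        # Keep the highest cost when inflation zones overlap.
                        cost[rr][cc] = max(cost[rr][cc], value)
    return cost


if __name__ == "__main__":
    occ = [[False] * 5 for _ in range(5)]
    occ[2][2] = True
    for row in inflate(occ, radius=1):
        print(row)
```

A path planner can then minimise the summed cell cost along a candidate path, which keeps the vehicle away from obstacles without forbidding nearby cells outright.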

Geometric Maps
Geometric maps attempt to represent the sensory data with discrete simplified geometric shapes such as circles or polygons [115]. They represent the surroundings efficiently without sacrificing too much information; however, they impede trajectory calculation and data management in general. As a result, as shown in Table 1, this method is rarely used in practice, and the occupancy grid map alternative is preferred [116,117].

High-Accuracy Map Representations
Currently, academics and manufacturers are working to develop advanced driver-assistance systems (ADAS) to attain a high level of autonomy in vehicles. Maps can be used for a variety of purposes, including lowering computation cost by providing offline maps as a prior, implementing safety measures, avoiding sensor range constraints, and sharing map data among different AVs, all of which can improve ADAS accuracy and reliability. According to [118,119], high-accuracy map representations can be loosely categorized based on their level of information into one of three categories: digital maps, enhanced digital maps, and HD maps. Traditional street maps, such as Google Maps, are digital maps. Road geometry, signage, lane design, and speed limits are all included in enhanced digital maps. Finally, HD maps incorporate all of the features found in the preceding categories, as well as a semantically segmented 3D representation of the agent's surroundings.
If the map is kept accurate and used intelligently, with an understanding of its own limitations, an HD map can be thought of as an extra sensor that is unaffected by environmental occlusions with a nearly perfect detection system.

Digital Maps
A conventional digital map is a traditional electronic street map and is offered by a variety of map providers, such as Google Maps. These are topometric (topological and metric) maps that encode street layout, names, and distances. It is worth noting that an automated car can still benefit from these prior maps, but they are unlikely to be a crucial facilitator of fully autonomous operation on their own (as opposed to HD maps). Even with an up-to-date digital map, the lack of positionally accurate and identifiable environment data (such as the location of a stop sign) limits the extent to which it can assist an automated vehicle. However, such a level of information is still sufficient for high-level navigation tasks, such as finding the shortest path from point A to point B. For more details, one can refer to [120].

Enhanced Digital Maps
An enhanced digital map is a conventional digital map that has been augmented with additional data, making it useful for both ADAS and AVs. Road speed restrictions, road curvature, lane structure, and road signage have all been added to a basic digital map [121,122]. The list below goes through each of these additions based on TomTom's ADAS map [123]. Due to the lack of a clear distinction between an enhanced digital map and an HD map, researchers classify any map that stores a 3D world representation as an HD map, while the rest are classified as enhanced digital maps.

High-Definition (HD) Maps
A high-definition (HD) map is a 3D representation of the world that supplements an enhanced digital map [124,125]. A combination of sensors, including LiDAR, radar, and cameras, can be used to create this representation [126]. A high positional accuracy, on the order of 10 cm, is a common feature of all HD maps [127]. Although technology constraints limit the highest possible accuracy of map features, a higher precision is always desirable.
An HD map can be as simple as a collection of accurate positions of road signs, lane markings, and guardrails in the surroundings, or be as complex as a dense semantically segmented LiDAR point cloud that stores the distance to every obstacle around the agent, as shown in Figure 3. For more information, one can refer to [128]. An HD map is usually divided into numerous layers, each of which contains different sorts of data. Figure 4 illustrates an HD map along with its layers, originally published in [130]. Furthermore, Figure 5 illustrates the layers of the HD map defined by HERE [131]. In Lyft's HD map, the five core layers are described as follows [132,133]:
• Base map layer: The entire HD map is layered on top of a standard street map.
• Geometric map layer: The geometric layer in Lyft's maps contains a 3D representation of the surrounding road network. This 3D representation is provided by a voxel map with voxels of 5 cm × 5 cm × 5 cm and was built using sensory data from LiDAR and cameras. Voxels are a cheaper alternative to point clouds in terms of required storage.
• Semantic map layer: The semantic map layer contains all semantic data, such as lane marker placements, travel directions, and traffic sign locations [23,134,135]. Within the semantic layer, there are three major sublayers:
- Road-graph layer;
- Lane-geometry layer;
- Semantic features layer, which includes all objects relevant to the driving task, such as traffic lights, pedestrian crossings, and road signs.
• Map priors layer: This layer adds to the semantic layer by integrating data that have been learned via experience (crowd-sourced data). For example, knowing the average time it takes for a traffic light to turn green, or the likelihood of coming across parked vehicles on the side of a narrow route, allows the AV to raise its "caution" while driving.
• Real-time knowledge layer: This is the only layer designed to be updated in real time, to reflect changing conditions such as traffic congestion, accidents, and road work.
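The storage advantage of the voxel representation mentioned for the geometric map layer can be illustrated with a minimal sketch that maps points to the integer indices of the 5 cm voxels containing them. The function name `voxelize` and the sample points are hypothetical; this is not Lyft's implementation, only an illustration of why several nearby points collapse into one stored voxel.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]   # (x, y, z) in metres
VoxelIndex = Tuple[int, int, int]    # integer voxel coordinates


def voxelize(points: Iterable[Point], voxel_size: float = 0.05) -> Set[VoxelIndex]:
    """Return the set of occupied voxels; duplicates within a voxel are merged."""
    return {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }


if __name__ == "__main__":
    cloud = [
        (0.01, 0.01, 0.01),
        (0.04, 0.02, 0.03),  # falls in the same voxel as the first point
        (0.06, 0.01, 0.01),
        (-0.01, 0.0, 0.0),
    ]
    voxels = voxelize(cloud)
    print(len(cloud), "points ->", len(voxels), "voxels")
```

The number of stored voxels is bounded by the occupied volume at the chosen resolution rather than by the number of raw sensor returns, which is the sense in which voxels are cheaper than raw point clouds.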
Based on a combination of the open-source Apollo software platform [136] and DeepMap's U.S. patent [137], another description of the core layers of the HD map is offered below.
• Occupancy map: A spatial 3D representation of the road and all physical objects around the road. This representation can be stored as a mesh geometry, point cloud, or voxels. The 3D model is essential to centimeter-level accuracy in the AV's location on the map.

Localization in HD Maps
Road DNA, proposed by TomTom [138], is one of the possible solutions for the localization problem in HD maps. In this method, the detailed 3D representation of the road environment with all features and depth information was compressed into a collection of 2D raster images, where the image intensity corresponded to the depth of a certain area of the environment. A 2D depth image was also created utilizing the agent's sensor data. The Road DNA solution allowed for a precise localization with substantially fewer data storage requirements, compared to using dense LiDAR point clouds. For an accurate localization, pattern matching algorithms were applied. Since significant structural changes occur less frequently in a road environment than appearance changes, depth images can be more resistant to environmental changes than raw camera images.
In [139], a robust ego-motion estimation technique using sensors and a map-matching technique with HD maps was presented. The authors proposed a new line segmentation matching model and a geometric correction approach for road markings obtained by an inverse perspective mapping (IPM) methodology for the map-matching technique with HD maps. Combining these two technologies increased robustness and accuracy, according to the authors' experiments.
The authors in [20] compared sensory scans to an HD map using a particle filter. Their study integrated data from an IMU and a GPS receiver to determine location. The root-mean-squared error (RMSE) of the localization accuracy was 2.8 m without an HD map prior, 1.5 m with an HD map and odometry (IMU), and 1.2 m with an HD map, odometry, and GPS. While the obtained accuracy was not as good as commercial methods, the results confirmed the significant effect of having prior HD maps on AV's localization.
For LiDAR-enabled self-driving cars, the iterative closest point (ICP) algorithm is commonly used to match a 3D LiDAR point cloud to a previously collected set of points in the map. The ICP algorithm is a least-squares optimizer that iteratively tries to determine the best rotation, scale, and translation to transform a set of incoming LiDAR points onto a reference set of points [140]. The strategies for aligning LiDAR points using ICP were discussed in [8]. The authors of [8] also utilized a Kalman filter to fuse sensor data.
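The alternation at the heart of ICP (match each incoming point to its nearest map point, then solve a least-squares alignment) can be sketched in 2D with only the standard library. This is a simplified illustration under stated assumptions: it estimates rotation and translation only (no scale), uses brute-force nearest-neighbour search, and works on 2D rather than 3D points. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]


def nearest(p: Point2, targets: List[Point2]) -> Point2:
    """Brute-force nearest neighbour of p among the target points."""
    return min(targets, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def best_rigid_transform(src: List[Point2], dst: List[Point2]) -> Tuple[float, float, float]:
    """Closed-form 2D least-squares rotation angle and translation for paired points."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy  # centred source point
        bx, by = qx - dx, qy - dy  # centred target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation maps the rotated source centroid onto the target centroid.
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)


def transform(points: List[Point2], theta: float, tx: float, ty: float) -> List[Point2]:
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp_2d(source: List[Point2], target: List[Point2], iterations: int = 20) -> List[Point2]:
    """Iteratively align source to target; returns the aligned source points."""
    current = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        theta, tx, ty = best_rigid_transform(current, matches)
        current = transform(current, theta, tx, ty)
    return current


if __name__ == "__main__":
    map_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
    scan = transform(map_points, 0.05, 0.1, -0.05)  # slightly misaligned copy
    aligned = icp_2d(scan, map_points)
    print(max(math.dist(a, b) for a, b in zip(aligned, map_points)))
```

Because the correspondences come from nearest-neighbour search, ICP generally needs a reasonable initial guess; with a large initial misalignment it can converge to a wrong local minimum.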
Finally, RTK (real-time kinematic) GPS can be used to obtain highly accurate localization. However, because RTK GPS relies on a network of ground stations to function properly, extra infrastructure is required to have AVs locate themselves accurately using this technique. In densely built urban environments, GPS is also vulnerable to dropouts, interference, and multipath reflection, which, although acceptable for long-range navigation planning, is insufficient for a second-by-second local-positioning-based control of AVs.

Limitations and Challenges
The broad range of traffic laws between countries, such as restrictions for turning left and right, is one of the challenges in generating HD maps [141][142][143]. The required data storage for HD maps causes another challenge. Google's Waymo AV, for example, collects about 1 GB of data every 20 s [144]. Since each AV has limited storage space, the vehicle must perform a dynamic map download and cache refresh routine as it travels across the surroundings. DeepMap's map-tiling technique [137] separates the whole HD map into map tiles and downloads the necessary map tiles based on the vehicle position to decrease the memory requirement. The third issue is exact vehicle localization inside the HD map, which is accomplished by comparing incoming sensor data with the current map and updating the map. The processing of incoming sensory data requires onboard high-performing processing resources, and the real-time execution of the commands needs a latency time of less than 10 ms [145].
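To put the storage figure in perspective, 1 GB every 20 s corresponds to 180 GB per hour of driving. The sketch below computes that rate and illustrates a position-based tile cache in the spirit of the tiling idea described above. The tile size, the square tile indexing, and the function names are assumptions for illustration; they are not taken from the DeepMap patent.

```python
import math
from typing import Set, Tuple

Tile = Tuple[int, int]  # integer tile indices on an assumed square grid


def data_rate_gb_per_hour(gigabytes: float = 1.0, seconds: float = 20.0) -> float:
    """Convert 'gigabytes per interval' into gigabytes per hour."""
    return gigabytes * 3600.0 / seconds


def tiles_needed(x: float, y: float, tile_size: float = 100.0, radius_tiles: int = 1) -> Set[Tile]:
    """Tiles within radius_tiles of the tile containing position (x, y)."""
    cx, cy = math.floor(x / tile_size), math.floor(y / tile_size)
    return {
        (cx + dx, cy + dy)
        for dx in range(-radius_tiles, radius_tiles + 1)
        for dy in range(-radius_tiles, radius_tiles + 1)
    }


def refresh_cache(cache: Set[Tile], x: float, y: float) -> Tuple[Set[Tile], Set[Tile]]:
    """Return (tiles to download, tiles to evict) for the new vehicle position."""
    needed = tiles_needed(x, y)
    return needed - cache, cache - needed


if __name__ == "__main__":
    print(data_rate_gb_per_hour())               # GB per hour at 1 GB / 20 s
    cache = tiles_needed(250.0, 130.0)           # initial 3 x 3 neighbourhood
    download, evict = refresh_cache(cache, 310.0, 130.0)
    print(sorted(download), sorted(evict))
```

With this scheme the vehicle holds only a fixed number of tiles at any time, so the memory requirement depends on the tile size and neighbourhood radius rather than on the size of the whole map.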
HD map update and maintenance is also a major challenge [141,146]. There are millions of kilometers of roads in the world, and many HD map modeling algorithms are proposed for highway scenarios and neglect input anomalies (such as bad lane marking paint, flattened curbs, and tree occlusion) and uncertainties around nonroad objects (such as construction zones, nearby vehicles, and trees). However, in reality, such uncertainties and anomalies are present in many urban and rural roads. Therefore, more efforts are needed to mitigate the effect of these problems.

Conclusions and Future Work
Both specialist mapping businesses and automated vehicle companies have started to generate HD maps for AVs. A wide range of HD map solutions is available or in development, ranging from lightweight HD map solutions that primarily store lane markings and lane logic (Atlatec, Apollo), to maps that include full 3D point cloud representations (Waymo). While the most comprehensive maps with full 3D representations provide the best assurance of safety, they are costly to generate and maintain and necessitate massive quantities of data. A layered strategy, in which precise 3D data are updated less frequently and a lower-memory 2D representation of the road network is updated considerably more frequently, could be the optimal answer.
In order to implement real-time safety-critical HD maps for AVs, some fundamental challenges must be overcome. These include providing a consistent communication system between agents and HD map providers to transfer an agent's location and corresponding semantic information in real time; a mechanism for informing HD map providers about changes to static road features (such as road signs) or anomalies and consequently rectifying such anomalies; and, finally, policy considerations on whether HD maps should be privately or publicly owned and operated.
In this paper, we reviewed the major map representations and important open problems in the field of HD map representation. The current state of AV mapping is encouraging; the field has matured to a point where detailed maps of complex environments are built in real time and have proven useful. Many existing techniques are robust to noise and can cope with a large range of environments. Nevertheless, there are still open problems for future research. It is encouraging to see new applications and innovations in map representation that can generalize to previously unseen scenarios, are scalable for real-time applications, and are applicable to unstructured, disaster, and extreme weather environments where many of the techniques described are ineffective. AV mapping will remain a highly active research area critical to achieving full autonomy.

Institutional Review Board Statement:
The study did not require ethical approval as it did not involve any humans or animals.

Informed Consent Statement:
The study did not require ethical approval as it did not involve any humans or animals.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.