Automatic Recognition and Geolocation of Vertical Traffic Signs Based on Artificial Intelligence Using a Low-Cost Mobile Mapping System

Abstract: Road maintenance is a key aspect of road safety and resilience. Traffic signs are an important asset of the road network, providing information that enhances safety and driver awareness. This paper presents a method for the recognition and geolocation of vertical traffic signs based on artificial intelligence and the use of a low-cost mobile mapping system. The approach developed includes three steps: First, traffic signs are detected and recognized from imagery using a deep learning architecture with YOLOv3 and ResNet152. Next, LiDAR point clouds are used to provide metric capabilities and cartographic coordinates. Finally, a WebGIS viewer was developed based on the Potree architecture to visualize the results. The experimental results were validated on a regional road in Ávila (Spain), demonstrating that the proposed method obtains promising, accurate and reliable results.


Introduction
Among the different expenditure items that are part of the life cycle of transport infrastructure, maintenance is one of the most important. The development of transport infrastructure is a major investment for public administrations, and their life cycle can span decades. For that reason, proper maintenance is essential to get the best return on these investments [1]. However, according to data gathered by the European Road Federation (ERF), the volume of investment in inland transport infrastructure has stalled since a significant cut after the 2008 crisis, when it reached its maximum [2]. This is especially worrying considering that the transport of goods and passengers has been steadily growing for the last decade, and the infrastructure is ageing under a context of maintenance budget cuts. According to a study by Calvo-Poyo et al. [3], spending on road maintenance not only prevents deterioration and prolongs the life of the infrastructure, but also increases road safety, reducing the death rate.
The use of new technologies for capturing, managing, and communicating information is essential to optimize the cost of infrastructure maintenance and to increase its security and resilience. Transport infrastructure digitalization is a key concept that is supposed to drive the transition towards the goals of the Sustainable and Smart Mobility Strategy of the European Union [4].
Intelligent Transportation Systems (ITS) apply information and communication technologies to the infrastructure, vehicles and users, interfacing between different modes of transportation and potentially offering data management capabilities for road maintenance, and they are an intrinsic part of the future of transport [5]. Together with ITS, remote sensing technologies are being extensively reported in the literature for transportation infrastructure maintenance and assessment. Image-based and point cloud approaches have complementary advantages and disadvantages. While semantic recognition in images can be performed accurately with DL-based techniques, the geolocation of signals is not as straightforward as in the case of point cloud-based methods. Therefore, it seems logical to merge both sources of information to carry out inventory tasks. One possible approach is based on detecting signals in the point cloud from the geometric and radiometric properties of traffic signs, and projecting the 3D information onto images where recognition is performed using ML or DL approaches, given that the point cloud and the images are synchronized with each other [30,31]. Other works have a complementary workflow, performing both detection and recognition on the images, and projecting the result onto the point cloud to geolocate the position of the road sign [8].
This paper presents a methodology for traffic signal inventory using MMS that follows the latter approach. First, the detection and recognition of traffic signs in images is carried out, and then the geolocation is performed on the point cloud, projecting the results obtained on the images. The contribution of this paper aims to close two gaps from previous works: (1) While previous works use high-end MMS (RIEGL VMX-450 in [30], Optech LYNX in [8,31]), which provide the calibration of their cameras and laser scanner system, this work uses a low-cost MMS with a commercial camera that was manually mounted together with the laser scanner. Thus, this methodology offers a complete workflow that includes the calibration between the camera and the laser scanner system to carry out the geolocation of traffic signs. (2) The 3D visualization of inventory results is a pending work in the literature. In this work, a 3D WebGIS based on Potree architecture [32] of large point cloud datasets is proposed.
This work is organized as follows. Section 2 will describe the case study data and the proposed methodology for traffic sign inventory. Section 3 will show and discuss the results obtained, focusing on the specific contributions that are being made. Finally, Section 4 will outline the conclusions and future directions of this work.

Case Study Data
The 3D point cloud data and 2D imagery for validation in this paper were acquired with a customized assembly of the Phoenix Scout Ultra 32 Mobile LiDAR System. It was a low-cost system equipped with a Velodyne VLP-32C laser scanner, which has a horizontal and vertical field of view of 360° and 40°, respectively, and 32 laser beams. The scanner had a scan rate of 600,000 measurements per second and operated at a wavelength of 903 nm. The complete specifications of the scanner can be found in reference [33]. The global positioning was solved with a single-antenna, dual-frequency RTK-GNSS receiver, whose position with respect to the laser and IMU was known and calibrated. A GNSS Topcon HiPer V, placed at less than 15 km from the trajectory, was used as a base for trajectory postprocessing with Inertial Explorer. A Sony A6000 camera was mounted manually, its position being fixed to that of the mobile mapping system. Due to the way the MMS was mounted on the vehicle and the need for calibration between the LiDAR and the camera, which requires the camera to point in the same direction as the LiDAR beams, the mounting of the camera was sub-optimal, pointing backwards and to the left with respect to the movement of the vehicle, hence offering a more challenging setup for traffic sign detection and recognition tasks (Figure 1a).
The case study consists of a point cloud and images acquired on a 6 km stretch of a regional road in the province of Ávila (Spain). The acquisition was performed in both directions, therefore the total trajectory was approximately 12 km (Figure 1b). The acquisition speed was adapted to the maximum speed limit of the road, which was 80 km/h. The point cloud contains approximately 222 million points. The image acquisition system was set to take one image every second, forming a dataset of 634 images.
Furthermore, a large image dataset was used to train the Deep Learning architectures employed in this work, with more than 56 thousand traffic sign images. The specific distribution of those images is outlined in Section 2.2.

Figure 1.
Case study. (a) Low-cost mobile mapping system, with a Velodyne VLP-32C laser scanner and a manually mounted Sony A6000 camera. (b) 6 km stretch of a regional road in the province of Ávila (Spain).

Methodology
The methodological approach of this work can be conceptualized in two main blocks, as shown in Figure 2. The input data were the images and the 3D point cloud acquired by the MMS. The images were processed by a first block where DL architectures (YOLOv3 and ResNet152) were used to solve the detection and classification of traffic signs. This information, together with the calibration data from the camera and the LiDAR system, was used to extract the geographical position of the detected signs in the geolocation block. In this section, each of these blocks will be described in detail, as well as the calibration process of the camera and the LiDAR system.

Deep Learning Architecture
The Deep Learning part of the workflow consists of two different architectures: YOLOv3 and ResNet152. First, the input image is fed into the YOLOv3 network, which returns a new image in which traffic signs have been identified and classified into six different classes. This output image is then cropped into different images, each containing an individual sign. Each of these images is used as an input to the ResNet152 architecture, which further classifies these signs into different subclasses (Figure 3).
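The two-stage pipeline described above can be sketched as follows. This is a minimal illustration with stand-in functions in place of the trained YOLOv3 and ResNet152 networks; the detections, class names and subclass labels are invented for the sketch.

```python
import numpy as np

# Hypothetical stand-ins for the trained networks (the paper uses YOLOv3 for
# detection into six coarse classes and ResNet152 for subclass refinement).
def detect_signs(image):
    """Stand-in detector: returns (x, y, w, h, coarse_class) tuples."""
    return [(120, 40, 32, 32, "prohibition"), (300, 60, 28, 28, "danger")]

def classify_subclass(crop, coarse_class):
    """Stand-in classifier refining e.g. 'prohibition' into a subclass."""
    return coarse_class + "/subclass_0"

def hierarchical_recognition(image):
    results = []
    for (x, y, w, h, coarse) in detect_signs(image):
        crop = image[y:y + h, x:x + w]  # crop each detected sign
        # Only obligation/prohibition signs get the second-stage classifier
        if coarse in ("obligation", "prohibition"):
            label = classify_subclass(crop, coarse)
        else:
            label = coarse
        results.append(((x, y, w, h), label))
    return results

image = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy frame
print(hierarchical_recognition(image))
```

The cropping-and-routing logic is the part specific to the hierarchical schema; the stand-in functions would be replaced by actual network inference.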





YOLOv3 is a deep learning model used for object detection and classification [22]. This architecture uses Convolutional Neural Networks (CNNs) to provide object detection at three different scales that are then merged to produce the output (Figure 4). This output consists of a series of bounding boxes along with the recognized classes. The ResNet152 architecture is a deep learning model used for image classification [34]. It consists of a Deep Residual Network of up to 152 convolutional layers. The residual network is constructed by adding identity connections between the layers, which adds information from the input of each layer to its output, allowing deeper networks to obtain state-of-the-art performance.
A large image dataset was used to train the Deep Learning architectures employed in this paper, with more than 56 thousand traffic sign images. Following the hierarchical classification schema, the YOLOv3 architecture was trained with 56,111 images, belonging to six traffic sign classes: stop, yield, no entry, obligation, prohibition, and danger. For the case of obligation and prohibition, where a larger number of images was available, 27,308 images were used to train the ResNet152 architecture, to recognize 11 subclasses of prohibition signs and 14 subclasses of obligation signs, as depicted in Figure 5. Training parameters are listed in Table 1.

Camera-LiDAR Calibration
The output data of the AI block consists of, on the one hand, images with bounding boxes corresponding to the signal detections, and, on the other hand, a text file including the path of each image together with a text string representing the classified signal, and the coordinates of its bounding box in the image coordinate system as a vector (x, y, w, h), where (x,y) are the pixel coordinates of the upper left corner of the bounding box and (w,h) are its width and height in pixel units.
This process alone is not sufficient for an accurate inventory of vertical signs, as the geographical location of the detected signs is not yet available at this stage. However, with the support of a 3D point cloud, this location can be obtained relatively easily if both 2D and 3D data are temporally and spatially synchronized.
The time synchronization is done from the GNSS of the MMS. Both the camera and the LiDAR store the Time of Week (TOW) from the GNSS: each image, and each point of the 3D point cloud have an associated timestamp, and both are synchronized with each other, so that it is possible to obtain point cloud data of a photographed scene given the time stamp of the acquisition of any image.
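As an illustration of how the shared Time of Week can be used, the vehicle pose at an arbitrary image timestamp can be interpolated from the 1 Hz trajectory. The values below are invented, and orientation interpolation is omitted for brevity (angles would need care near wrap-around).

```python
import numpy as np

# Trajectory poses arrive at 1 Hz (GNSS TOW seconds); each image carries its
# own TOW stamp. Linear interpolation gives the position at image time.
traj_t = np.array([100.0, 101.0, 102.0])              # TOW [s], illustrative
traj_xyz = np.array([[0.0, 0.0, 0.0],
                     [20.0, 1.0, 0.1],
                     [40.0, 2.0, 0.2]])               # UTM offsets [m]

def pose_at(t):
    """Linearly interpolate the 1 Hz trajectory at image timestamp t."""
    return np.array([np.interp(t, traj_t, traj_xyz[:, i]) for i in range(3)])

print(pose_at(100.5))   # halfway between the first two poses
```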
Spatial synchronization is more complex to implement. The objective is to know the geometric transformation needed to convert a three-dimensional coordinate in the point cloud to its corresponding two-dimensional coordinate in the image, and vice versa. Thus, it is important to define the coordinate systems that play a role in this step of the process (Figure 6):
• Global coordinate system [X_w, Y_w, Z_w]: coordinates go North and East from an origin that depends on the UTM zone.
• Vehicle coordinate system [X_v, Y_v, Z_v]: It has its origin in the centre of navigation of the MMS, which coincides with the inertial measurement unit (IMU). The position and orientation of this system with respect to the global coordinate system is given in the vehicle trajectory file, with a frequency of 1 Hz (i.e., one point per second) corresponding to the GNSS, including for each point the 3D position and the three orientation angles (roll, pitch, and yaw) coming from the IMU.
• Sensor coordinate system [X_s, Y_s, Z_s]: It has its origin in the LiDAR sensor. The point cloud is initially registered in this coordinate system, and then transformed to the global coordinate system during its pre-processing. The transformation with respect to the vehicle coordinate system is given by the MMS calibration sheet.
Infrastructures 2022, 7, 133
• Camera coordinate system [X_c, Y_c, Z_c]: It defines the position of the optical centre and the orientation of the camera. As it was mounted on the vehicle without being initially related to the MMS, the transformation between this reference system and that of the sensor was unknown. Therefore, the spatial synchronization problem can be solved if this transformation is known, as well as the internal parameters of the camera.
For the calibration of the transformation matrix between the sensor and camera coordinate systems (T_sc), as well as for the calibration of the intrinsic parameters of the camera, a checkerboard was used. It had 9 × 12 squares, and the side of each square measured 65 mm. The data collection for the calibration consisted of the simultaneous acquisition of images and point clouds of the checkerboard. Nine pairs of point clouds and images were obtained for calibration (Figure 7). First, the intrinsic parameters of the camera were calibrated using the images. All the parameters, including distortion coefficients, were estimated simultaneously using nonlinear least-squares minimization [35,36]. These parameters were internal and fixed to the camera employed, and they included:
• Focal length, f: Distance between the optical centre of the camera (origin of the camera coordinate system) and the sensor along the optical axis.

• Intrinsic matrix, M: Transformation between the 3D camera coordinates and the 2D image coordinates.
Camera principal point (c_x, c_y) (intersection of the optical axis and the sensor) and focal length were embedded in this matrix (Equation (1)).
• Coefficients for radial distortion (k_1, k_2): They describe the lens distortion as a deviation from an ideal projection following the Gaussian model (Equation (2)), as a function of the distance r to the camera principal point.
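The displayed forms of the two referenced equations can be written in the standard pinhole formulation consistent with the parameters just listed (focal length f, principal point (c_x, c_y), radial coefficients k_1, k_2); this is a reconstruction, not necessarily the paper's exact notation:

```latex
M =
\begin{pmatrix}
f & 0 & c_x \\
0 & f & c_y \\
0 & 0 & 1
\end{pmatrix}
\tag{1}

\begin{pmatrix} x_d \\ y_d \end{pmatrix} =
\left(1 + k_1 r^2 + k_2 r^4\right)
\begin{pmatrix} x \\ y \end{pmatrix},
\qquad r^2 = x^2 + y^2
\tag{2}
```

Here (x, y) are ideal (undistorted) normalized image coordinates and (x_d, y_d) the distorted coordinates actually observed on the sensor.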
The results of this process, as well as the calibration error, will be shown and discussed in Section 3.2.
Then, the transformation matrix T sc was computed to perform LiDAR-camera data fusion. From a set of images and point clouds of the same scene it was possible to estimate the geometric relationship between the LiDAR and camera coordinate systems. For this purpose, the checkerboard was used. First, the checkerboard corners in the images and the checkerboard plane in the point cloud were extracted. Then, from a minimum of four image-point cloud pairs, this geometric information was used to obtain the transformation matrix T sc . The process is defined in detail in [37], and the results and calibration errors are shown in Section 3.2.
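The extrinsic calibration procedure itself is defined in [37]; as an illustration of the kind of rigid-registration step involved in estimating T_sc from 3D correspondences, the generic SVD-based (Kabsch) alignment can be sketched as follows. This is a standard method shown for intuition, not the paper's exact procedure, and the synthetic data is invented.

```python
import numpy as np

def rigid_align(P, Q):
    """Find R, t minimizing sum ||R @ P_i + t - Q_i||^2 (Kabsch method)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)            # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# Synthetic check: rotate/translate a point set and recover the transform.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
t_true = np.array([0.5, -0.2, 1.0])
Q = P @ R_true.T + t_true
R, t = rigid_align(P, Q)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```

In practice the correspondences come from the checkerboard corners and planes extracted from the image-point cloud pairs, as described in the text.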


Geolocation and Inventory Visualization
The geolocation workflow for a single image with a traffic sign detected within the AI block is summarized in Figure 8. First, the input data is described:
• Image: 2D image that contains a traffic sign, as detected by the AI block.
Point cloud data selection. To perform the 3D point projection on the image, it is convenient to select the part of the point cloud around the scene in the image, as the complete point cloud covers a much larger area than the image and the process would otherwise be computationally expensive. This can be done in a simple way from the time synchronization, by selecting those points whose time stamp is within ±2 s of the time stamp of the image on which to project the 3D points.
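The ±2 s selection can be sketched with a simple timestamp mask (the coordinates and stamps below are illustrative):

```python
import numpy as np

# Keep only points whose GNSS Time of Week is within +/- window seconds of
# the image's time stamp, so that projection works on a small slice of the
# full point cloud.
def select_points(points_xyz, points_tow, image_tow, window=2.0):
    mask = np.abs(points_tow - image_tow) <= window
    return points_xyz[mask]

pts = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
tow = np.array([10.0, 11.5, 13.0, 20.0])
print(select_points(pts, tow, image_tow=12.0))
# keeps the points stamped between 10.0 s and 14.0 s (the first three)
```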
3D points projection. To carry out the projection of the point cloud on the image, it is necessary to know all the transformations that have been defined in Figure 7, to describe all the points in LiDAR sensor coordinates and then, from the transformation matrix T_sc and the intrinsic parameters of the camera, to define the image coordinates of a 3D point. The transformation between the global coordinate system and the vehicle coordinate system is defined by Equations (3) and (4), where t_wv = (x, y, z)_v^T is the position of the vehicle coordinate system, and (φ, θ, ψ)_v the orientation of the vehicle for the image that is being processed. The first term in Equation (3) refers to the transformation between the global coordinate system orientation and the reference for the vehicle coordinate system (which is a right-up-backwards coordinate system).
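The chain of transformations described above can be sketched as follows. This is a simplified illustration with invented numbers: in the actual workflow the matrices come from the trajectory file, the MMS calibration sheet (T_vs) and the camera-LiDAR calibration (T_sc).

```python
import numpy as np

def hom(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def project(point_w, T_world_to_sensor, T_sensor_to_cam, M):
    """Take a world point to camera coordinates, then apply the pinhole model."""
    p = np.append(point_w, 1.0)
    p_cam = T_sensor_to_cam @ T_world_to_sensor @ p   # world -> sensor -> camera
    x, y, z = p_cam[:3]
    u = M @ np.array([x / z, y / z, 1.0])             # pinhole projection
    return u[:2]

# Assumed intrinsics and identity extrinsics, purely for the sketch.
M = np.array([[1000.0, 0, 320], [0, 1000.0, 240], [0, 0, 1]])
T_ws = hom(np.eye(3), np.zeros(3))    # composed world->sensor transform
T_sc = hom(np.eye(3), np.zeros(3))    # sensor->camera extrinsics
print(project(np.array([1.0, 0.5, 5.0]), T_ws, T_sc, M))
```

Lens distortion (Equation (2)) is omitted here; it would be applied to the normalized coordinates before multiplying by M.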
Then, the transformation T_vs can be directly applied, as it is given by the calibration sheet of the MMS. The transformation matrix T_ws = T_wv * T_vs can finally be employed to express the point cloud coordinates in the sensor coordinate system and project the 3D point cloud onto the image, using the camera-LiDAR extrinsic parameters (T_sc) and the camera intrinsic parameters. The projection, as shown in Figure 9, allows the points of the cloud projected on the image to be visualized, as well as colour to be assigned to the point cloud from the RGB information of the image.
3D traffic sign geolocation. At this point, it is possible to extract the precise position of a road sign from its detection in an image. From the bounding box of the detection and the projection of the point cloud on the image, it is possible to extract the 3D points that are projected inside the bounding box. In addition to the traffic sign, points belonging to the terrain or to other objects behind it may also be projected inside the bounding box. To filter out these points, a Euclidean clustering of all points projected inside the bounding box is computed, and the traffic sign is considered to be the cluster of points closest to the sensor. It is also possible that the same sign is detected in several images, which would result in almost identical positions for several signs in the inventory. To avoid this multiplicity, the position and semantic information corresponding to the bounding box with the largest area among all detections is selected. Finally, a table with the traffic sign inventory is exported, containing the path of the image where the traffic sign was detected, its bounding box, its semantic information, and the centroid of the 3D points from the point cloud that are projected into the bounding box.
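The two filters described above, nearest-cluster selection and largest-bounding-box de-duplication, can be sketched as follows. A greedy single-linkage clustering stands in for the Euclidean clustering; the distance threshold and all data are illustrative.

```python
import numpy as np

def euclidean_clusters(points, tol=1.0):
    """Greedy single-linkage clustering: a point joins the first cluster
    containing a point within tol, otherwise it starts a new cluster."""
    clusters = []
    for p in points:
        for c in clusters:
            if min(np.linalg.norm(p - q) for q in c) <= tol:
                c.append(p)
                break
        else:
            clusters.append([p])
    return [np.array(c) for c in clusters]

def sign_position(points_in_bbox, sensor_origin):
    """Keep the cluster closest to the sensor; its centroid is the sign."""
    clusters = euclidean_clusters(points_in_bbox)
    nearest = min(clusters,
                  key=lambda c: np.linalg.norm(c.mean(axis=0) - sensor_origin))
    return nearest.mean(axis=0)

# Sign points near the sensor, plus terrain points far behind the sign.
sign = np.array([[5.0, 0, 2], [5.1, 0, 2.1], [5.0, 0.1, 2.0]])
terrain = np.array([[30.0, 0, 0], [30.5, 0, 0]])
pos = sign_position(np.vstack([sign, terrain]), sensor_origin=np.zeros(3))
print(pos)

# De-duplication: among detections of the same sign, keep the largest bbox.
detections = [((10, 10, 30, 30), "stop"), ((12, 11, 40, 40), "stop")]
best = max(detections, key=lambda d: d[0][2] * d[0][3])
print(best)
```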
Visualization
One of the major drawbacks of carrying out asset inventories in road infrastructure is their 3D visualization. The large number of points that are captured makes it difficult to visualize the 3D point cloud data with traditional viewers on standard hardware.
For that reason, a 3D viewer was developed based on the Potree architecture. Potree is a 3D viewer powered by WebGL using three.js which runs in a web browser (it works as a web service) and is able to visualize massive point clouds. Potree mainly consists of two parts: Potree Converter and Potree Renderer [38].
Potree Converter is a tool used to convert a point cloud into the multi-resolution octree required by the Potree Renderer. To generate the octree, first the minimum distance between points at its root level is defined. This distance is called spacing. Each subsequent level halves this value, increasing the resolution. This parameter can be defined by the user, or a default value can be computed from the CAABB (cubic axis-aligned bounding box) of the point cloud. Next, a second parameter is defined, which indicates the number of levels of the octree. In this way, a hierarchical data structure is built, with level of detail (LoD) selection and view frustum culling capabilities.
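The spacing hierarchy described above can be sketched as follows; the way the default root spacing is derived from the bounding-box diagonal (and the divisor used) is an illustrative assumption, not Potree's exact rule:

```python
import math

def default_spacing(bbox_min, bbox_max, divisor=250.0):
    """Hypothetical default root spacing derived from the bounding-box diagonal.

    The divisor is an illustrative assumption, not Potree's exact constant.
    """
    diag = math.dist(bbox_min, bbox_max)
    return diag / divisor

def spacing_at_level(root_spacing, level):
    """Each octree level halves the spacing of its parent, doubling the resolution."""
    return root_spacing / (2 ** level)
```

For example, a root spacing of 1 m gives 0.5 m at level 1, 0.25 m at level 2, and so on, so deeper nodes carry denser subsets of the cloud.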
Potree Renderer is the 3D viewer that renders the multi-resolution octree generated by Potree Converter, rendering only the nodes inside the visible region and favouring those close to the viewer position. Therefore, only the relevant points are loaded in memory for a given viewer position.
For visualizing the road signs, using the table exported in the previous section and a set of placeholder images for each class of traffic sign, a texture with the corresponding sign image is drawn on an otherwise invisible surface whose normal vector is perpendicular to the road, placed at the geolocated position exported in the table.
As shown in Figure 10, the surface of the texture is above the road so that it is always visible.
Drawing the road signs this way yields better performance because all the calculations are performed on the GPU, whereas other approaches may involve both CPU and GPU calculations, making the process much slower.
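The billboard placement described above can be sketched as follows; the quad construction, the vector names and the parameters are hypothetical, since the actual rendering is performed in the Potree/three.js viewer:

```python
import numpy as np

def billboard_corners(centroid, up, width, height, view_dir):
    """Corners of a textured quad standing at `centroid`, facing `view_dir`.

    `up` is the road-normal direction. All names and parameters are illustrative
    assumptions; in practice the quad is raised above the road so it stays visible.
    """
    up = up / np.linalg.norm(up)
    right = np.cross(view_dir, up)          # in-plane horizontal axis of the quad
    right /= np.linalg.norm(right)
    c = np.asarray(centroid, dtype=float)
    hw = width / 2.0
    return np.array([
        c - right * hw,                     # bottom-left
        c + right * hw,                     # bottom-right
        c + right * hw + up * height,       # top-right
        c - right * hw + up * height,       # top-left
    ])
```

Texturing a quad built this way keeps the work on the GPU side, matching the performance argument above.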

Traffic Sign Detection and Recognition
The traffic sign recognition approach from Section 2.2.1 was validated against a manual inventory of the traffic signs in the road section acquired with the MMS, as introduced in Section 2.1. It was a 6 km stretch of a regional road in the province of Ávila (Spain). The acquisition was performed in both directions, so traffic signs on both sides of the road were visible.
Results for traffic sign detection are shown in Table 2. It shows the precision, recall and F-score of the traffic sign detection step, as well as the mean IoU (intersection over union) for the bounding boxes of the true positives. As the precision metric shows, the number of false positives is considerably high. However, this was mainly because the traffic sign detection algorithm was able to detect traffic signs which were extremely challenging and had been omitted by the manual labeller, as the traffic sign covered too few pixels for the bounding box to be significant (Figure 11). Thus, the number of false positives may not be relevant in this context, as long as there is at least one detection of each traffic sign, which allows an accurate and reliable inventory.
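The metrics reported in Table 2 follow the standard definitions, which can be sketched as:

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def detection_metrics(tp, fp, fn):
    """Precision, recall and F-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

A detection that was not annotated in the ground truth counts as a false positive, which is why the challenging, unlabelled signs mentioned above lower precision without actually harming the inventory.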

In Table 3, a confusion matrix shows the traffic sign recognition results in the first hierarchical step, where seven global classes were defined, following Figure 5. This shows that traffic sign recognition in this first stage had accurate results: from a total of 99 true positive detections, 96 were correctly recognized. However, it should be noted that the class distribution in the validation dataset did not contain enough data for certain types of signs to draw robust conclusions. Finally, the prohibition subclasses were analysed, as the validation dataset did not present enough mandatory signs to perform any further analysis. Table 4 shows the results. Note that the number of signs differs from the 47 prohibition signs in Table 3. That is because the DL architecture was not trained for some of the prohibition signs in the dataset, and they were omitted to obtain the results.
Table 4. Prohibition subclasses analysis.

Number of Recognized Prohibition Signs: 34
Number of Correct Predictions: 28

Camera-LiDAR Calibration and Data Geolocation
The camera-LiDAR calibration process was carried out with the Lidar Camera Calibrator tool in Matlab. It uses a set of images and point clouds where a checkerboard is visible. On the one hand, this serves for the calibration of the intrinsic parameters of the camera (Table 5); on the other hand, from the detection of the plane of the checkerboard in both the image and the point cloud over several image-point cloud pairs, the transformation matrix between the coordinate system of the LiDAR sensor and that of the camera, T_sc, is obtained, together with the errors of the calibration process (Table 6).

To validate the data geolocation process, the positions of the traffic signs obtained with this method were compared with a manually annotated ground truth. Note that only traffic signs belonging to classes that could be detected by the DL architecture were considered, and the ground truth had a total of 47 traffic signs. Table 7 shows the results of this validation. The first row of the table indicates the total number of signs that were detected, recognized or geolocated, respectively. The second row indicates the percentage of signs that were processed correctly at each stage. That is, the percentages of recognition and geolocation were calculated over the number of detected images, and the percentage of subclass recognition was calculated over the number of detected signs of the subclasses for which the DL architecture was trained.

Data Visualization
Finally, the generation time of the 3D visualization, together with the corresponding textures of the classified road signs, was computed. The results are shown in Table 8: with a point cloud of more than 40 million points, the octree generation and visualization took less than half a minute on a standard computer (Intel i5-3570 processor, 32 GB of RAM, and an NVIDIA Quadro 2000 GPU). Figure 12 shows the visual result.

Discussion
The results obtained raise several questions for discussion in this section. First, the geolocation of road signs depends on their correct detection in the images. For those signs for which the deep learning model has not been trained, no detection, and therefore no geolocation, will be obtained, which has a negative impact on the inventory process. However, it should be noted that the methodology is proposed as a combination of image-based and 3D point cloud information, so this disadvantage could be overcome by combining the results of an image-based detection, as in this methodology, with those of a 3D point cloud detection based on the intensity parameter and the geometry of the point cloud. The combination of both types of detection would allow the inventory of uncommon traffic signs, for which a dataset large enough to train a classification model that includes them is not available.
Exploiting the 3D point cloud data to improve the output of this methodology could bring further benefits to the overall performance of the inventory process. While this work focuses on the calibration of the camera-LiDAR system and the projection from the image to the point cloud, extracting geometric information from the 3D point cloud to enrich the inventory would be straightforward. Geometric measurements of the sign, or its orientation (towards which direction it points), are parameters that can be extracted from the point cloud and that would improve the level of information output by the method.
Finally, it is relevant to discuss the possible causes of error in the geolocation of signs that were correctly detected with the DL architecture. As can be seen in Figure 7, the checkerboard used for calibration had to be placed on the right-hand side of the image to enable overlap with the LiDAR beams. This means that, although the reprojection errors were acceptable, they were smaller in the part of the image where the checkerboard was placed. This implies that, for signs appearing on the left side of the image (which will also be further from the MMS than a traffic sign on the right side of the image), it is possible that the reprojection will not be performed correctly and it will not be possible to geolocate the sign. This disadvantage could be solved in several ways: on the one hand, by redesigning the placement of the sensors so that the signs are captured more optimally, with the camera pointing to the right in the forward direction of the vehicle and positioned in such a way that the LiDAR beams overlap with the central part of the image; on the other hand, by taking pictures more frequently, to avoid cases where there is no optimal view of a sign in any image. In any case, the results obtained with this system allow us to ensure that the methodology of calibration and fusion of 2D and 3D data is valid for the inventory of traffic signs.


Conclusions
This paper presents a method for the automatic recognition and geolocation of vertical traffic signs, to support the inventory work performed on road infrastructure. The input data are laser point clouds and images acquired with a low-cost mobile mapping system. In particular, the LiDAR employed is a Phoenix Scout Ultra 32, equipped with a Velodyne VLP-32C laser scanner. Data for this work were collected on a regional road in Ávila (Spain). Furthermore, the method was validated against ground truth recorded in the field by an operator, finding that it can reliably detect, recognize and geolocate vertical traffic signs. With the proposed method, it is possible to generate an automatic inventory of vertical traffic signs which can be visualized and managed in a WebGIS viewer based on the Potree architecture. Therefore, this method can save resources in road maintenance works, since producing an inventory of all the vertical traffic signs with a manual operator requires a lot of time. With this approach, an automatic inventory of vertical traffic signs can be obtained, and the data can be collected at conventional driving speeds, with no need for maintenance staff to take measurements outside the vehicle (reducing risks and road closures). Future steps should focus on refining the different methodological building blocks and on broadening the types of infrastructure assets that can be geolocated with this methodology.