Novel approach to automatic traffic sign inventory based on mobile mapping system data and deep learning

Abstract: Traffic signs are a key element in driver safety. Governments invest a great amount of resources in maintaining traffic signs in good condition, for which a correct inventory is necessary. This work presents a novel method for mapping traffic signs based on data acquired with a Mobile Mapping System (MMS): images and point clouds. On the one hand, images are faster to process, and artificial intelligence techniques, specifically Convolutional Neural Networks, are more optimized for them than for point clouds. On the other hand, point clouds allow more exact positioning than the exclusive use of images. First, traffic signs are detected in the images obtained by the 360° camera of the MMS through RetinaNet, and they are classified by their corresponding InceptionV3 network. The signs are then positioned in the georeferenced point cloud by means of a projection from the images according to the pinhole model. Finally, duplicate geolocated signs detected in multiple images are filtered. The method has been tested in two real case studies with 214 signs, where 89.7% of the signs have been correctly detected, of which 92.5% have been correctly classified and 97.5% have been located with an error of less than 0.5 m. The false positive rate per image is only 0.004. This sequence, which uses images for detection and classification and point clouds for georeferencing, in this order, optimizes processing time and allows the method to be included in a company's production process. The method runs automatically and takes advantage of the strengths of each data type.


Introduction
Communication and mobility of people and goods are key elements of modern societies and developing countries. Economic growth depends heavily on, and is closely related to, transport networks. Infrastructures such as ports (maritime and river), airports, railways, highways, and roads are among the most relevant transport systems to guarantee people's quality of life. This relevance is well known by the EU; proof of that are the substantial national and EU funds spent on transport infrastructures every year [1]. These policies are developed based on annual budgets dedicated to new project construction and maintenance of existing infrastructures. For traffic sign (TS) inventory, current technology provides mainly two sources of data: 3-D georeferenced point clouds acquired through Mobile Laser Scanning (MLS) techniques, and digital images from a still camera or frames extracted from video. 3-D point-cloud data contain precise information on the 3-D location and geometrical properties of the TS, as well as intensity. However, the resolution of most MMS techniques under normal use is not accurate enough to recognise all TS classes. Images are used to overcome that weakness, as they contain visual properties, despite their lack of spatial information. Since the objective in automated traffic sign inventory is to accurately determine the placement in global coordinates and the specific type of each traffic sign on the road, point clouds and images are complementary [20][21][22].
For TS inventory to be automated, four main steps are required: traffic sign detection (TSD), segmentation (TSS), recognition or classification (TSR), and TS 3-D location. TSD aims to identify regions of interest (ROI) and boundaries of traffic signs. In TSS, a segment corresponding to the object is separated from the set of input data. TSR consists of determining the meaning of the traffic sign. Meanwhile, TS 3-D location deals with estimating the 3-D position and orientation, or pose, of the TS. A variety of approaches for these steps have been proposed in the literature directly or indirectly related to TS inventory.
One group of these approaches defines techniques focused on detecting and segmenting the set of points with spatial information of the TS from 3-D laser scanner point clouds. These techniques are based on the a priori knowledge of 3-D location, geometrical and/or retro-reflective properties. All approaches are conditioned by the huge amount of information contained in point clouds (see, for instance, [23][24][25][26][27][28][29][30]). With the aim of accurate TSR, aforementioned approaches combine point clouds with images to extract features. As a previous step to TSR, segmented points can be projected onto the corresponding 2-D image in the traffic-sign-mapping (TSM) step. A review of methods for TSR in point-cloud and image approaches can be found in [31].
These types of techniques based on TSD in 3-D point cloud and TSR in image are accurate and reliable for TS inventory. However, they entail high time and computational costs, mainly for the TSD and TSS steps. As an alternative, images can be used not only for TSR but also for TSD without making use of the 3-D point cloud. Some authors have used TSD in image for coarse segmentation of the 3-D point cloud [32,33].
TSD, TSS, and TSR in image, which become TSDR, have been extensively studied for TS inventory as well as for other applications such as advanced driver assistance systems (ADAS). The vast variety of techniques proposed by the computer-vision community have been reviewed and compared, detailing advantages and drawbacks, in [34][35][36][37][38]. Recently, Wali et al. [39] provided a comprehensive survey on vision-based TSDR systems.
According to them, in TSDR image-based techniques, detection consists of finding the TS bounding box, while recognition involves classification by giving an image a label. Common TSD methods are: colour-based, on different colour spaces, i.e., RGB, CIELab, and HSI [40]; shape-based, such as Hough Transform (HT) and Distance Transform (DT); texture-based, such as Local Binary Patterns (LBP) [41]; and hybrid. With these methods, a feature vector is extracted from the image with lower computational cost than from a 3-D point cloud. Then, the class label of the feature vector is obtained using a classifier such as Support Vector Machine (SVM) or with Deep Learning-based (DL) methods [42][43][44]. Among the latter, Convolutional Neural Networks (CNN) have been widely adopted, given their high performance in both TSD and TSR in images [45][46][47][48] and in point clouds [49].
Regarding TS inventory, TSDR in images requires TS 3-D location to be completed. TS 3-D location after TSDR has been considered by several authors in image-based 3-D reconstruction approaches without making use of a 3-D point cloud. These techniques require prior accurate camera calibration and removal of perspective distortion. In [50], 3-D localization is based on the epipolar geometry of multiple images, while Hazelhoff et al. [51] calculated the position of the object from panoramic images referenced to the northern direction and the horizon. Balali et al. [52] built a point cloud by photogrammetry techniques using a three-parameter pinhole model. Wang et al. [53] used stereo vision and triangulation techniques. In [54], 3-D reconstruction is conducted by epipolar geometry, taking into account the geometric shape of the TS.
While TSDR in images has proven high performance, reconstruction models for TS 3-D location from images are surpassed in precision by 3-D point-cloud-based location. However, little research has paid attention to techniques for TS inventory that jointly take advantage of TSDR in images and TS 3-D location from the 3-D MLS point cloud. In [32], a method combining DL with retro-reflective properties is developed for TS extraction and location from a point cloud. In [55], TS candidates are detected on images based on colour and shape features and filtered with point-cloud information. Most authors use point clouds for TSD and images only for TSR [23][24][25][26][28][29].
In contrast to other approaches, in this work a data flow is implemented to minimize processing times by taking advantage of each type of data. First, images are used for TSD and TSR. Image processing is faster than point-cloud processing and allows the application of DL techniques, which are currently the state of the art. In addition, the design of a modular workflow allows each network to be replaced in the future as its success rates increase. To maximize correct TS identification, different networks are used for TSD and TSR, unlike other works that use the same network to detect and classify, see also [44,48]. After image processing, point clouds are used to filter out multiple TS detections and false positives. Point clouds allow more precise geolocation than the use of epipolar geometry of multiple images. Point clouds are not used for detection and classification since:

• DL point-cloud processing techniques are computationally more expensive than their equivalents in image processing.

• The addition of point-cloud data to images increases processing times.

• TSs are not always in good condition to be detected by their reflectivity, as other authors have proposed [23,56].

• The low point density does not provide useful information for TSR.

Method
The method consists of four main processes. First, TSs are detected in images. Second, detected TSs are recognized. Third, TSs are 3-D geolocated by projecting the detected signs onto the point cloud. Fourth, multiple detections of the same sign in different images are filtered. The input data of the method are MMS data: images from a 360° camera, point clouds, GPS-IMU positioning data, and camera calibration data. Figure 1 shows the workflow of the method.
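The four processes can be summarized in a high-level sketch. The function names, their signatures, and the Detection container below are purely illustrative, not the authors' implementation; the sketch only mirrors the order of the workflow described above.

```python
from dataclasses import dataclass

# Hypothetical container for one traffic sign detection (illustrative).
@dataclass
class Detection:
    label: str        # shape class from TSD, later refined to a final class by TSR
    position: tuple   # 3-D georeferenced position (x, y, z), filled in by location


def inventory_pipeline(images, point_cloud, detect, recognize, locate, deduplicate):
    """Sketch of the four-step workflow: detect in images, recognize,
    project into the point cloud, then filter duplicate detections."""
    detections = []
    for image in images:
        for det in detect(image):           # 1. TSD on each cube-side image
            det = recognize(det, image)     # 2. TSR refines the class label
            det = locate(det, point_cloud)  # 3. 3-D location via pinhole projection
            detections.append(det)
    return deduplicate(detections)          # 4. redundant TS filtering
```

The modularity argued for above falls out naturally: any of the four callables can be swapped for a better network or algorithm without touching the rest of the flow.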

Traffic Sign Detection
TSD is based on object detection in images. No point-cloud data are used at this stage, to speed up the detection process. The input images are acquired with a 360° RGB camera mounted on the MMS during acquisition. The panoramic image is converted and rectified into six images oriented according to the sides of a cube. Images in the trajectory direction (I_T) provide TS information in front of the MMS. Images in the opposite direction (I_To) provide TS information behind the MMS, either in the same lane or in different lanes. Lateral images (I_⊥T) are perpendicular to the trajectory direction and provide information about signs located to the sides of the MMS; they are particularly relevant for detecting no-parking or no-entry signs. The images forming the top and bottom of the cube are not relevant for TSD: bottom images are occupied by the camera support, and TSs that could be detected in top images are already detected in the front images I_T.
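As an illustration of the cube-side rectification, the following sketch samples one vertical cube face from an equirectangular panorama. It is a minimal nearest-neighbour version under assumed axis conventions, not necessarily the rectification performed by the MMS software; yaw_deg = 0 corresponds to I_T, 180 to I_To, and ±90 to the lateral faces I_⊥T.

```python
import numpy as np

def cube_face_from_equirect(pano, face_size, yaw_deg):
    """Sample one vertical cube face from an equirectangular panorama.
    Nearest-neighbour lookup; production code would interpolate."""
    h, w = pano.shape[:2]
    u = np.linspace(-1.0, 1.0, face_size)        # across the face, left to right
    v = np.linspace(-1.0, 1.0, face_size)        # across the face, top to bottom
    uu, vv = np.meshgrid(u, v)
    yaw = np.deg2rad(yaw_deg)
    fx, fy = np.cos(yaw), np.sin(yaw)            # forward vector of the face
    rx, ry = -np.sin(yaw), np.cos(yaw)           # right vector of the face
    dx = fx + uu * rx                            # per-pixel ray direction
    dy = fy + uu * ry
    dz = -vv                                     # image v grows downward
    lon = np.arctan2(dy, dx)                     # longitude in [-pi, pi]
    lat = np.arctan2(dz, np.hypot(dx, dy))       # latitude in [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * (w - 1)).round().astype(int)
    py = ((0.5 - lat / np.pi) * (h - 1)).round().astype(int)
    return pano[py, px]
```

Each face pixel is mapped to a viewing ray, converted to spherical coordinates, and looked up in the panorama; repeating this for the four yaw values yields the vertical cube sides used for TSD.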
The object detector implemented in this method is RetinaNet [57]. This detector was chosen because it is state of the art in standard accuracy metrics, memory consumption, and running times. RetinaNet is a one-stage detector that behaves well with unbalanced classes and in images with a high density of objects at several scales, key factors for traffic sign detection. RetinaNet uses ResNet [58] as its basic feature extractor; in this work, ResNet-50 is used.
The RetinaNet detector is applied to each cube-side image I of the set acquired with the MMS during acquisition. As a result, an array S(l, I_x, I_y, w, h) is obtained for each detected TS, where l indicates the label, I_x and I_y indicate the top-left corner position of the bounding box, w indicates the TS width, and h indicates the TS height. In order to obtain maximum classification accuracy, the number of classes has been reduced to coincide with the shapes of traffic signs. The classes for detection with RetinaNet are five: yield, stop, triangular, circular, and square (Figure 1). In the recognition phase (Section 3.2) these classes are further classified.
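As a sketch, raw one-stage detector output (boxes, scores, integer labels) might be packed into the S(l, I_x, I_y, w, h) arrays as follows; the confidence threshold is illustrative, since the paper does not specify one, and the class order is assumed.

```python
import numpy as np

# The five shape classes used for detection (order assumed for illustration).
SHAPE_CLASSES = ["yield", "stop", "triangular", "circular", "square"]

def to_sign_arrays(boxes, scores, labels, score_thresh=0.5):
    """Convert detector output with boxes as [x1, y1, x2, y2] into
    S = (label, Ix, Iy, w, h) tuples, keeping confident detections only."""
    signs = []
    for (x1, y1, x2, y2), score, label in zip(boxes, scores, labels):
        if score < score_thresh:
            continue  # discard low-confidence boxes
        signs.append((SHAPE_CLASSES[label], x1, y1, x2 - x1, y2 - y1))
    return signs
```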


Traffic Sign Recognition
In this phase, TSs previously detected by their shape are classified with their final label. For some TSs, the shape coincides with the final class, as in the case of stop signs (octagonal) and yield signs (inverted triangle). TSs of obligation (circular), recommendation-information (square), and danger (triangular) encompass multiple classes that must be classified for a correct inventory. For each of these three sign shapes, an InceptionV3 network [59] has been trained and implemented. The InceptionV3 network needs input samples of fixed size 299 × 299 × 3 pixels, so the bounding-box images of detected signs S are resized to fit the network input.
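The crop-and-resize step can be sketched as follows; a dependency-free nearest-neighbour resize stands in here for whatever interpolation the authors used, and the function name is illustrative.

```python
import numpy as np

def prepare_for_inception(image, sign, size=299):
    """Crop the detected bounding box S = (label, Ix, Iy, w, h) from a
    cube-side image and resize it to the fixed 299 x 299 x 3 input of
    InceptionV3 via nearest-neighbour index mapping."""
    label, ix, iy, w, h = sign
    crop = image[iy:iy + h, ix:ix + w]
    # Map each output row/column back to a source row/column.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]
```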

Traffic Sign 3-D Location
The projection of TSs detected in images onto the georeferenced point cloud is done using the pinhole model [60]. While in other works the four vertices of the detection polygon have been projected, in this work only the central TS point S_c is projected. This saves processing time and minimizes calibration error. Another alternative would be detecting the pole, either directly or after TS detection. Pole detection would mean more precise positioning, but it has the following limitations: (1) poles may not have enough points to be easily detected, (2) some TSs share a pole, and (3) some TSs are located on traffic lights, light posts, or buildings, so specific detection methods would be needed for each case. The projection follows Equation (1):

s S_c = K [R|t] P_s, (1)

where s is the scalability factor; S_c = [u, v, 1]^T is the centre of the traffic sign S detected in an image I, with u = I_x + w/2 and v = I_y − h/2; K is the intrinsic camera parameter matrix provided by the manufacturer; [R|t] is the extrinsic camera parameter matrix; and P_s = [X, Y, Z, 1]^T is the centre of the traffic sign in the point cloud.

The rotation and translation matrix [R|t] positions the camera in the same coordinate system as the point cloud, which is already georeferenced. [R|t] is formed by two rotation-translation matrices. The first relates the positioning of the pixels to the image through a calibration performed prior to the implementation of the method; once this matrix is obtained for one image, it is valid for all images acquired with the same equipment. The calibration is done by manually selecting four pairs of corresponding pixels in the image and points in the point cloud. The second matrix positions the camera at the optical centre C of each image I.

The TS points in the point cloud form a plane π. The TS is located at the intersection between the projection line r⃗, following the pinhole model, and the plane π (Figure 2): P_s = r⃗ ∩ π.
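The back-projection and line–plane intersection can be sketched as follows, assuming the intrinsic matrix K and the extrinsics R, t are already available; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def locate_sign(s_c, K, R, t, plane_point, plane_normal):
    """Back-project the sign centre S_c = [u, v, 1]^T through the pinhole
    model and intersect the resulting 3-D ray with the plane fitted to the
    TS points (P_s = ray ∩ plane)."""
    C = -R.T @ t                           # camera optical centre in world frame
    d = R.T @ np.linalg.inv(K) @ s_c       # ray direction in world frame
    n = plane_normal
    lam = n @ (plane_point - C) / (n @ d)  # ray parameter where it meets the plane
    return C + lam * d                     # 3-D sign position P_s
```

With the ray leaving the optical centre C = −Rᵀt, the intersection parameter comes directly from the plane equation n·(P − p0) = 0.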
In order to reduce processing time, a region of interest (ROI) is delimited in the point cloud to calculate possible planes (Figure 3). First, points located at a distance greater than d from the MMS location at the time the image was taken are discarded. TSs distant from the MMS are considered to have too low a point density for correct location; they are, in any case, detected in successive images captured nearer the MMS. Second, points located at a distance greater than r from the projection line r⃗ are discarded. Third, points not located in the image orientation are discarded, since TSs detected in an image cannot lie in the point cloud in a different orientation. For the remaining points, planes are detected in order to discard points not lying on planes. Since TSs are planar elements, plane estimation avoids false locations due to noise points crossing the projection line r⃗.
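The three ROI filters above can be sketched with NumPy; the defaults d = 15 m and r = 2 m match the parameter values used in the case studies, while the names and the "in front of the camera" test for image orientation are illustrative assumptions.

```python
import numpy as np

def roi_points(points, mms_pos, ray_origin, ray_dir, d=15.0, r=2.0):
    """Delimit the ROI before plane estimation: (1) drop points farther
    than d from the MMS position at capture time, (2) drop points farther
    than r from the projection line, (3) keep only points in the image
    orientation (here: ahead of the camera along the ray)."""
    near = np.linalg.norm(points - mms_pos, axis=1) <= d
    u = ray_dir / np.linalg.norm(ray_dir)
    rel = points - ray_origin
    # Perpendicular distance of each point to the projection line.
    dist_line = np.linalg.norm(rel - np.outer(rel @ u, u), axis=1)
    ahead = rel @ u > 0
    return points[near & (dist_line <= r) & ahead]
```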

Redundant Traffic Sign Filtering
Since the same TS can be detected in multiple images, multiple detections of the same TS must be simplified. The filtering is done with the information of the classified TS, because one post can hold TSs of different classes. TSs of the same class grouped within a radius smaller than f are eliminated, leaving only the first one detected (Figure 4).
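A minimal sketch of this greedy filtering, assuming each detection carries its final class label and 3-D position (the representation and names are illustrative):

```python
import numpy as np

def filter_redundant(signs, f=1.0):
    """Drop repeated detections of the same TS: a sign whose class matches
    an already-kept sign within radius f is discarded, keeping the first
    one detected. Each sign is a (class_label, xyz_position) pair."""
    kept = []
    for label, pos in signs:
        pos = np.asarray(pos, dtype=float)
        duplicate = any(
            label == kept_label and np.linalg.norm(pos - kept_pos) < f
            for kept_label, kept_pos in kept
        )
        if not duplicate:
            kept.append((label, pos))
    return kept
```

Because filtering is class-aware, two different signs sharing one post (e.g. a speed limit under a danger sign) are both preserved even though their positions nearly coincide.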

Equipment and Parameters
The MMS equipment used for this work consisted of a Lynx Mobile Mapper, with a Ladybug5 360° camera and a GPS-IMU Applanix POS LV 520. The cube-images had a resolution of 2448 × 2024 pixels and were captured every 5 m along the MMS trajectory. The point cloud was a continuous acquisition over time. The values of parameters d and r delimiting the ROI were set to 15 m and 2 m, respectively. The value of parameter f was set to 1 m in order to simplify duplicate signs.
For the RetinaNet training, 9500 labelled images with 12,036 TSs, obtained by the 360° camera, were used. The training of the InceptionV3 networks was carried out with data sets from Belgium [50] and Germany [61], and images of Spanish traffic signs. The whole process (training and testing in real case studies) was executed on a computer with an Intel i7 6700 CPU, 32 GB RAM, and an Nvidia 1080ti GPU. The code combined TensorFlow-Python for TSD and TSR and C++ for 3-D location and filtering.

Case Studies
The methodology was tested in two real case studies: two secondary roads located in Galicia (Spain), designated EP9701 and EP9703. The EP9701 case study was 9.2 km long; its point cloud contained 350 million points and was acquired along with 7392 images. The EP9703 case study was 5.5 km long; its point cloud contained 180 million points and was acquired along with 4520 images. Both roads were located in rural areas where houses, fields, and wooded areas were interspersed. The roads had frequent crossings and curves. The signposting of both roads was abundant and in good condition, with few samples damaged or partially occluded. The case studies were processed in 30 and 20 min, respectively.
The acquisition was performed in the central hours of the day (to minimize shadows) and on a sunny day without fog, so as not to affect visibility. The MLS maintained a constant driving speed of approximately 50 km/h, although this speed was reduced when complying with traffic rules at intersections or traffic lights. Point density increased as the driving speed decreased: it was estimated that points in the acquisition direction were 1 cm closer for every 10 km/h of speed reduction.

Results
TS accounting was done manually by reviewing the acquired images, the detected and classified signs, and their locations in the point cloud. Table 1 shows the image count for each case study.

TSs were correctly detected at 89.7%, while 10.3% were not detected. The use of the 360° camera and the cube-images made it possible to locate TSs in the opposite and lateral directions to the MMS movement. Some of the undetected TSs were partially occluded or were eliminated in the redundant TS filtering process (Section 3.4), since they were traffic signs of the same class separated by less than the distance f. Figure 6 shows examples of detected TSs.

A high percentage of false detections was counted (19.6%). Of these, traffic mirrors represented 81.8% and 37% of false detections in case studies 1 and 2, respectively. Mirrors have a circular shape surrounded by a red ring, so they were detected as false circular signs. The use of the point cloud was considered to eliminate these false positives, since mirrors should not return points due to their high reflectivity; however, in the case studies the mirrors did contain points because of dirt or deterioration, and no characteristics were found that differentiate mirror points from TS points. The remaining false detections corresponded to different objects on roadsides. Figure 7 shows some examples of false detections.

Duplicate TSs were not filtered when they were incorrectly positioned by the TS 3-D location process (Section 3.3). In the input cube-side images, 382 TSs were detected in case study 1 and 441 in case study 2. After the 3-D localization and redundant filtering processes, the set of detections was reduced to 113 and 137 TSs, respectively. Duplicated TSs were 4.2% of the total.

The positioning of a TS was based on the georeferenced point cloud, where the authors assumed that the location of the point cloud corresponded precisely to reality, as in [23]. TSs positioned on the correct TS point cloud were considered correctly located (0 m error). A total of 97.5% of the detected TSs corresponded to points belonging to TSs (Figure 8). Only five TSs were positioned with an error of between 0.5 m and 8 m from the real location of the sign; these TSs, not correctly positioned on the corresponding TS point clouds, were manually measured from their incorrectly detected location to the real TS location in the point cloud.

With regard to sign recognition, 92.5% of the detected TSs were correctly classified, both in good condition and partially erased. Since the methodology was tested in real case studies, it was not possible to test all the classes present in the training data. The main classes in the case studies were TSs for dangerous curves, speed bumps, no overtaking, and speed limits. To a lesser extent, there were also yield, stop, roundabout, no-entry, roadworks, and pedestrian-crossing signs. No significant confusion was detected among classes; classification errors were isolated and corresponded to the training results.

Discussion
In general, most TSs were detected and positioned correctly, although the algorithm showed a tendency to over-detection. This behaviour was chosen to facilitate monitoring by a human operator: in a correction process, it was considered easier to eliminate false detections than to check all input data for undetected signs. In terms of false positives per image, the false detection rate was low, 0.004 FP/image, compared to [54], where rates of 0.32 and 0.07 were reached in the cases with better-resolution images. Regarding undetected TSs (false negatives), the neural network missed 10.3% of all TSs, which is similar to other artificial intelligence works: 10% in [51] and 11% in [28], but far from the best of the state of the art: 6% in [62], based on laser scanner intensity; 5% in [32], based on combining two neural networks; and 4% in [29], based on bag-of-visual-phrases.
The authors are aware that the detection success rate was not as high as in other applications using RetinaNet [63]. This was due to the relatively small size of the data set for TSD and the great variability of elements in the road environment. Generating a data set for detection is a costly task and was not the final objective of this work, which focused on presenting a methodology composed of a series of processes to inventory signs, not on optimizing the success rates of Deep Learning networks such as RetinaNet and InceptionV3.
The methodology did not reach detection rates as high as reference works in TSD and TSR, such as [50,52], although it is worth mentioning that the latter classifies TSs grouped by type. By contrast, the proposed methodology is adaptable for mapping different objects, as it does not depend on exclusive TS features. In particular, by not using reflectivity, it was possible to detect TSs whose reflectivity had diminished due to the passage of time or incorrect maintenance. Although Deep Learning techniques do not explain exactly why false detections occur, it is possible to intuit the underlying problem, and they allow continuous improvement: the training database can be updated with new samples that, in this case, may be the wrong detections once corrected. In this way, the algorithm will learn to avoid them.
The combination of images for TSD and TSR with point clouds for TS 3-D location allowed precise positioning of 97.5% of detected TSs on points belonging to TS point clouds, which was not reached by other works based exclusively on the epipolar geometry of multiple images, such as [50], which achieved positioning with 26 cm of average error, [53] with 1–3 m of average error using dual cameras, and [64] with 3.6 m of average error using Google Street View images.
While point clouds provide valuable information for locating objects, they also require much more processing time than images. The methodology designed in [23] for TSD and TSR in point clouds was implemented in the two case studies; its processing times reached 45 and 30 min, respectively, a 50% increase over performing TSD and TSR on images and 3-D location in point clouds, as proposed in this work. No relation was found between inventory quality and driving speed changes during acquisition. The work maintained a driving acquisition speed similar to other point-cloud mapping works.

Conclusions
In this work, a methodology for the automatic inventory of road traffic signs was presented. The methodology consists of four main processes: traffic sign detection (TSD), recognition (TSR), 3-D location, and filtering. For the TSD and TSR phases, cube-images acquired with a 360° camera were processed by Deep Learning techniques. Five shapes of traffic signs were detected in the cube-side images (stop, yield, triangular, circular, and square) applying RetinaNet. Since the stop and yield shapes each correspond to only one TS class, an InceptionV3 network was trained for each of the other shapes in order to recognize their respective classes. For the 3-D location and filtering phases, the georeferenced point cloud of the environment was used. TSs detected in the images were projected onto the cloud using the pinhole model for correct 3-D geolocation. Finally, duplicate signs detected in different images were filtered based on class coincidence and the distance between them. The methodology was tested in two real case studies with a total of 214 TSs: 89.7% of the TSs were correctly detected, of which 92.5% were correctly classified. The false positive rate per image was only 0.004, and the main false detections were due to road mirrors. 97.5% of the detected signs were correctly 3-D geolocated with less than 0.5 m of error.
The effectiveness of combining image and point-cloud data was demonstrated in this work. Images allow the use of artificial intelligence techniques for detection and classification, whose success rates improve day by day with new networks and designs. In addition, image processing is much faster and more efficient than point-cloud processing. The use of a 360° camera avoids having to drive the MMS in both road directions. Furthermore, point clouds allow a more precise geolocation of signs than using images alone.
The overall TS inventory process, processing images first and the point cloud afterwards, ensures speed and effectiveness: it is 50% faster than other proposals that first treat point clouds and then images, with much higher computational costs. Although such proposals provide satisfactory success rates, their time and equipment costs make their inclusion in production processes unfeasible. Due to these advantages, the presented methodology is suitable for inclusion in the production process of any company. It is also conducted automatically, without human intervention.
Future work will focus on extending the methodology to other objects important for road safety and inventory, as the methodology does not depend on any exclusive TS feature. In addition, it is proposed to retrain the networks with corrected images presenting the main types of error in order to improve detection success rates. It is also planned to test the methodology in other case studies, such as highways and urban roads, and to analyse the influence of driving speed during acquisition on 3-D point-cloud location.
Funding: This research was funded by the Xunta de Galicia through human resources grants (ED481B-2019-061, ED481D 2019/020) and competitive reference groups (ED431C 2016-038), and by the Ministerio de Ciencia, Innovación y Universidades - Gobierno de España (RTI2018-095893-B-C21). This project has also received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769255. This document reflects only the views of the authors. Neither the Innovation and Networks Executive Agency (INEA) nor the European Commission is in any way responsible for any use that may be made of the information it contains. The statements made herein are solely the responsibility of the authors.