Reliability of Historical Car Data for Operating Speed Analysis along Road Networks

Abstract: In recent years, innovative progress in information and communication technology (ICT) has introduced new sources for traffic data collection and analysis. On-board sensors like GPS-GPRS boxes, generally installed for insurance purposes, communicate information from circulating vehicles to data centers. Geographic location, date and time, and vehicles' speed and direction are systematically transmitted and stored as Historical Car Data (HCD) from probe vehicles in the traffic stream. These databases provide a good opportunity to analyze the vehicles' motion both in the temporal and spatial domains. The aim of this study is to assess the reliability of this kind of data gathering. Since instrumented vehicles account for a small percentage of the entire vehicle fleet, it is important to understand if they can be considered as a sample representative of the whole population. The paper presents a comparison of speed data obtained from HCD with the ones recorded by inductive-loop detectors and microwave radar sensors; the performed analysis required the definition of specific methodologies and procedures. The obtained results show a high correspondence between the two sets of data. Therefore, HCD can be proposed for the detailed monitoring of, and studies on, the operating conditions of mobility along road networks.


Introduction
Operating speeds are one of the crucial aspects for the management, monitoring and analysis of a road network; therefore, their assessment is an interesting and constantly evolving research topic [1,2]. In order to define traffic operating conditions based on actual speed data, different traffic measurement devices have been used over the years, which can be divided into two groups: static and dynamic ones.
Among the static devices, there are traditional traffic detectors, such as manual counting, as well as more advanced ones such as road detection stations, automatic traffic counters, video cameras, radar and laser guns, and point-based sensors, like microwave radars and acoustic sensors [3][4][5][6][7][8][9][10]. These methods have the valuable ability to sample the entire traffic flow, but they cannot guarantee full coverage of the road network, as this would require a significant economic investment. To overcome this limitation, innovative methods of collecting speed data have been tested through the use of vehicles equipped with GPS devices, capable of recording and sending travel information to operation centers with a high sampling rate [11][12][13][14][15][16]. This new data source allows researchers to observe real driver behavior and develop detailed models of operating speeds.
In contrast to punctual and isolated measurements, continuous speed data are useful for creating and studying actual driver speed profiles, in relation to the behavior adopted in different territorial contexts [17,18].
Recent innovative progress in the field of information and communication technologies (ICT) has further expanded and improved the connection between road users and the mobile network, simplifying the acquisition and exchange of a large amount of georeferenced data [19]. Specifically, the increasing use of mobile digital devices equipped with GPS, such as smartphones, tablets and black boxes mounted on vehicles for insurance purposes, generates a large and complex set of data, which are known as Big Data. It is therefore essential to define new tools and new methods able to store, manage and process them, through the definition of data-mining models [20][21][22][23][24].
Sci 2022, 4, 18
The wide-scale collection of georeferenced data from road vehicles, known as Floating Car Data (FCD) if processed in real time or Historical Car Data (HCD) if collected and analyzed in different periods, represents one of the traffic monitoring tools today. It is a very effective and low-cost instrument for the study and evaluation of road traffic conditions. Its increasing use in road mobility analyses is due to the simplicity of the data acquisition method, as each vehicle anonymously sends data relating to the geolocation, speed, direction and date and travel time to a processing center. In particular, the FCD/HCD represent an alternative solution to measuring speeds, travel times and performances of the mobility system along a single road or the road network.
Since the vehicles which transmit data are free to move anywhere on the road network, they may be considered as floating probes in the traffic flow. Several studies have shown the potential of such data to investigate aspects such as the prediction of drivers' travel routes, the proper perception of traffic signals by road users, the estimation of traffic conditions and the evaluation of consistency between drivers' behavior and theoretical design speeds in the various road elements [13,15,22,25][26][27][28].
Although probe vehicle data provide a large amount of information and represent an excellent technology to analyze the traffic conditions and to manage road mobility, they have the disadvantage of achieving a relatively low penetration rate. In fact, various studies state that vehicles equipped with GPS account for around 2-5% of the entire vehicle fleet [15,22,29,30], although in the last few years this percentage has rapidly increased.
In order to conduct correct analyses of operating speeds, it is necessary to ensure that the data from the probe vehicles is adequately representative of the entire fleet, in terms of the number and type of vehicles. In recent years, several studies have tried to verify the reliability of these data samples, relying mainly on examining the accuracy of the FCD/HCD travel times and speeds' estimation by comparing them with data recorded by point-based sensors; the assessments were carried out with regard to different categories of infrastructures and vehicles [31,32].
One of the earliest studies investigated the benefits and limitations of FCD technology by analyzing a particular case study that compared data obtained from a taxi-FCD fleet with data from license plate recognition (LPR). Travel times' evaluation indicated that the taxi-FCD system could provide good information about travel times, but it was not sufficient for direct, sole application, due to the limited data available and the high variation in data coverage. However, it became a valuable tool for integrating the data collected by local sensors [33].
Subsequently, for the Beijing highways, which showed a high penetration rate of FCD, a regression analysis was performed between the data detected by the Remote Traffic Microwave Sensors (RTMS) and the speeds of the FCD, obtaining a high correlation between the two data samples with R² equal to 0.97 [34].
Other studies have then assessed the speed quality of the dynamic surveys, comparing the measurements of the probe vehicles with the speed measurements of radar sensors. Good results were generally obtained: in one case a regression analysis showed a non-linear relationship with a correlation coefficient of 0.82 [35], while in another study it was observed that radar sensors tended to measure slower speeds than loop detectors during periods of free flow conditions [36]. In the latter case, it was found that radar measurements are generally good, but several aspects were identified that should be considered before their implementation. Among these aspects, which can influence the quality of the measurements, are: lag, different data biases during the two phases of free and congested traffic, vulnerability to rainfall and sensitivity to the device mounting angle.
On the other hand, FCD are certainly an important data source as they guarantee a higher spatial coverage and lower costs than the detectors installed. However, the accuracy and representativeness of the FCD sample is closely related to the number of probe vehicles and the quality of GPS geolocation and data transmission. Consequently, data from multiple sources are often managed and analyzed, i.e., static and floating sensors, because when used together they are capable of improving the knowledge of traffic status and of reducing the uncertainty of individual sources [14,37].
In this paper, in accordance with the scientific literature, the reliability of the HCD sample has been evaluated by directly comparing its information with the data recorded by inductive-loop detectors and microwave radar sensors, in terms of speed distribution. The obtained results show a high correlation between the two data sources. A good representativeness of this type of speed data is demonstrated, despite the relatively low penetration rate of FCD.

Data and Methods
The analysis of the operating speed is an important topic to study in order to improve the knowledge of the operating conditions of a road network, as it allows the observation of the relationships between road users' actual behavior and the characteristics of the infrastructure. Especially in road safety analyses, the operating speed is used to observe how it can be correlated with the frequency and severity of accidents along a route [38,39].
Generally, according to scientific literature, the operating speed is assumed to be the 85th percentile of the distribution of the actual speeds practiced by users [40,41]. Therefore, it is necessary that the data source adopted is as reliable as possible, in order to obtain operating speed profiles that effectively represent the actual trend. In the past, most operational speed models were based on speed data collected on specific road sections, through static acquisition methods, able to provide data only in the temporal domain. Subsequently, attempts were made to overcome this limitation by introducing and experimenting with innovative methods, in particular by collecting speed data from vehicles equipped with GPS systems. This new methodology allowed the researchers to observe the actual behavior of drivers and to develop more effective operating speed models, so providing a more accurate representation of the real phenomenon [42][43][44][45][46][47][48][49]. It has to be noted that the data from the point-based sensors refer only to the road sections in which they are installed, providing information only in the temporal domain. The probe vehicles, on the other hand, allow the observation of the data in the temporal and spatial domains, providing information on mobility for the entire route. Despite the advantages related to this new method of data collection, the reliability of speed data obtained from "floating cars" needs to be verified, due to the notable lack of instrumented vehicles when compared to the entire fleet.
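As a minimal illustration of the 85th-percentile definition recalled above, the following sketch computes an operating speed from a list of spot speeds. The function names, the sample speeds and the choice of linear interpolation between order statistics are assumptions made for illustration; the paper does not state which percentile convention was adopted.

```python
import math


def percentile(values, p):
    """Return the p-th percentile (0-100) by linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def operating_speed(speeds_kmh):
    """Operating speed taken as the 85th percentile of the observed speeds."""
    return percentile(speeds_kmh, 85)


# Hypothetical spot speeds in km/h, not taken from the study.
speeds = [62, 70, 75, 78, 80, 82, 85, 88, 90, 95, 101]
v85 = operating_speed(speeds)
print(f"V85 = {v85:.1f} km/h")
```

Other interpolation rules give slightly different values on small samples; on samples as large as those described here the difference is expected to be negligible.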
In order to investigate this aspect, in this research the performed analyses were focused on identifying a correspondence between the HCD obtained from a sample of instrumented vehicles and the data returned by one of the classic traffic detection methods. In particular, it was decided to carry out a comparison of the HCD with the data collected by the Automatic Statistical Traffic Detection System of the Italian national road network manager, ANAS SpA. The Traffic Observatory is a structure that the company mainly uses to provide traffic data and information to users; all the sensors send their data to a central Platform for Monitoring and Analysis, called PANAMA, and the reliability of the acquired data is ensured by a series of control procedures [50].
In particular, the study analyzed the data that came from the measurement sections placed along the two-lane rural roads of the Veneto Region, managed by the ANAS Company, as shown in Figure 1.
A detailed traffic survey was available for each control unit. Among the available information, the variables employed for the analysis are shown in Table 1. In detail, Time Reference is the date and time of the data acquisition, Lane represents the lane on which the vehicle was traveling, Direction indicates the direction of travel, Speed (km/h) is the value of the speed recorded by the control unit and Vehicle Class is a code assigned to identify the different types of vehicles (i.e., 1-9 classes). The sample of Historical Car Data, on the other hand, was acquired from a commercial operator who has GPS data coming anonymously from over four million black boxes mounted on passenger cars and heavy vehicles, in addition to those generated by 1.5 million Apps downloaded to customers' smartphones [51].
The actual speeds were not extracted from the Big Data in real time; instead, the HCD were used and subsequently processed by means of relational database management software, which makes it possible to store and manage large amounts of data. Table 2 shows the most important information that can be extracted from the dataset for the proposed study: Identification code; Longitude (in WGS 84 coordinates); Latitude (in WGS 84 coordinates); Direction, the vehicle's travel direction expressed as an azimuth angle; Speed; Date and Time of the signal emission; Signal Quality; Vehicle ID, a code assigned by the collection center to each individual vehicle during its travel; Vehicle Type. The two data sources (point-based sensors and HCD) provided information for three months, namely August 2018, February 2019 and May 2019, within the Veneto Region. Specifically, for the time period and the location analyzed, the HCD sample amounts to almost one billion items of data, as shown in Table 3.
A study of the statistical reliability of the data detected by the measurement sections was preliminarily performed. First, the traffic flows recorded by the control units were divided into the two main travel directions, characterized by the lane (1 or 2) and direction (ascendant or descendant) combinations "1A" and "2D". Subsequently, the overtaking situations were identified by the combinations "1D" and "2A".
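The lane and direction combinations described above can be sketched as a simple lookup. The record layout (lane, direction, speed) and the sample values are hypothetical; only the codes "1A", "2D", "1D" and "2A" come from the text.

```python
from collections import Counter

# Lane-direction codes as described in the text.
MAIN_FLOW = {"1A", "2D"}    # regular travel in the two main directions
OVERTAKING = {"1D", "2A"}   # vehicle detected on the opposite lane


def classify(lane, direction):
    """Label a sensor record as 'main', 'overtaking' or 'unknown'."""
    code = f"{lane}{direction}"
    if code in MAIN_FLOW:
        return "main"
    if code in OVERTAKING:
        return "overtaking"
    return "unknown"


# Hypothetical records: (lane, direction, speed in km/h).
records = [(1, "A", 78.0), (2, "D", 81.5), (1, "D", 96.0), (2, "A", 102.0), (1, "A", 74.5)]
counts = Counter(classify(lane, direction) for lane, direction, _ in records)
print(dict(counts))
```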
The raw data show a trend similar to the Gaussian distribution, as it is generally recognized in the literature [52]. As can be seen from Figure 2, the average value is located near the center of the distribution and the trend is mainly symmetrical to it. Furthermore, most of the surveys are located around the mean value, specifically it is noted that 75% of the speed values are in the range (µ − σ, µ + σ), in terms of mean square deviations, and that 95% of the speed values are in the range (µ − 2σ, µ + 2σ).
These trends confirm that the experimental data can be represented by a statistical model, whose parameters can be known and are shown, as an example, in Figure 2. Indeed, the trend is in agreement with the classical statistical modeling of traffic flows; the data sample does not present anomalies due to local or special phenomena. From the observation of the reliability and representativeness of this sample, it is possible to use them for more advanced studies, such as the one addressed in this study.
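The share of observations falling within one and two standard deviations of the mean, used above as a check of the Gaussian-like shape, can be computed as in the following sketch. The speed values are invented for illustration and the function name is an assumption. For reference, an exactly normal distribution places about 68% of values within (µ − σ, µ + σ) and about 95% within (µ − 2σ, µ + 2σ).

```python
import statistics


def share_within(speeds, k):
    """Fraction of speeds lying strictly inside (mu - k*sigma, mu + k*sigma)."""
    mu = statistics.mean(speeds)
    sigma = statistics.pstdev(speeds)  # population standard deviation
    inside = sum(1 for v in speeds if mu - k * sigma < v < mu + k * sigma)
    return inside / len(speeds)


# Hypothetical spot speeds in km/h, not taken from the study.
speeds = [60, 70, 75, 78, 80, 80, 82, 85, 90, 100]
print(f"within 1 sigma: {share_within(speeds, 1):.0%}")
print(f"within 2 sigma: {share_within(speeds, 2):.0%}")
```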
The HCD data sample contains all the data sent from the instrumented vehicles within the Veneto Region; therefore, data filtering operations were performed before assessing their reliability. The study started with the geometric reconstruction of the road alignments, using an automated, economical and rapid method capable of identifying the elements of existing road layouts through the georeferenced vertices of the road graph [53]. Once the horizontal alignment of the road layouts was known, a map-matching procedure was performed, by which the geographic coordinates of the vehicles were matched to the graph of each road [54][55][56][57][58][59][60].
The map matching was carried out using a programming code, with which the raw HCD near the examined road were first identified and then projected onto the curvilinear abscissa. However, since the GPS signal on board the vehicles could be affected by intrinsic or accidental errors, the position accuracy could be low; the area of investigation was therefore expanded to avoid eliminating from the analysis those vehicles that had a slightly inaccurate location compared to their actual position. At the end of the map-matching procedure, the point speed data extracted from the HCD were divided into the two main travel directions.
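The projection step of the map matching can be illustrated with a minimal sketch that finds the closest point of a road polyline to a GPS fix and returns its curvilinear abscissa. This is not the authors' code: it assumes that coordinates have already been converted from WGS 84 to a planar system in meters, and the function name, the road geometry and the 15 m search buffer are hypothetical.

```python
import math


def project_on_polyline(point, polyline, max_offset_m):
    """Return (abscissa_m, offset_m) of the closest projection of point onto polyline.

    Returns None when the point lies farther than max_offset_m from the road axis.
    """
    px, py = point
    best = None
    travelled = 0.0  # length of the segments already walked
    for (ax, ay), (bx, by) in zip(polyline, polyline[1:]):
        dx, dy = bx - ax, by - ay
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue  # skip duplicated vertices
        # Parameter of the orthogonal projection, clamped to the segment.
        t = ((px - ax) * dx + (py - ay) * dy) / (seg_len * seg_len)
        t = max(0.0, min(1.0, t))
        qx, qy = ax + t * dx, ay + t * dy
        offset = math.hypot(px - qx, py - qy)
        if best is None or offset < best[1]:
            best = (travelled + t * seg_len, offset)
        travelled += seg_len
    if best is None or best[1] > max_offset_m:
        return None
    return best


# Hypothetical road axis (planar coordinates in meters) and search buffer.
road = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0)]
BUFFER_M = 15.0
print(project_on_polyline((40.0, 6.0), road, BUFFER_M))
print(project_on_polyline((40.0, 30.0), road, BUFFER_M))
```

A wider buffer keeps fixes with a larger positioning error, at the cost of possibly including vehicles on adjacent roads.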
The two analyzed data samples differed in the domain in which they were defined. The point-based sensors provided a very detailed traffic survey in the temporal domain, at the fixed sections where they were installed. Instead, the HCD allowed the observation of a tiny sample of the vehicle fleet along the entire road network, providing data in both the temporal and spatial domains. For this reason, to carry out a direct comparison between the two data samples, the proposed analysis method involved a preliminary restriction of the spatial domain of the HCD, taking into consideration only the data located within 10 m ahead of and 10 m behind the control unit. This spatial interval accounts for the possibility that a vehicle recorded by the control unit emitted its signal not exactly at the measurement section, but a few meters before or after it. The HCD sample consists of temporal-spatial information with a sampling frequency of 1 Hz. This high sampling rate is sufficient to consider only the HCD located within 10 m ahead of and 10 m behind the control unit. If the frequency were lower, the spatial interval would have to be much larger, with the risk of a less accurate analysis because speeds change along the explored road stretch.
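The link between the 1 Hz sampling frequency and the 20 m window can be made concrete with a short sketch: a vehicle traveling at 72 km/h covers 20 m between two consecutive fixes, so a window of this size contains roughly one fix per passage at typical rural speeds. The function names, the station abscissa and the fixes are hypothetical; only the 10 m half-window and the 1 Hz frequency come from the text.

```python
def metres_between_fixes(speed_kmh, sampling_hz=1.0):
    """Distance covered between two consecutive GPS fixes at constant speed."""
    return speed_kmh * 1000.0 / 3600.0 / sampling_hz


def near_station(abscissa_m, station_m, half_window_m=10.0):
    """True when a map-matched fix lies within the window centered on the station."""
    return abs(abscissa_m - station_m) <= half_window_m


# Hypothetical station position and fixes (abscissa in m, speed in km/h)
# of one vehicle traveling at about 80 km/h and sampled at 1 Hz.
STATION_ABSCISSA_M = 1250.0
fixes = [(1221.0, 80.0), (1243.2, 80.0), (1265.4, 80.0)]
kept = [(s, v) for s, v in fixes if near_station(s, STATION_ABSCISSA_M)]

print(metres_between_fixes(72.0))  # spacing of the fixes at 72 km/h
print(kept)
```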
After the first filtering operation on the HCD sample, it was possible to evaluate a representative rate of the GPS data around each analyzed control unit. The rate was computed as the ratio between the HCD located near the control unit and the vehicles recorded by the control unit itself. Table 4 lists the values of this ratio between the probe vehicles and the entire vehicle flow passing through the sections where the point-based sensors are located, which are about 1-2‰, with an average value of 1.4‰. The statistical parameters of the GPS data set were studied through a preliminary histogram representation of the HCD speeds, as shown in Figure 3. The speed data again suggest a shape similar to that of a Gaussian distribution, even if the two diagrams show some irregularities due to the low amount of data in some speed classes. However, it should be noted that the statistical parameters of the distributions of both sensors and HCD are comparable. Therefore, it can be stated that a small sample of data (the HCD) can represent the behavior of the entire traffic flow.
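The representative rate defined above is a plain ratio expressed per mille, as the following sketch shows. The station names and the counts are invented for illustration and do not reproduce Table 4.

```python
def representative_rate_per_mille(hcd_count, sensor_count):
    """Ratio of HCD records near a station to vehicles counted by the station, in per mille."""
    if sensor_count <= 0:
        raise ValueError("sensor_count must be positive")
    return 1000.0 * hcd_count / sensor_count


# Hypothetical counts per station: (HCD records near the station, vehicles counted).
sections = {
    "station_A": (420, 300_000),
    "station_B": (250, 250_000),
    "station_C": (900, 450_000),
}
rates = {name: representative_rate_per_mille(h, s) for name, (h, s) in sections.items()}
average_rate = sum(rates.values()) / len(rates)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per mille")
print(f"average: {average_rate:.2f} per mille")
```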
Finally, in order to complete the overall evaluation of the reliability of the HCD and their correspondence with the data recorded by the control units, it was necessary to extend the analysis of the two samples to the temporal domain. The second filtering operation of the HCD sample identified the data from the probe vehicles in the same time frame defined by the traffic measurement stations. A methodology was therefore defined that made it possible to directly compare the speed data extracted from the two different sources, evaluating their coherence and correspondence through a linear regression. The reliability of the obtained result can thus be interpreted by means of the coefficient of determination R², which measures the strength of the linear relationship between the two variables compared, assuming a value between 0 and 1.
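The linear regression and the coefficient of determination mentioned above can be sketched with an ordinary least-squares fit. The symbols m and q follow the notation used in the paper for the angular coefficient and the intercept; the function name and the paired speeds are assumptions made for illustration.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = m*x + q; returns (m, q, r2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or ss_tot == 0.0:
        raise ValueError("x and y must not be constant")
    m = sxy / sxx
    q = mean_y - m * mean_x
    ss_res = sum((yi - (m * xi + q)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - ss_res / ss_tot
    return m, q, r2


# Hypothetical paired speeds in km/h: HCD on x, control unit on y.
hcd_speeds = [50.0, 60.0, 70.0, 80.0, 90.0]
sensor_speeds = [52.0, 64.0, 71.0, 85.0, 92.0]
m, q, r2 = linear_fit(hcd_speeds, sensor_speeds)
print(f"m = {m:.3f}, q = {q:.2f}, R2 = {r2:.3f}")
```

A slope close to 1 with a small intercept means that the points lie near the bisector, while R² close to 1 means that the scatter around the fitted line is small.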

Results
The reliability evaluation of the HCD has been performed by direct comparison with the traffic data recorded by a set of point-based sensors placed in the investigated road network. Due to the big difference between the two datasets, as the HCD sample provides space-time information while the control units only record data in the time domain, it has been necessary to properly filter the HCD.
Firstly, the HCD close to the measurement stations were identified and selected; then, as shown in Figure 4, the speed data extracted from the two different sources were overlaid in the same graph as a function of time. In Figure 4, the speed-time graph (on the left side) is flanked by the Gaussian distributions of the two data samples (on the right side).

From the graphs in Figure 4, the following observations can be deduced:
• Observing the Gaussian distributions, an overlap of the two trends can be noticed, which demonstrates that the two samples can be considered statistically coincident, as their main parameters are almost equal;
• The linear regression lines of the mean and standard deviation of the two data samples in the speed-time diagrams approximate the mean and standard deviation of the Gaussian distributions displayed in the diagrams on the right;
• The point cloud of the control units' data completely encloses the HCD values. As has been said before, the probe vehicles account for a small percentage of the vehicle fleet, which is, instead, completely detected by the fixed sensors;
• Fluctuations in the graphs, especially their peaks and troughs, are similar between the HCD and the point-based-sensor point cloud; therefore, a matching between the qualitative trends of the data samples can be observed;
• The linear regressions on the data show almost constant values as the lines are slightly inclined, with angular coefficients close to zero and intercept equal to the average speed;
• There is a minimal difference between the average speed values evaluated from the two data samples, which is generally around 3 km/h. These minor differences are probably caused by systematic errors, linked to sensor calibration defects or to errors caused by the relative angle between the signals emitted and received by the radar sensors and the vehicles' driving direction. The point-based sensors are in fact located on the sides of the carriageway and are not in line with the lanes;
• The two linear regression lines have been shifted vertically upward and downward by a value equal to the corresponding standard deviation. In this way, it can be observed that the data dispersion is almost coincident between the two samples and that the only difference is due to the systematic error;
• Most of the HCD fall into the band defined by the regression line shifted by ± σ, thus demonstrating the strong reliability of the sample, which is located around the average speed values.
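The ± σ band around the speed-time regression line described in the observations above can be sketched as follows. The code fits a line to speed as a function of time, shifts it by the standard deviation of the speeds, and counts the share of points inside the band. Names and data are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def fit_line(t, v):
    """Least-squares line v = m*t + q; returns (m, q)."""
    mean_t = statistics.mean(t)
    mean_v = statistics.mean(v)
    stt = sum((ti - mean_t) ** 2 for ti in t)
    stv = sum((ti - mean_t) * (vi - mean_v) for ti, vi in zip(t, v))
    m = stv / stt
    return m, mean_v - m * mean_t


def share_in_sigma_band(t, v):
    """Return (m, q, sigma, share of points within the regression line +/- sigma)."""
    m, q = fit_line(t, v)
    sigma = statistics.pstdev(v)
    inside = sum(1 for ti, vi in zip(t, v) if abs(vi - (m * ti + q)) <= sigma)
    return m, q, sigma, inside / len(v)


# Hypothetical observations: time in hours, speed in km/h.
times = [0, 1, 2, 3, 4, 5, 6, 7]
speeds = [78, 82, 75, 85, 80, 79, 95, 66]
m, q, sigma, share = share_in_sigma_band(times, speeds)
print(f"m = {m:.3f}, q = {q:.1f}, sigma = {sigma:.2f}, share in band = {share:.0%}")
```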
The two tables in Appendix A (Tables A1 and A2) show the results from all the measurement sections that were analyzed. The results indicate the values of the angular coefficient m, the intercept q of the regression lines, the average speed µ referred to each monitored month and the standard deviation σ of the speeds for both the data recorded by the control units and for the HCD.
A second data filtering has been performed on the HCD to identify the information emitted by the probe vehicles in a time interval coincident with the measurements made by the point-based sensors. The results have been plotted and are shown in Figure 5: the graphs show the HCD speeds on the x-axis and the speeds recorded by the control units on the y-axis.
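The temporal filtering described above can be sketched as a nearest-timestamp match between HCD fixes and sensor detections. The paper does not report the matching tolerance, so the 2 s value, the function name and the data below are assumptions made for illustration only.

```python
from bisect import bisect_left


def match_in_time(hcd, sensor, tolerance_s):
    """Pair each HCD fix with the nearest sensor detection in time.

    hcd and sensor are lists of (timestamp_s, speed_kmh); sensor must be
    sorted by timestamp. Returns a list of (hcd_speed, sensor_speed) pairs.
    """
    sensor_times = [t for t, _ in sensor]
    pairs = []
    for t, v in hcd:
        i = bisect_left(sensor_times, t)
        # Only the detections just before and just after t can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(sensor_times[k] - t))
        if abs(sensor_times[j] - t) <= tolerance_s:
            pairs.append((v, sensor[j][1]))
    return pairs


# Hypothetical records: (timestamp in s, speed in km/h).
sensor = [(100, 80.0), (105, 90.0), (130, 70.0)]
hcd = [(101, 78.0), (104, 88.0), (118, 60.0)]
pairs = match_in_time(hcd, sensor, tolerance_s=2)
print(pairs)
```

This simple version allows one detection to be paired with more than one fix; a stricter one-to-one assignment would need an additional check.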
As shown in Figure 5, the reliability assessment was performed considering the linear regression between the speed data from the two different data sources, and the following observations can be made:

• The very good correspondence between the data is proved by the points' arrangement and concentration along the diagram bisector: speed data acquired by the HCD have also been recorded by the control units, with minimal deviations;
• A regression line facilitates the graph readability: it immediately shows how much the dispersion of the recorded data approaches or deviates from the bisector, by means of its angular coefficient value;
• Therefore, the data correspondence is also readable through the coefficient of determination R², which is almost always close to 1;
• Minor deviations come from outlier points within the sample, generally related to very low speed values of the HCD. Probably these anomalies are related to vehicles, close to the control unit, performing maneuvers outside the carriageway. This is an intrinsic limit due to the characteristics of the GPS systems;
• The problem observed in the previous point cannot be completely ignored, but paying attention to the central zone of the graph, it is always found that most of the points are near the bisector, and it corresponds to the most plausible speeds assumed along the examined roads.
The two diagrams shown in Figure 6 summarize the results obtained, shown as a cumulative curve, with which it is possible to observe the percentage of cases with a specific value of R². The percentage distribution of the results shows that almost 80% of the analyzed comparisons return an R² greater than 0.8, demonstrating the reliability of the HCD sample and its application for more advanced monitoring studies.
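A cumulative curve of this kind, together with the share of comparisons above a given R² threshold, can be obtained as in the sketch below. The R² values are invented for illustration and do not reproduce the results in Figure 6; the function names are assumptions.

```python
def cumulative_curve(r2_values):
    """Return (r2, cumulative share of cases with a value <= r2), sorted by r2."""
    ordered = sorted(r2_values)
    n = len(ordered)
    return [(r, (i + 1) / n) for i, r in enumerate(ordered)]


def share_above(r2_values, threshold):
    """Fraction of comparisons whose R2 is strictly greater than threshold."""
    return sum(1 for r in r2_values if r > threshold) / len(r2_values)


# Hypothetical R2 values, one per measurement section and direction.
r2_values = [0.55, 0.72, 0.81, 0.85, 0.88, 0.90, 0.93, 0.95, 0.97, 0.99]
curve = cumulative_curve(r2_values)
print(f"share with R2 > 0.8: {share_above(r2_values, 0.8):.0%}")
```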
The two tables (Tables A3 and A4) presented in Appendix A, one for each driving direction, summarize the angular coefficient m, the intercept q of the regression lines, and the coefficient of determination values obtained for all the measurement sections.

Discussion
The proposed original method for evaluating the reliability of a sample of HCD, which compares speed values from probe vehicles and measurement stations, obtains good results. As can be seen in Tables A3 and A4, not all the results show a good correspondence between the two samples of data, since the angular coefficient m and the coefficient of determination R² are in some cases far from 1. Nevertheless, these results are acceptable considering the hypotheses and the aims of this study. First, it should be noted that the research does not attempt to establish an instantaneous comparison in time and space between the GPS data and the traffic measurements performed by the control units, but rather a correspondence of data both in statistical terms and in terms of actual vehicular speeds. Therefore, the study shows that the small sample of HCD can be identified as a reliable and representative data source for the driving population, even if a very high match is not achieved in all cases.
The high sampling rate of GPS data allowed an accurate analysis through the different filtering phases. The most important filtering operation was to consider in the analysis only the HCD emitted within 10 m ahead of and 10 m behind the control unit. Where HCD were recorded in mountainous environments, technical problems related to both poor satellite coverage and the quality of the GPS signal were found: the probe vehicles send little or incorrect information in that territory. Anomalous information can also be related to the location of some measurement sections along the infrastructure, for example near tunnels. A tunnel causes a loss of GPS signal and hence low data quality; consequently, such data may not be representative and have been excluded from the analysis. In Figure 5, the presence of outliers affects the final outcome of the analysis, resulting in R² values not close to 1. An examination of the proposed calculation method suggests that the outliers may have different causes. It is likely that some vehicles had a lower-than-average emission frequency and so did not register in the defined spatial and temporal interval around the control unit. As a result, the data recorded by the control unit did not find the corresponding vehicle in the filtered sample of the HCD. Also in relation to the emission frequency, a given GPS signal may be emitted inside the analyzed interval but towards its beginning: in this case the driver could change speed in the meantime, so that the control unit may have detected the same vehicle but at a different speed. Moreover, from an inspection of geographical maps, where the examined road section has another road running parallel and close to it, it is possible that the geolocalized data do not belong to the studied road, due to errors related to the GPS detection system.
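The spatial filter mentioned above, which keeps only the HCD emitted within 10 m of the control unit, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the great-circle (haversine) distance from the sensor rather than the distance measured along the road axis, and the record fields (`lat`, `lon`, `speed_kmh`), the function names, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def filter_near_sensor(records, sensor_lat, sensor_lon, max_dist_m=10.0):
    """Keep only the records emitted within `max_dist_m` of the sensor."""
    return [
        r for r in records
        if haversine_m(r["lat"], r["lon"], sensor_lat, sensor_lon) <= max_dist_m
    ]

# Hypothetical sensor position and HCD records (not the paper's data).
sensor_lat, sensor_lon = 45.0, 9.0
records = [
    {"lat": 45.00005, "lon": 9.0, "speed_kmh": 62.0},  # roughly 5.6 m away
    {"lat": 45.00020, "lon": 9.0, "speed_kmh": 71.0},  # roughly 22 m away
]
kept = filter_near_sensor(records, sensor_lat, sensor_lon)
print(len(kept), "record(s) kept")
```

A radial distance does not distinguish the studied road from a parallel road close to it, which is one of the error sources discussed above; a map-matched, along-road distance would be needed to address that case.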
In general, although less valid results in terms of the coefficient of determination have sometimes been found, in most cases the reliability of the HCD sample can be considered acceptable. In Figure 5, ignoring the outliers, the speed data are notably concentrated along the bisector, especially in the range between 50 and 80 km/h, which is indeed the range of operating speeds on the examined roads. The distribution of the point cloud therefore shows that the data obtained from probe vehicles are reliable and can be considered representative of the entire vehicle fleet, except for some individual cases that relate to the availability or quality of the HCD.

Conclusions
This paper presents a reliability study of Historical Car Data (HCD), evaluating their correlation, in terms of actual vehicular speeds, with the data recorded by sensors installed along existing roads. The analyzed sample showed that HCD can be used to describe operational variables of the entire vehicular flow, in terms of speed profile and the typical behavior of drivers.
The main objective of this study is to assess the reliability of this kind of data and to validate them, so that they can be used for the analyses needed in road safety studies and road network management. Although traffic conditions differ across the time periods of the day, a daily analysis of the data has not been developed because it was not a specific aim of the study. Another limitation is that, since environmental and traffic conditions differ between rural and urban roads, the research observed the operational parameters of traffic flow only under the unconstrained conditions that are typical of rural roads. Therefore, the general conclusions about the reliability of HCD drawn from this study are valid only for this kind of context.
The achieved results show a high HCD reliability and representativeness of the traffic flow, achieving R² > 0.9 in most cases. The statistical comparison of the GPS data with a commonly recognized source, such as the control units, supports the use of HCD samples for safety and management assessments on road networks, despite the fact that, to date, instrumented vehicles account for a small percentage of the entire vehicle fleet.
The sample of the control units captures a large proportion of the traffic flow, with the exception of those vehicles which, for example, enter or exit the surveyed road right before the sensor location; it is therefore a large sample that might almost represent the entire population. In contrast, the HCD sample is very small and, on average, accounts for 1.4‰ of the traffic surveyed by the sensor; however, in terms of statistical distributions and related variables, the two samples have a lot of similarities. As a result, both the control units and HCD can be used to represent key features of the whole population, like traffic flow characteristics and operating conditions. In particular, the control units provide complete information about the traffic stream and its sensitivity to certain variables, such as environmental conditions, differences between night and day hours, or the presence of heavy vehicles in the traffic flow. The HCD, instead, are useful to make data available about both the single driver's behavior and the vehicular path and speed along the entire route. In other words, the control units are especially useful for data about traffic patterns, while GPS data are suitable for monitoring the vehicular motion along the entire infrastructure. This conclusion agrees with the literature, where HCD have been proposed to study the prediction of drivers' travel routes, the proper perception of traffic signals by road users, the estimation of traffic conditions, and the consistency between drivers' behavior and theoretical design speeds in the various road elements.
More generally, the results obtained from this research support the thesis that Big Data, and specifically HCD, can be adopted as an essential data source to carry out important analyses on the operational conditions of road traffic. Furthermore, by exploiting the large amount of information associated with each type of data, it is possible to extend the survey by identifying many factors that influence the traffic conditions, even if the evaluation analyses require specific methodologies and procedures.
In particular, Historical Car Data allow the extraction of continuous operating speed profiles along infrastructures and networks, because of the data's geolocation along the entire road layout and the high sampling rate. This data source overcomes the main limitation of traditional traffic survey methods, in which the information is related only to the sections where the sensors are installed.
In light of the statistical reliability and trustworthiness of the HCD sample, the authors intend to extend the research by implementing innovative mobility analysis processes that exploit the quantitative and qualitative advantages offered by the HCD. In this way, a detailed and continuous monitoring of the actual operating conditions of road traffic can be performed.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to confidentiality issues and respect for privacy.

Acknowledgments:
We thank the national public company ANAS for making its Traffic Observatory information available and for the technical support.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A1. HCD and sensors angular coefficients m and intercepts q of the regression lines, average speeds µ, and standard deviations σ for each monitored month in Dir AB.