A Data Correction Algorithm for Low-Frequency Floating Car Data

The data collected by floating cars is an important source for lane-level map production. Compared with other data sources, floating car data is a low-cost but challenging basis for generating high-accuracy maps. In this paper, we propose a data correction algorithm for low-frequency floating car data. First, we preprocess the trajectory data with an adaptive density optimization method to remove noise points with large errors. Then, we match the trajectory data with OpenStreetMap (OSM) using an efficient hierarchical map matching algorithm. Lastly, we correct the floating car data with an OSM-based physical attraction model. Experiments are conducted on data collected by thousands of taxis over one week in Wuhan City, China. The results show that the accuracy of the data is improved and that the proposed algorithm is practical and effective.


Introduction
With the development of self-driving vehicles, lane-level maps have drawn much attention from researchers, internet firms, and carmakers. Currently, lane-level map generation methods mainly follow three approaches: an integrated navigation system (INS) with three-dimensional (3D) Lidar [1,2]; a vision-based approach [3]; and a floating car-based approach [4][5][6]. The precision of the map can reach the centimeter level using the Lidar approach, but its device and production costs are the highest of the three approaches. The vision-based approach captures the map information with a monocular or stereo camera; however, the quality of the map is strongly influenced by lighting conditions, and the production cost is also high. The cost of the map is low if the data is collected by floating cars. However, the positional accuracy of global positioning system (GPS) traces can only reach 5-30 m because GPS traces are prone to errors due to the multipath effect and the loss of satellite signals. Therefore, it is a challenging task to produce a map with floating car data.
In this paper, we propose a new data correction method for low-frequency floating car data. We developed an adaptive density optimization method that removes a fraction of the noise points by using a Delaunay triangulation network to construct clusters of points. OpenStreetMap (OSM) has become one of the most successful Volunteered Geographic Information (VGI) projects; it is free and has a range of applications. We therefore match GPS traces with the OSM map and correct the GPS traces with an OSM-based physical attraction model.

Problem Statement
In the floating car system, the location of a car is recorded by the GPS. A trajectory is a collection of GPS points arranged in a time sequence, T = {P_1, P_2, ..., P_n}, as shown in Figure 2. Each point P_i has attributes P_i = {x_i, y_i, t_i}, where (x_i, y_i) are the longitude and latitude of the point, respectively, t_i is the time at which the point was collected, and i is the ID of the GPS point.
The positional accuracy of GPS traces can reach 5-30 m, but the error increases when the satellite signals are obstructed by tall buildings, trees, and tunnels. Therefore, it is necessary to first remove the points with large errors. Even then, it is still hard to generate a lane-level map from this GPS data because some points appear on the opposite side of the road, which distorts the estimated lane positions. The aim of this paper is to correct the GPS points to the right location with a physical attraction model based on the OSM map information.
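The trajectory definition above can be mirrored in a small data structure. The following is a minimal sketch, not the authors' implementation: the names `GPSPoint` and `make_trajectory` are hypothetical, and representing the time t_i as seconds since an arbitrary epoch is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GPSPoint:
    x: float  # longitude in degrees
    y: float  # latitude in degrees
    t: float  # collection time; seconds since an arbitrary epoch (assumption)


def make_trajectory(points: List[GPSPoint]) -> List[GPSPoint]:
    """Return the points ordered by collection time, i.e. T = {P_1, ..., P_n}."""
    return sorted(points, key=lambda p: p.t)


# Three made-up points, deliberately out of time order.
raw = [
    GPSPoint(114.305, 30.593, 40.0),
    GPSPoint(114.304, 30.592, 0.0),
    GPSPoint(114.306, 30.594, 95.0),
]
trajectory = make_trajectory(raw)
```

In this representation the ID of a point is simply its position in the sorted list.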

Trajectory Preprocessing
To remove the noise points with a large error, the trajectory should be preprocessed first. In general, points with fewer neighboring points are identified as outliers. However, as shown in Figure 3, there is a big difference in density between different roads because of the different grades of the roads and the divisions of urban zoning. The density of the points in a city center is larger than that of the points at the edge of a city. If we were to use the same threshold to distinguish the noise in both the high and low density areas, the correct points would be recognized as noise in the areas with low density. Therefore, it is necessary to choose an adaptive density threshold to preprocess the data.

The density of a GPS point can be described by the null distribution [23]. Equation (1) is the expression of the null distribution of point P, where N(S) is the number of points for a spatial point dataset SD, λ is the intensity of the point dataset, and |B| is the area of SD. In this paper, we calculate the density of GPS point P through the number of points in buffer D. As shown in Figure 2, the probability that the number of points X in a buffer D satisfies X < n_i can be calculated via Equation (2). The value of λ can be calculated by Equation (3), λ = N_D / |D|, where N_D is the number of points in buffer D and |D| is the area of buffer D. Through Equation (4), we can calculate the probability of X ≥ n_i in buffer D, which is the complement of the probability in Equation (2).
As already noted, the density of GPS points differs from road to road. Therefore, we used a Delaunay triangulation network to calculate the radius R_S of the buffer (Equation (5)) from Mean(AT), the average length of the sides in the Delaunay triangulation network, and Variation(AT), the variance of the lengths of the sides. In the center of the city, the density of points is large, so the value of R_S is small. In contrast, the value of R_S is large at the edge of the city, as shown in Figure 4.
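The probabilities described above can be sketched in a few lines. This is an illustration under a stated assumption, not the paper's exact formulation: it takes the null distribution to be the homogeneous Poisson model, so that the number of points X in a buffer of area |D| has mean λ|D| and P(X = k) = e^(−λ|D|) (λ|D|)^k / k!. The function names are hypothetical.

```python
import math


def intensity(n_d: int, area_d: float) -> float:
    """Intensity as described for Equation (3): points per unit area in buffer D."""
    return n_d / area_d


def prob_fewer_than(n_i: int, lam: float, area_d: float) -> float:
    """P(X < n_i) for X ~ Poisson(lam * |D|) (assumed form of Equation (2))."""
    mean = lam * area_d
    term = math.exp(-mean)  # probability of exactly 0 points
    total = 0.0
    for k in range(n_i):
        total += term
        term *= mean / (k + 1)  # turn P(X = k) into P(X = k + 1)
    return total


def prob_at_least(n_i: int, lam: float, area_d: float) -> float:
    """P(X >= n_i), the complement of P(X < n_i), as in Equation (4)."""
    return 1.0 - prob_fewer_than(n_i, lam, area_d)


# Illustrative values only: a 30 m buffer and 0.002 points per square metre.
area = math.pi * 30.0 ** 2
p = prob_at_least(3, 0.002, area)
```

For very large means `math.exp(-mean)` underflows to zero, so a production version would work in log space; the sketch keeps the direct form for readability.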

Hierarchical Map Matching Algorithm (HST-Matching)
After the preprocessing, the points with large position errors have been removed. However, the accuracy of the GPS trajectory still cannot meet the requirements of a lane-level map. In this section, we propose a hierarchical map matching (HST-Matching) method, which improves the ST-matching algorithm [17], to match low-frequency floating car data with the OSM map. The HST-Matching algorithm consists of two parts: preliminary matching and ST-matching.
The OSM map, as crowdsourced geographic data, is a current research trend. Its main advantage is its low cost. From the literature [24][25][26], we know that most of the OSM map data have high accuracy, whereas the precision of the GPS points in the floating car measurement is 5-30 m. Therefore, we can match the GPS points with the OSM map to improve the accuracy of the GPS traces.

The OSM map consists of three parts: nodes, ways, and relations [27]. The nodes can be used to define standalone point features or the shape of a way. The ways are used to represent linear features, for example, rivers and roads. A way consists of an ordered list of between 2 and 2000 nodes and is defined as a polyline. A relation is used to describe the relationship between two or more data elements, including turn restrictions. The ways in the OSM map include not only roads but also rivers, subways, and boundaries. We selected only the types of roads on which vehicles can drive, including motorway, trunk, primary, secondary, tertiary, service, and residential roads.
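The road-type selection described above amounts to a filter on the `highway` tag of each way. The sketch below assumes a hypothetical in-memory representation in which each way is a dictionary with `id`, `nodes`, and `tags` keys; how the ways are actually loaded from OSM is outside its scope.

```python
# Road classes named in the text; these are values of the OSM "highway" tag.
DRIVABLE_HIGHWAYS = {
    "motorway", "trunk", "primary", "secondary",
    "tertiary", "service", "residential",
}


def is_drivable(way: dict) -> bool:
    """True if the way is tagged with one of the selected road classes."""
    return way.get("tags", {}).get("highway") in DRIVABLE_HIGHWAYS


def select_drivable(ways: list) -> list:
    """Keep drivable ways that have at least two nodes (a valid polyline)."""
    return [w for w in ways if is_drivable(w) and len(w.get("nodes", [])) >= 2]


# Made-up sample data for illustration.
ways = [
    {"id": 1, "nodes": [10, 11, 12], "tags": {"highway": "primary"}},
    {"id": 2, "nodes": [20, 21], "tags": {"waterway": "river"}},
    {"id": 3, "nodes": [30, 31], "tags": {"highway": "footway"}},
    {"id": 4, "nodes": [40, 41], "tags": {"highway": "residential"}},
]
roads = select_drivable(ways)
```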
A complete road is divided into many segments in the OSM map, R = {e_1, e_2, ..., e_n}. Each segment contains a starting point, e_i.start, an endpoint, e_i.end, and the nodes that control the shape of the road, e_i.control, as shown in Figure 5.
Before we introduce the map matching algorithm, it is necessary to describe the assumptions used in this paper.

Assumption 1. The vehicle runs on the roads, so the trajectory can be matched to at least one road.
Assumption 2. The path of the car tends to be direct, rather than a roundabout route. This means that the matching path between two GPS points will most likely be the shortest path.

Preliminary Matching
In the preliminary matching step, we generated a buffer with radius r around each point P_i, 1 ≤ i ≤ n, of a trajectory T = {P_1, P_2, ..., P_n} to retrieve the candidate segments and candidate points of P_i. As we had already preprocessed the trajectory, the noise points with the largest deviations had been removed. The radius of the buffer was set to 30 m.
An example is shown in Figure 6a. In the buffer of point P_i, there are three candidate segments {e_i^1, e_i^2, e_i^3}. The distances between P_i and the candidate segments are {d_i^1, d_i^2, d_i^3}, and the candidate points of P_i are {c_i^1, c_i^2, c_i^3}. As azimuth information is lacking in floating car data, we calculate the angle differences between the vector connecting points P_i and P_{i+1} and the directions of the candidate segments {e_i^1, e_i^2, e_i^3}. As shown in Figure 6b, the angle differences are {θ_i^1, θ_i^2, θ_i^3}. We use a threshold T_θ to filter out some of the candidate segments. If θ_i^j > T_θ, the segment corresponding to angle θ_i^j is removed. If only one candidate segment, e_i^1, remains, point P_i is counted as a high-confidence tracking point (HCTP) according to Assumption 1, and segment e_i^1 is the matched road of point P_i. Algorithm 1 shows the details of the preliminary matching procedure.

Algorithm 1 Preliminary Matching Algorithm
Input: …
…
    for j = 1 to C.count do
        θ_i^j = |azi_P_i − azi_e_i^j|;
        if θ_i^j < T_θ then
            CandidateList.add(c_i^j);
        end if
    end for
    if CandidateList.count == 1 then
        HCTPlist.add(P_i);
    end if
end for

If a point is counted as a high-confidence tracking point (HCTP), we retain only the candidate segment that satisfies threshold T_θ; the other candidate segments are deleted, and no further calculation is needed. This reduces the uncertainty and the running time of the matching algorithm. Points that are not counted as HCTPs are matched by the ST-matching algorithm.
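The angle filter of the preliminary matching step can be sketched as follows. This is an illustration, not the authors' code. The function names are hypothetical and azimuths are computed in a planar approximation. Each candidate segment is assumed to carry its own direction of travel. Unlike the plain absolute difference in Algorithm 1, `angle_difference` wraps around 360 degrees, which is a deliberate variation.

```python
import math
from typing import List, Tuple


def azimuth(x1: float, y1: float, x2: float, y2: float) -> float:
    """Planar azimuth in degrees, clockwise from north (+y), in [0, 360)."""
    return math.degrees(math.atan2(x2 - x1, y2 - y1)) % 360.0


def angle_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two azimuths, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def preliminary_match(
    point_azimuth: float,
    candidates: List[Tuple[str, float]],  # (segment id, segment azimuth)
    t_theta: float,
) -> Tuple[List[str], bool]:
    """Keep candidates within t_theta degrees; flag an HCTP if exactly one remains."""
    kept = [
        seg_id
        for seg_id, seg_azimuth in candidates
        if angle_difference(point_azimuth, seg_azimuth) < t_theta
    ]
    return kept, len(kept) == 1


# Made-up example: the vehicle heads roughly east; three candidate segments.
heading = azimuth(0.0, 0.0, 10.0, -0.3)
kept, is_hctp = preliminary_match(
    heading, [("e1", 90.0), ("e2", 270.0), ("e3", 180.0)], t_theta=30.0
)
```

The threshold value of 30 degrees is illustrative only; the text does not state the value of T_θ.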

Spatial-Temporal Matching
After the preliminary matching step, some points will have been matched with the OSM map. However, many points remain to be matched. We choose the ST-matching algorithm to match these points [17,28]. ST-matching is a stable global optimization matching algorithm that integrates the geometrical, topological, and speed information of the traces and the map. It includes two steps: spatial analysis and temporal analysis. In the spatial analysis, the observation probability is calculated from the shortest distance between the GPS points and the candidate points, and the transmission probability is estimated by comparing the distance between the GPS points with the shortest path between the candidate points. In the temporal analysis, the temporal probability is calculated by the cosine distance, which measures the similarity between the actual speed along the path and the road speed constraints. Finally, the candidate segment with the largest value is taken as the final matching result.

We improve this algorithm in three ways. First, we introduce the HCTPs. Second, we give different weights to the observation and transmission probabilities: because the shortest path is more reliable than the positional accuracy of GPS, the transmission probability is more reliable than the observation probability, and its weight should be set larger. Third, the road speed constraint should be a range rather than a single value. Therefore, we use a new method to calculate the temporal value, which compares the probability of the actual speed with the road speed constraint range.

Spatial Analysis
According to Reference [9], the position error of the GPS can be described with a normal distribution, and the observation probability is calculated via Equation (6), where d_i^j is the distance between point P_i and candidate segment e_i^j, d_i^j = dist(c_i^j, P_i), and µ and σ are the mean and variance value of the normal distribution.
Because of the position error of the GPS, it is not enough to consider only the Euclidean distance between the GPS trace and the segment. For example, as shown in Figure 7, there are two candidate points for P_i, (c_i^1, c_i^2). The observation probabilities of these two candidate points are equal, but the correctly matched point should be c_i^2 according to Assumption 2. Hence, topological information is important for map matching, because it allows us to exclude certain points. The formula of the transmission probability is shown in Equation (7), where dis(P_{i−1}, P_i) is the Euclidean distance between tracking points P_{i−1} and P_i, and S(c_{i−1}^k → c_i^j) represents the shortest path between candidate points c_{i−1}^k and c_i^j. There are many algorithms for the shortest path computation, such as the Dijkstra, Floyd, and A* algorithms [29]. Considering the efficiency of the algorithms, we choose the A* algorithm to calculate the shortest path.

Combining Equations (6) and (7), the spatial analysis function is given by Equation (8), where w_1 and w_2 represent the weights of the observation and the transmission probabilities, respectively, and w_1 + w_2 = 1.
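The spatial analysis can be illustrated with a short sketch. Several points are assumptions rather than statements of the paper's equations. The observation probability is taken as a normal density of the distance d_i^j, with `sigma` treated as a standard deviation. The transmission probability is taken as the ratio of the Euclidean distance to the shortest-path length. The two are combined as a weighted sum with w_1 + w_2 = 1. All function names and default parameter values are hypothetical.

```python
import math


def observation_probability(d: float, mu: float = 0.0, sigma: float = 20.0) -> float:
    """Normal density of the point-to-candidate distance d (assumed form of Eq. (6))."""
    return math.exp(-((d - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def transmission_probability(euclidean: float, shortest_path: float) -> float:
    """Ratio dis(P_{i-1}, P_i) / S(c_{i-1}^k -> c_i^j) (assumed form of Eq. (7))."""
    if shortest_path <= 0.0:
        return 1.0  # convention chosen for this sketch: identical candidate locations
    return euclidean / shortest_path


def spatial_score(d: float, euclidean: float, shortest_path: float,
                  w1: float = 0.4, w2: float = 0.6) -> float:
    """Weighted combination with w1 + w2 = 1 (assumed weighted-sum form of Eq. (8))."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    return (w1 * observation_probability(d)
            + w2 * transmission_probability(euclidean, shortest_path))


# Situation like Figure 7: both candidates are equally far from P_i,
# but reaching c1 requires a long detour while c2 lies on a direct path.
score_c1 = spatial_score(d=10.0, euclidean=100.0, shortest_path=500.0)
score_c2 = spatial_score(d=10.0, euclidean=100.0, shortest_path=120.0)
```

With equal observation probabilities, the candidate with the shorter connecting path receives the higher score, which is the behavior Assumption 2 calls for.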

Temporal Analysis
The spatial analysis can match the GPS trace to the OSM map in most cases. However, there are many elevated roads in China, and the shape and location of an elevated road are similar to those of the road underneath it. Therefore, it is difficult to match the GPS trace using only spatial analysis. However, as shown in Table 1, the road speed constraints for elevated roads differ from those for other roads. Therefore, it is feasible to use the road speed constraint information to refine our analysis. We calculate the probability of the actual speed of a vehicle in different road speed constraint ranges via Equation (9), where R.u and R.σ represent the mean value and variance of the candidate road speed, respectively.
Combining Equations (8) and (9) gives the final ST-matching function, Equation (10). After the HST-matching steps, the value of F(c_{i−1}^k → c_i^j) should be calculated for each candidate point. We select the highest score as the matching result between two HCTPs, as shown by the red line in Figure 8. Algorithm 2 shows the process of ST-matching.
Figure 8. … i is the ID of the candidate point. The points P_i and P_{i+k} are HCTPs, so each has only one candidate point. The black arrows represent the candidate matching paths from P_i to P_{i+1}, and the red arrows represent the final matching results.

Trajectory Correction Algorithm
The accuracy of the trajectory improves after the optimization approach; however, it is still not sufficient for the accurate generation of a lane-level map. As shown in Figure 9, the red points are matched to the yellow road and should therefore be located on the roadway running from right to left. However, some points appear on the opposite roadway because of the multipath effect, which decreases the accuracy of the generated lane-level map. To address this issue, this paper proposes a physical attraction model based on the matched OSM map.
According to [7,30], in the physical attraction model two types of forces act on the trajectories. One is an attractive force from the other traces in the same direction and on the same road. The other is a spring force that prevents the trace from moving away from its original position, as shown in Figure 10. All the traces in the same direction are grouped together by these forces. The accuracy of this approach can reach the road level, but the original information of the trace is lost. Moreover, this approach makes mistakes at crossings, and it is time-consuming to calculate the distance between one trace and all the other traces. Thus, this paper introduces the matched OSM map into the model to address these concerns.


In general, when a car drives on a road, it tends to stay in the same lane unless an emergency occurs or it reaches an intersection. Therefore, the traces tend to keep the same distance from the matched OSM road. We calculate the attractive force from the relative distance between the trace and the OSM road. Compared with calculating the distance between one trace and all the other traces, this greatly decreases the running time of the algorithm. At intersections, our method is more reliable because the OSM map constrains the direction of the force. In the model equations, M and σ are the two experimental parameters that determine the potential energy of the attractive force, K is the spring constant, d_i is the distance from P_i to the matched OSM map points, d̄ is the average of the distances d_i, and (y − d_i) is the difference between the new and the original position of point P_i, chosen so that F_1(P_i) equals F_2(P_i). As shown in Figure 11, we set the direction of the OSM road as the x-axis and the left side of the road as the y-axis. P'_i is the new position of P_i. The details are shown in Algorithm 3.

Algorithm 3 Physical Attraction Model
Input: Trajectory P_1 → P_2 … → P_n; OSM-WayID-List;
Output: New Trajectory P'_1 → … → P'_n;
for t = 1 to n do
    T = 0; K = ∞;
    d̄ = meandistance(d_1, d_2 … d_n);
    while T ≤ 20 && K > 0
        …

Experimental Data
To test the algorithms proposed in this paper, we collected about 40 million GPS points from thousands of taxis within one week in Wuhan, China. The sampling interval ranged from 1 s to 10 min, as shown in Figure 12. Over 60% of the points fall in the sampling interval range of 1-40 s.
Figure 12. The experimental data. (a) GPS points collected from taxis; the yellow dots represent the GPS points whose sampling interval ranged from 2-10 min; the gray dots are the points whose sampling interval ranged from 1-2 min; the orange dots are the points whose sampling interval ranged from 40-60 s; and the blue dots are the points whose sampling interval ranged from 1-40 s. (b) Distribution of the sampling intervals.

Trajectory Preprocessing
The trajectory preprocessing algorithm is an adaptive density optimization method. The result is shown in Figure 13. The red dots represent the selected GPS points, and the black dots represent the outliers, that is, the noise points removed by this method.

Map Matching
The ground truth for matching was labeled manually, which makes it more reliable than the synthetic trajectory data used in reference [17]. The labeled data contains 34 traces covering about 494 km, as shown in Figure 14.

Evaluation Approach
To evaluate the matching quality, we calculated the accuracy and recall, as shown in the following equations:
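The equations themselves are not reproduced in this section. As a rough illustration only, the sketch below uses one common convention: accuracy as the share of matched points that agree with the manual label, and recall as the share of all labeled points that were matched correctly. Both definitions, the function name, and the use of `None` for an unmatched point are assumptions and may differ from the definitions used in this paper.

```python
def matching_accuracy_recall(predicted, truth):
    """Compare predicted way IDs with labeled way IDs, point by point.

    predicted: list of matched OSM way IDs, or None where no match was made.
    truth:     list of manually labeled OSM way IDs (same length).
    """
    matched = [(p, t) for p, t in zip(predicted, truth) if p is not None]
    correct = sum(1 for p, t in matched if p == t)
    # Assumed definitions: accuracy over matched points, recall over all points.
    accuracy = correct / len(matched) if matched else 0.0
    recall = correct / len(truth) if truth else 0.0
    return accuracy, recall


# Four labeled points: two matched correctly, one wrongly, one left unmatched.
accuracy, recall = matching_accuracy_recall([1, 2, None, 4], [1, 3, 3, 4])
```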

Parameter Selection
In Section 3.3.1, we used the threshold $T_\theta$ to screen out the candidate segments. The accuracy and recall depend on the choice of $T_\theta$, as shown in Figure 15. The recall of the HCTPs is about 86-92%, and the points matched as HCTPs do not need to be matched further, which reduces the running time of the algorithm. The accuracy of the HCTPs is about 90%, which means that Assumption 1, as proposed above, is dependable. According to these results, the accuracy and recall reach their maximums when we set $T_\theta = 90^\circ$.

In the ST-matching step, the parameters are $\mu$, $\sigma$, $w_1$, $w_2$, $\tau$, $R.\sigma$, and $R.u$. $\mu$ and $\sigma$ are the mean value and variance of the normal distribution, respectively; in this paper, we set $\mu = 0$ and $\sigma = 10$. $w_1$ and $w_2$ are the weights of the observation and transmission probabilities, respectively. As explained above, the transmission probability is more reliable than the observation probability; therefore, we set $w_1 = 0.3$ and $w_2 = 0.7$. The value of $\tau$ was 10 km/h. $R.u$ and $R.\sigma$ are the mean value and variance of the road speed constraints, respectively. The values for the different roads are shown in Table 2.
Table 2. The mean ($R.u$) and variance ($R.\sigma$) of the road speed constraints for the different roads.
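To make the role of the ST-matching parameters concrete, the following sketch scores candidate segments with $\mu = 0$, $\sigma = 10$, $w_1 = 0.3$, and $w_2 = 0.7$. It is an illustration under stated assumptions rather than the paper's implementation: $\sigma$ is treated as a standard deviation in metres, the two probabilities are combined as a weighted sum, and the function names and the candidate tuple layout are hypothetical.

```python
import math

MU, SIGMA = 0.0, 10.0        # values from the text; SIGMA used as std dev (assumption)
W_OBS, W_TRANS = 0.3, 0.7    # weights of observation and transmission probability


def observation_probability(distance_m, mu=MU, sigma=SIGMA):
    """Normal density of the distance between a GPS point and a candidate segment."""
    z = (distance_m - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def weighted_score(distance_m, transmission_probability):
    """Weighted sum of the two probabilities (the combination rule is assumed)."""
    return (W_OBS * observation_probability(distance_m)
            + W_TRANS * transmission_probability)


def best_candidate(candidates):
    """candidates: list of (segment_id, distance_m, transmission_probability)."""
    return max(candidates, key=lambda c: weighted_score(c[1], c[2]))[0]


# Segment "a" is closer to the GPS point, but "b" is far more plausible as a transition.
choice = best_candidate([("a", 2.0, 0.2), ("b", 8.0, 0.9)])
```

Because $w_2 > w_1$, a candidate with a plausible transition can win over a geometrically closer one, which reflects the statement that the transmission probability is the more reliable of the two.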

Matching Result
We compared the HST-matching results with those of the ST-matching algorithm in terms of accuracy and running time. Figure 16 shows the accuracy comparison. When a trajectory contained 5-15 points, the HST-matching algorithm significantly outperformed the ST-matching algorithm, with an accuracy improvement of about 15%. As the number of points in a trajectory increased, the performance of the two algorithms became more similar, but the HST-matching algorithm still showed an improvement of about 8% over the ST-matching algorithm.

Running Time
As shown in Figure 17, the HST-matching algorithm is faster than the ST-matching algorithm, especially when the number of points is in the range 5-15. This is because we calculate the HCTPs first, which greatly reduces the number of unnecessary calculations, especially when there are fewer points on the trajectory. As the number of points increases, the time cost of both methods increases quickly and becomes more similar, because the algorithms need more time to calculate the shortest path as the number of points grows.

Parameter Selection
In this part, there are three main parameters that need to be set: $M$, $\sigma$, and $k$. According to reference [30], we set $M = 1$, $\sigma = 10$, and $k = 0.005$. Figure 18 shows the original data. The data is messy, and many points appear on the wrong side of the road, against the traffic regulations. After applying the data correction algorithm proposed in this paper, the positional accuracy of the data improved. The trajectory no longer appears in the opposite lane, and the points are corrected to the right position, as shown in Figure 19.
Figure 19. The results of the correction algorithm presented in this paper. The points from different directions separate well.

Correction Result
The algorithm proposed in reference [30] groups the traces with the same direction together; the gap between them is less than 0.5 m, as represented in Figure 20. With this approach, the accuracy of the data can only reach the road level, which is not sufficient to generate a lane-level map. Additionally, the original information of the floating car traces is lost. Moreover, this approach still makes some mistakes at intersections, as shown in Figure 21b, where some incorrect edges need to be removed. In contrast, in our method the points are matched to the OSM beforehand, so they can be correctly clustered, especially at intersections, as shown in Figure 21a.
Figure 20. The results of the algorithm proposed in reference [30]. The data with the same direction clustered together, and the maximum gap was 0.5 m.

The time complexity of the algorithm in reference [30] is $O(M^2)$, where $M$ is the number of nodes in the GPS dataset. For each node, the dataset was a square (100 × 100 m) centered at the node, and it took at least 15 s to calculate the data for each node. In contrast, the time complexity of the algorithm proposed in this paper is $O(M)$. Our algorithm needs only 150 ms to calculate the data for each point, a marked improvement on previous algorithms.

Conclusions
In this paper, we proposed a data correction algorithm for low-frequency floating car data. After preprocessing the data, we employed an HST-matching algorithm to match the GPS trajectories with the OSM map. The accuracy and running time of this algorithm were compared with those of the ST-matching algorithm. The accuracy of the HST-matching algorithm was consistently 8-15% higher than that of the ST-matching algorithm. Moreover, we needed less time to calculate the results because we adopted a hierarchical algorithm to calculate the HCTPs first. Next, the data was corrected by a physical attraction model based on the matched OSM map. A verification experiment was conducted based on the data of actual taxi trajectories. The results showed that the accuracy of the data after the correction was improved, especially at the crossroads. Moreover, we improved the time efficiency by 150 times.
This paper showed that OSM can be used to improve the accuracy of low-frequency floating car data. This study is also useful for increasing the precision of lane-level maps generated from the corrected data.
However, although we greatly improved the time efficiency, it still took a long time to calculate all the data because of the huge quantity of floating car data. In the future, we will continue to improve the calculation efficiency of this algorithm and to research the production of lane-level maps.

Conflicts of Interest:
The authors declare no conflicts of interest.