Map-Matching Using Hidden Markov Model and Path Choice Preferences under Sparse Trajectory

: In the ﬁeld of map matching, algorithms using topological relationships of road networks along with other data are normally suitable for high frequency trajectory data. However, for low frequency trajectory data, the above methods may cause problems of low matching accuracy. In addition, most past studies only use information from the road network and trajectory, without considering the traveler’s path choice preferences. In order to address the above-mentioned issue, we propose a new map matching method that combines the widely used Hidden Markov Model (HMM) with the path choice preference of decision makers. When calculating transition probability in the HMM, in addition to shortest paths and road network topology relationships, the choice preferences of travelers are also taken into account. The proposed algorithm is tested using sparse and noisy trajectory data with four different sampling intervals, while compared the results with the two underlying algorithms. The results show that our algorithm can improve the matching accuracy, especially for higher frequency locating trajectory. Importantly, the method takes into account the route choice preferences while correcting deviating trajectory points to the corresponding road segments, making the assumptions more reasonable. The case-study is in the city of Beijing, China.


Introduction
The application of GPS location data in traffic research has become a common phenomenon in the digital society. In the meantime, location data can be easily integrated with geographic information system (GIS) for its geographical characteristics, which leads to a large number of applications based on geographic location, such as vehicle navigation, location-based service, etc. In the above integration, the crucial step is to combine location data with the spatial network data to identify the correct road segment and determine the position on the road network, which is referred to as map matching (MM) [1]. Effective map matching can be achieved when location data and road networks have high accuracies.
However, the devil is in the detail. In general, the quality of GPS measurements can be influenced by atmospheric disturbances, the synchronization of clocks, the geographical features of the observed region, inadequate transmitting formats, or unknown human or instrumental errors, which leads to a certain deviation between the collected data and the real location data. Militino et al. [2] assumed that when all the GPS observables are taken in the same conditions, it is still possible to estimate the positional errors as the difference between the real coordinates and those measured by the GPS. In the field of MM, it is typical to set a positioning error threshold to filter out the road segments corresponding to the trajectory points. In general, the error of the device collecting the GPS trajectory is known. In addition, the road network used by researchers was of limited quality, which was either outdated or did not cover all existing roads. These issues can turn a simple problem into a complex one, so it is critical to detect and resolve these data quality issues when performing map matching.
According to the sampling density of trajectory data, approaches for map matching algorithms can be categorized into high frequency sampling algorithms and low frequency sampling algorithms [3]. When the GPS sampling frequency is high, generally, the sampling interval is less than 10 s, and even if only the track points are matched to the nearest road segment, a high accuracy ratio can be obtained. Road network matching techniques based on high frequency sampling data have been widely adopted in commerce. However, due to the expensive storage and transmission costs of high-frequency sampling data, low-frequency trajectory is generally used in practice. For trajectories with long sampling intervals, existing methods cannot fully guarantee that the matched paths match the actual choices of decision makers because there are multiple candidate paths between neighboring localization points. Moreover, it is also challenging to deal with such problems in complex urban road networks, which is the focus of this paper.
For the low sampling frequency trajectory, a large number of map matching methods are proposed [4,5]. While some advanced methods use techniques such as fuzzy logic and Bayesian inference, most studies are based on the work of Newson [6], which is based on Hidden Markov Models (HMM). In detail, such algorithms only use the geometric and topological information of the road network to estimate the parameters of the HMM, which ignore the decision of the driver.
To address this issue, we propose a new algorithm which integrate the route choice behavior and HMM algorithm for low sampling trajectories. In the process of selecting candidate matched positions for each trajectory point, except the shortest distance between the trajectory point and the selected link, other factors such as the instantaneous velocity, the speed limitation and route choice habits of drivers are also taken into account. For example, people generally tend to select roads with higher speed limits in a trip. On the other hand, when inferring the paths between adjacent trajectory points, we use the shortest path algorithm and the geometric and topological relationships of the road network.
The rest of this paper is organized as follows. Section 2 reviews previous related work in the area of map matching. Section 3 presents the map matching based on a path selection model with an HMM algorithm. Section 4 present the results of numerical experiments and compares them with the two underlying algorithms. In Section 5, conclusions are presented and future research directions are discussed.

Literature Review
Map matching (MM) algorithms date back to the 1980s, when researchers initially used geometric and topological information to perform matching. It only uses the geometric information of the spatial-temporal road network for matching, such as distance, angle, shape, etc., without considering the connectivity between road sections. A curve-to-curve method was provided by White et al. [7] to prove that using the topological information of road networks can improve matching quality. However, such approaches are sensitive to noise and cannot be corrected in time in case of a false match.
Then, a probabilistic map matching algorithm became more popular for its robustness. Jong-Sun et al. [8] presented a map matching method using the multiple hypothesis technique. He succeeded in determining a road between neighboring trajectory points that are far away by applying a probabilistic method, but the algorithm is not applicable to sparse localization data. Bierlaire et al. [9] proposed a probabilistic map matching approach by generating a set of potential true paths and associating a likelihood with each of them. The disadvantage is that the method works well in high-frequency trajectory data, while the mismatch at intersections is high. Later, more advanced methods were introduced in the field of map matching, such as Kalman filter, fuzzy logic model and Hidden Markov Model (HMM). These methods focus on the perspective of the overall situation for all Sustainability 2021, 13, 12820 3 of 15 locating data and their candidate road segments rather than calculating between individual points and nearby candidate links [10]. It typically takes three steps. In the initial step, we need to obtain all road segments with loosely constrained distance near the positioning point, which are considered as candidate links. The value of this distance depends on the relevant features of the traffic road network and the localization error of the trajectory data. The second step is to allocate a matching probability to each candidate link. This probability depends on the topological relationship between the trajectory point and the candidate road segments, such as the similarity of azimuths, the distance between them, the speed of the point and the speed limit of the road segments, etc. The final step is to deduce the most probable path between adjacent trajectory points, which is obtained by calculating the possible paths between their respective alternative points. At this stage, previous studies generally treat the shortest path between candidate points as the actual path, which is inconsistent with the reality.
Sparse trajectory is a common issue in map matching. To fill the paths between adjacent sparse trajectory points, Lou et al. [11] proposed to use the most probable path. Their algorithm is based on two intuitive assumptions: the actual paths tend to be straightforward, and the trajectories are constrained by speed limits. Quddus and Washington [4] propose a map matching algorithm by considering the connectivity of connections and the turning limits of intersections, combined with weight-based shortest paths. Hsueh and Chen [12] perform map matching under low sampling rate conditions by exploring the real-time movement direction of vehicles. However, the above studies only consider spatial characteristics such as distance information and topology of the road network, as well as the speed constraint, ignoring the human choice factor.
To overcome the above limitations of the existing HMM-based map matching algorithm, we combine the HMM with the path selection model, which takes into account the driving habits of the travelers and ensures the feasibility of the matched paths. The experimental results in Section 4 show that our algorithm has higher matching accuracy than the two benchmark algorithms.

Methods and Techniques
This section introduces the road network matching method proposed in this paper, which consists of four main parts. Firstly, defining the preliminary notations and terms used for map matching. Secondly, the basic workflow of the paper is briefly explained. Thirdly, introducing the data that will be used in this paper and the process of preparing them. In the end, we state the key ideas of the approach adopted by the Hidden Markov Model and path selection model.

Notations and Definitions
Definition 1 (Road intersection). A road intersection v is an intersection of multiple roads in a road network, which is associated with a longitude (v.lon) and a latitude (v.lat).

Definition 2 (Road segment).
A road segment e is a directed edge represented by one start node (e.start), one end node (e.end) and a segment length (e.length). A two-way street is regarded as two individual segments.
Definition 3 (Road network). A road network is a directed graph G(V, E); here, V is the vertex set representing the road intersections and E is the edge set representing the road segments.
Definition 4 (Observation point). An observation point p contains the spatial-temporal information of a person when using GPS devices to collect their travel information. Each point is associated with longitude (p.lon) and latitude (p.lat) coordinates with a timestamp (p.time).
Definition 5 (Trajectory). In general, a trajectory T is a sequence of observation point ordered by the timestamps (i.e., p 0 → p 1 → · · · → p n ). Definition 7 (Inferred route). A inferred route L is a sequence of connected road segments which a vehicle is believed to travel given the observation of a trajectory T. In the work presented in this paper, an inferred route does not necessarily start and end at nodes. It could start or end at any point that lies along the centerline of any road segment.
Given a sequence of N observations o 1:N = {o 1 , o 2 , · · · , o n } and a road network G, the map matching problem is to find the inferred route L in G corresponding to o 1:N .

Algorithm Overview
The proposed algorithm is designed to select the best matched road segment for each GPS trajectory point. Figure 1 shows the architecture of our algorithm, which consists of two parts: data preprocessing and the construction of HMM. Given a sequence of N observations 1: ⋅⋅⋅ and a road network G, the map matching problem is to find the inferred route L in G corresponding to .

Algorithm Overview
The proposed algorithm is designed to select the best matched road segment for each GPS trajectory point. Figure 1 shows the architecture of our algorithm, which consists of two parts: data preprocessing and the construction of HMM. This part contains two steps: trajectory clustering and road network processing. The GPS trajectory data were collected from dozens of volunteers in step 1. In the process of data integration and cleansing, converting the raw data into a format that allows for data mining. After that, the DBSCAN clustering algorithm is used to exclude the redundant trajectory points to improve the matching efficiency.
In the step of road network processing, the original road network, which presented in XML, is easily used to send data but has poor availability. By extracting relative information from the original file, standard road network can be established. Except the typical component of road network G, it also includes other elements, such as the road class, oneway road or not, the road's name, the length and orientation of a road segment, that can be used in later procedure.
After the above data processing operations, the data suitable for the subsequent model construction are prepared. This part contains two steps: trajectory clustering and road network processing. The GPS trajectory data were collected from dozens of volunteers in step 1. In the process of data integration and cleansing, converting the raw data into a format that allows for data mining. After that, the DBSCAN clustering algorithm is used to exclude the redundant trajectory points to improve the matching efficiency.
In the step of road network processing, the original road network, which presented in XML, is easily used to send data but has poor availability. By extracting relative information from the original file, standard road network can be established. Except the typical component of road network G, it also includes other elements, such as the road class, one-way road or not, the road's name, the length and orientation of a road segment, that can be used in later procedure.
After the above data processing operations, the data suitable for the subsequent model construction are prepared.

• Part 2: Hidden Markov Model
Firstly, the network topology information obtained previously is used to calculate the observation probability. Secondly, the shortest path algorithm is applied to measure the shortest path and distance between candidate segments of time i and i + 1, and then the Sustainability 2021, 13, 12820 5 of 15 transition probability is computed. Lastly, HMM is used to calculate the probabilities of various inferred paths corresponding to track T, and the route with the highest probability is chosen as the optimal route.

•
In summary, the overall flow of the proposed algorithm is as follows.
(1) Input: Road network G(V, E), each road segment (lon, lat) ∈ E is associated with a list of properties, such as road ID, name, the start and end node, grade, speed limit, azimuth and a sequence of N observations o 1:N = {o 1 , · · · , o n }; (2) For a threshold value R, set the road segments within the proximity of observation t(t = 1, 2, . . . N) as the candidate links, as shown in Figure 2. Then, the observation probability for each candidate link is calculated; is used to compute the state at time step t; (5) At the last time step t = N, the path with the highest probability is taken as the optimal path.
• Part 2: Hidden Markov Model Firstly, the network topology information obtained previously is used to calculate the observation probability. Secondly, the shortest path algorithm is applied to measure the shortest path and distance between candidate segments of time and , and then the transition probability is computed. Lastly, HMM is used to calculate the probabilities of various inferred paths corresponding to track T, and the route with the highest probability is chosen as the optimal route.

•
In summary, the overall flow of the proposed algorithm is as followsL (1) Input: Road network , each road segment is associated with a list of properties, such as road ID, name, the start and end node, grade, speed limit, azimuth and a sequence of N observations ; (2) For a threshold value , set the road segments within the proximity of observation as the candidate links, as shown in Figure 2. Then, the observation probability for each candidate link is calculated; (3) When , is treated as the initial state; (4) When , the Viterbi algorithm is used to compute the state at time step ; (5) At the last time step , the path with the highest probability is taken as the optimal path. The following subsections present details of these steps.

GPS Trajectory Data
In 2019, we employed about 60 volunteers to record GPS trajectory data of their daily activities while using a vehicle. Each observation point contains five fields: volunteer ID, time, longitude, latitude and speed, as shown in Table 1. The sampling frequency is 5 s, and the total size is more than 80,000 pieces of records collected from 196 trips. Since this paper addresses the sparse data map matching problem, certain GPS points are removed from the original dataset, and datasets with sampling intervals of 30, 45 and 60 s are generated. The following subsections present details of these steps.

GPS Trajectory Data
In 2019, we employed about 60 volunteers to record GPS trajectory data of their daily activities while using a vehicle. Each observation point contains five fields: volunteer ID, time, longitude, latitude and speed, as shown in Table 1. The sampling frequency is 5 s, and the total size is more than 80,000 pieces of records collected from 196 trips. Since this paper addresses the sparse data map matching problem, certain GPS points are removed from the original dataset, and datasets with sampling intervals of 30, 45 and 60 s are generated. Different types of data problems such as data loss or data duplication often occur due to human negligence, equipment anomalies and other factors [13]. Direct analysis of these problematic data will produce erroneous or meaningless results and must be corrected through a process of data integration and cleansing prior to building data mining models.
After integrating and cleaning the original data, the next operation is to filter the dense track points using clustering. For two consecutive trajectory points, when the speed is very low, the distance will be too small to affect the subsequent map matching process. In this procedure, DBSCAN is applied to exclude dense points.
After data sampling, integration and cleansing and clustering, one volunteer trajectory will only contain several tens of geographic coordinates after preprocessing, just as Figure 3 illustrates.  Different types of data problems such as data loss or data duplication often occur due to human negligence, equipment anomalies and other factors [13]. Direct analysis of these problematic data will produce erroneous or meaningless results and must be corrected through a process of data integration and cleansing prior to building data mining models.
After integrating and cleaning the original data, the next operation is to filter the dense track points using clustering. For two consecutive trajectory points, when the speed is very low, the distance will be too small to affect the subsequent map matching process. In this procedure, DBSCAN is applied to exclude dense points.
After data sampling, integration and cleansing and clustering, one volunteer trajectory will only contain several tens of geographic coordinates after preprocessing, just as Figure 3 illustrates.

Road Network Data
We download the urban data of Beijing from OpenStreetMap, a website that aims to create and provide free geographic data, such as street maps, to anyone [14]. The original data file represents physical features on the ground (e.g., roads or buildings) using tags attached to their basic data structures, including nodes, ways and relations. The main attributes include the longitude and latitude of a node, the sequence of connecting nodes in a way, the ID and other tags of way, such as the name, if it is a highway and one-way or not.
Firstly, in order to construct the road network of Beijing, we need to extract information from label ways that equal highways. In detail, a highway has labels including "motorway", "trunk", "primary", "secondary", "tertiary", "unclassified", "motor-way_link", "trunk_link", "primary_link", "secondary_link", "tertiary_link" that are selected in our work, for only the above types of roads are suitable for vehicles. Secondly, a

Road Network Data
We download the urban data of Beijing from OpenStreetMap, a website that aims to create and provide free geographic data, such as street maps, to anyone [14]. The original data file represents physical features on the ground (e.g., roads or buildings) using tags attached to their basic data structures, including nodes, ways and relations. The main attributes include the longitude and latitude of a node, the sequence of connecting nodes in a way, the ID and other tags of way, such as the name, if it is a highway and one-way or not.
Firstly, in order to construct the road network of Beijing, we need to extract information from label ways that equal highways. In detail, a highway has labels including "motorway", "trunk", "primary", "secondary", "tertiary", "unclassified", "motorway_link", "trunk_link", "primary_link", "secondary_link", "tertiary_link" that are selected in our work, for only the above types of roads are suitable for vehicles. Secondly, a way composited by ordered nodes is knitted, the name and one-way or not is also recorded in this step.
Then, according to the Road Speed Limit Regulations of Beijing, the maximum speed is set for various roads. Finally, the length and bearing of the selected road is computed based on its longitude and latitude.
The selected road network contains 10,545 road segments and 9188 points. Apart from the linking sections, the average length of road segment is 91.54 m. Figure 4 presents the actual scene of the experiment and the display of the original trajectory data on the map of this paper. Figure 5 shows the basic structure of the road network that we extracted. way composited by ordered nodes is knitted, the name and one-way or not is also recorded in this step. Then, according to the Road Speed Limit Regulations of Beijing, the maximum speed is set for various roads. Finally, the length and bearing of the selected road is computed based on its longitude and latitude. The selected road network contains 10,545 road segments and 9188 points. Apart from the linking sections, the average length of road segment is 91.54 m. Figure 4 presents the actual scene of the experiment and the display of the original trajectory data on the map of this paper. Figure 5 shows the basic structure of the road network that we extracted.  It should be noted that the OSM road network data were accessed in 2021, while the GPS trajectory data were collected in 2019. There is no big change in the road network from 2019 to 2021 in Beijing, particularly within the central part, where the data we collected. Minor changes in road networks would significantly affect the matching results, so better results could be anticipated if high-quality road network data are available.  way composited by ordered nodes is knitted, the name and one-way or not is also recorded in this step. Then, according to the Road Speed Limit Regulations of Beijing, the maximum speed is set for various roads. Finally, the length and bearing of the selected road is computed based on its longitude and latitude. The selected road network contains 10,545 road segments and 9188 points. Apart from the linking sections, the average length of road segment is 91.54 m. Figure 4 presents the actual scene of the experiment and the display of the original trajectory data on the map of this paper. Figure 5 shows the basic structure of the road network that we extracted.  It should be noted that the OSM road network data were accessed in 2021, while the GPS trajectory data were collected in 2019. There is no big change in the road network from 2019 to 2021 in Beijing, particularly within the central part, where the data we collected. Minor changes in road networks would significantly affect the matching results, so better results could be anticipated if high-quality road network data are available. It should be noted that the OSM road network data were accessed in 2021, while the GPS trajectory data were collected in 2019. There is no big change in the road network from 2019 to 2021 in Beijing, particularly within the central part, where the data we collected. Minor changes in road networks would significantly affect the matching results, so better results could be anticipated if high-quality road network data are available.

HMM in Map Matching
A Hidden Markov Model is a type of Markov Chain, which is defined as "A stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event" [15]. It is useful in events in which one is interested but may not be directly observable in the world. In the field of map matching, an HMM models a system assumed to be a Markov process with unobservable states and observable observations [16].
In general, an HMM has the following two assumptions [17]: • Markov Assumption: The probability of a particular state is dependent only on the immediate previous state: p(s t |s 1 , o 1 , · · · , s t−1 , o t−1 ) = p(s t |s t−1 ) t = 1, 2, · · · , T • Independent Assumption: The probability of an observation o t is dependent only on the state that produced the observation S i and not on any other state or any other observations: Figure 6 illustrates the model used. For a hidden state point S t at time step t, we can obtain the corresponding observation point o t with observation probability b t . This state point could transfer to the next hidden state point S t+1 with a transition probability a t+1 . At first, it should have an initial state π, which denotes the probabilities of being in various states of the start time. Transition probability A, observation probability B and initial state probability vector π are called the three elements in HMM [18].

HMM in Map Matching
A Hidden Markov Model is a type of Markov Chain, which is defined as "A stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event" [15]. It is useful in events in which one is interested but may not be directly observable in the world. In the field of map matching, an HMM models a system assumed to be a Markov process with unobservable states and observable observations [16].
In general, an HMM has the following two assumptions [17]: • Markov Assumption: The probability of a particular state is dependent only on the immediate previous state:  In an HMM, there could be multiple sequences of states that are consistent with a given sequence of observations. The most likely state sequence can be efficiently calculated using the Viterbi algorithm [19]. The following introduces the workflow of this part.

Transition Probability
The transition probability between a state at time step and another state at time step depends on the features of the optimal path between them. For example, two candidates who lie on the same or consecutive road segment are more likely In an HMM, there could be multiple sequences of states that are consistent with a given sequence of observations. The most likely state sequence can be efficiently calculated using the Viterbi algorithm [19]. The following introduces the workflow of this part.

Transition Probability
The transition probability between a state S t-1,j at time step t-1 and another state S t,k at time step t depends on the features of the optimal path between them. For example, two candidates who lie on the same or consecutive road segment are more likely than on a parallel one, for it is unlikely that the object jumps from one road to the other across two sequential observations. Meanwhile, the road level of the candidate segments will not differ greatly.
In this paper, the path with the minimum free-flow travel time [20] and the difference in road class between the candidate road segments are considered as the main influence factors. This is calculated by assuming that the Euclidean distance between two candidates should be similar to the road-network distance between them. Haversine formula [21] is used to calculate the distance of two points, and Dijkstra's algorithm [22] is applied when searching for the shortest path distance in road network between two candidates.
Firstly, we consider a temporal implausibility in order to allocate low transition probabilities to paths that are not possible within the time ∆T unless driving exceeds the speed limit. It defined as follows: where: T f ree = the free-flow travel time in seconds; ∆T = the time interval between time steps t + 1 and t in seconds. Generally, the shortest path distance between two candidate points at time step t + 1 and t is usually consider the most important factor [23], which is defined as the following distance variance probability: here: D r , D e are the shortest path distance and Euclidean distance of two candidate points, respectively; β D = the parameter of the exponential distribution; R= the radius of the Earth; φ A , φ B are the latitude of point 1 and point 2 (in radians), respectively; L A , L B are the longitude of point 1 and point 2 (in radians), respectively.
In addition to the topological information of the path, it is also important to consider the path selection habits of the driver [20]. In this part, we use a logit model to estimate the driver's preference and consider all drivers to be identical as follows: where: χ arc , β arc are the average road class and its parameter for the candidate route, respectively; χ nce , β nce are the number of class changes and their parameters for the candidate route, respectively.
Therefore, the transition probability can be defined as follows: p(s t |s t−1 ) = p time ·p D ·p choice

Observation Probability
The observation probability is the likelihood of each candidate point for a GPS point belonging to that point. For example, a candidate road segment closer to the raw observation point would have a higher measurement probability than one that is further away.
Similar to previous studies, the observation probability model refers only to the current candidate point and the observation point [24]. In this paper, observation probability is determined by distance, bearing deviation and speed constraint rules between observation and candidate road segments. In addition, we regard the point closest to the observation on the road segment as the candidate point.
As in previous studies, we treat the shortest distance between a trajectory point and a candidate road segment as the most important factor and define it as the distance constraint probability. In the case of the true state, the distance between itself and the observed location is the location measurement error, which is generally assumed to follow a Gaussian distribution with zero mean. The distance constraint probability is computed as follows: where: dis = the great circle distance of two candidate points. It is worth noting that, in practice, the location measurement error may not strictly conform to the above model, especially in dense urban networks. Irrespective of the positioning technology used, the error is known to exhibit non-Gaussian characteristics and geographical variations. However, the model based on Gaussian distribution, though simple, has been shown to be effective in several previous works on map matching [25].
In addition to the distance constraint, the azimuth difference is another important factor to be considered, which is defined as the azimuth constraint probability. The bearing θ of node A is the angle between the North and the direction to the next node B on the route. The value can be calculated with the following haversine formula: where: θ obs = the instantaneous bearing of the observation point; θ candi = the azimuth of the candidate road segment. Finally, considering traffic rules, such as the speed of a vehicle cannot exceed the speed limit, traffic speed constraint probability is defined as follows: where: α spe = the instantaneous velocity of the observation point; α 0 = the maximum speed limit of candidate road. Therefore, the observation probability can be introduced as follows:

Viterbi Algorithm
The Viterbi algorithm takes the series of hidden states (candidate points) along with their associated measurement or transition probabilities as inputs. The first matched point is found simply as the candidate with the highest observation probability, as there are no transition probabilities for this base case. After this, the sets of candidates (for each GPS point) are iterated through and the candidate with the highest combination of measurement and transition probability is selected, considering only the observation transitions to the previous candidate points as per the Markov assumption.
It calculates the most probable sequence in the HMM using the following recurrence formulas: where: p(o t s t,k ) = the observation probability at time step t; p(s t,k s t−1,j ) = the transition probability from time step t-1 to t; V t,k = the maximum probability among all paths at time step t when selecting candidate link k.

Maximum Distance
In the process of MM, road segments with a distance less than the maximum distance from the GPS trajectory point will be selected as a candidate link, and those with a greater distance will be excluded. In previous research results, the maximum distance is determined by the measurement error and standard deviation of an observation.
The measurement error ε m , which influences the efficiency and accuracy of map matching, includes trajectory point positioning errors and road data errors. ε is represented by Equation (15): where: ε p = positioning error; here, we take its value as 20 m; ε r = road data error. ε r is mainly caused by the difference between the actual road width and the road line data, and its calculation is as shown in Equation (16) [26]: (16) where: w = the width of road, and the value of 40 m is taken here, which is approximately the width of eight lanes in both directions; α = the angle between two intersecting roads. In order to simplify the calculation, the angle is generally considered to be 90 degrees.
In a normal distribution, 95% of the data fall within three standard deviations that are taken as 7.6 m [27], as follows: According to the above influencing factors, the maximum distance is finally determined to be 50 m.

Parameters of the DBSCAN
DBSCAN is a density-based clustering algorithm: Given a set of points in some space, it combines those points that are closely aligned together and marks those points that lie individually in low-density regions as outliers. It requires two parameters: ε indicates the maximum distance of a neighborhood, and MinPts describes the minimum number of points required to form a dense neighborhood [28].
In the prepared road network, excluding the connecting roads, the average length of the road segments is about 90 m. To construct a sparse trajectory, MinPts is set to 3 and ε takes twice the average length of the section as 180 m to ensure that the contiguous observation points do not correspond to adjacent sections.

Assessment Criteria
In order to perform a comprehensive evaluation of the proposed algorithm, two evaluation metrics are used for comparison, including one accuracy criteria and one criterion related to the computation time.
The first is the accuracy ratio of matched points (AR), defined as follows [29,30]: where: CN k = the number of points matched to the correct road segment in trajectory k; N k = the total number of points in trajectory k; T = the number of trajectories. The second measurement criterion is the average time consumption for per trajectory point (AT), which is defined as follows: where: TC k = the time consumption of trajectory k.

Results
As mentioned above, we compared the performance of the three algorithms at three sampling intervals of 20, 30, 45 and 60 s. The overall and partial results before and after map matching with a sampling interval of 120 s are shown in Figure 7, and the measurement results with variable sampling intervals are presented in Table 2.   In addition to self-comparisons at different sampling intervals, we also compare the quality of the proposed shortest path-based HMM algorithm with two benchmark MM algorithms: a shortest path-based map matching algorithm (SPM) [4] and the HMM map matching algorithm (HMM) described in [31], using the dataset described in 3.3. The results of the experiment are shown in Figure 8. According to Table 2 and Figure 8, we can draw the following conclusions: • As shown in Table 2, the accuracy ratio of matching decreases rapidly, and the average time consumption increases slowly when the sampling interval is increased. For the measurement of AT, when a larger sampling interval is selected, the amount of map data used to calculate the shortest path between adjacent trajectory points does not change much, resulting in constant average time consumption. Furthermore, when examining the mismatched points, we found that most of the In addition to self-comparisons at different sampling intervals, we also compare the quality of the proposed shortest path-based HMM algorithm with two benchmark MM algorithms: a shortest path-based map matching algorithm (SPM) [4] and the HMM map matching algorithm (HMM) described in [31], using the dataset described in 3.3. The results of the experiment are shown in Figure 8.  In addition to self-comparisons at different sampling intervals, we also compare the quality of the proposed shortest path-based HMM algorithm with two benchmark MM algorithms: a shortest path-based map matching algorithm (SPM) [4] and the HMM map matching algorithm (HMM) described in [31], using the dataset described in 3.3. The results of the experiment are shown in Figure 8. According to Table 2 and Figure 8, we can draw the following conclusions: • As shown in Table 2, the accuracy ratio of matching decreases rapidly, and the average time consumption increases slowly when the sampling interval is increased. For the measurement of AT, when a larger sampling interval is selected, the amount of map data used to calculate the shortest path between adjacent trajectory points does not change much, resulting in constant average time consumption. Furthermore, when examining the mismatched points, we found that most of the According to Table 2 and Figure 8, we can draw the following conclusions: • As shown in Table 2, the accuracy ratio of matching decreases rapidly, and the average time consumption increases slowly when the sampling interval is increased. For the measurement of AT, when a larger sampling interval is selected, the amount of map data used to calculate the shortest path between adjacent trajectory points does not change much, resulting in constant average time consumption. Furthermore, when examining the mismatched points, we found that most of the mismatches occurred because the traveler did not choose the shortest path, which was inconsistent with our hypothesis and resulted in lower matching accuracy at larger sampling intervals. • Figure 8a shows the matching accuracy of the three algorithms for four different sampling intervals, the horizontal axis represents the sampling frequency and the vertical axis represents the percentage of accuracy. The AR values of all three methods are higher than 0.9, while the proposed algorithm has higher matching accuracy than the other two. We also note that all three algorithms can achieve good matching precision at high frequency sampling, and the shortest path-based HMM performs better than the HMM and shortest path algorithms when the sampling frequency is low.

•
In Figure 8b, the HMM algorithm has the smallest ATC for all four sampling intervals and the longest average time consumed by the SPM algorithm. On the other hand, when the sampling interval is greater than 30 s, the proposed algorithm and the HMM algorithm generate relatively stable ATC, while the ATC of the SPM algorithm gradually increases.
In summary, the proposed algorithm has higher matching precision than the two compared benchmark algorithms, especially when the trajectory points have a lower sampling frequency. However, its average time consumption is between the two algorithms, which is not satisfactory.

Conclusions and Discussions
In the paper, we propose a new map matching algorithm by integrating the HMM methods with the path choice preference. As the traditional shortest path-based MM algorithm for sparse trajectories, we first assume that the decision maker will choose the shortest path to travel between two candidate points. Furthermore, as with typical HMM-based algorithms, geometric topology information was applied, which includes both velocity constraints and time constraints. However, in addition to these strategies adopted above, the path selection preferences of drivers were also taken into account in this paper. This approach, which are better fit with the actual scene, can significantly increase matching efficiency without introducing excessive time consumption.
Based on real GPS trajectory data in Beijing, extensive experiments were conducted to verify the matching performance of the proposed algorithm. The experimental results show that when the sample interval is larger than 30 s, the proposed method improves the matching accuracy by 1% compared to the two benchmark algorithms. Although the algorithm can significantly improve the precision of map matching, its matching efficiency is slightly reduced, especially in the condition of sparse trajectory data. Without considering the time cost, the approach can help transportation researchers and practitioners process GPS track datasets more efficiently to match their own GIS data.
The matching efficiency of the proposed algorithm is not entirely satisfactory, and we guess that it consumes too much time using the shortest path strategy. Furthermore, there are situations where decision makers face more options when making path choices than just considering the shortest path and simple path selection preferences. Future research can explore more matching strategies to address these shortcomings and to improve the efficiency of matching and the plausibility of assumptions.