A Spatial-Temporal-Semantic Neural Network Algorithm for Location Prediction on Moving Objects

Abstract: Location prediction has attracted much attention due to its important role in many location-based services, such as food delivery, taxi service, real-time bus systems, and advertisement posting. Traditional prediction methods often cluster track points into regions and mine movement patterns within the regions. Such methods lose information about points along the road and cannot meet the demand of specific services. Moreover, traditional methods utilizing classic models may not perform well with long location sequences. In this paper, a spatial-temporal-semantic neural network algorithm (STS-LSTM) is proposed, which includes two steps. First, the spatial-temporal-semantic feature extraction algorithm (STS) is used to convert the trajectory into location sequences with fixed and discrete points in the road network. The method can take advantage of points along the road and can transform a trajectory into model-friendly sequences. Then, a long short-term memory (LSTM)-based model is constructed to make further predictions, which can better deal with long location sequences. Experimental results on two real-world datasets show that STS-LSTM has stable and higher prediction accuracy than traditional feature extraction and model building methods, and the application scenarios of the algorithm are illustrated.


Introduction
With the rapid growth of positioning technology, locations can be acquired by many devices such as mobile phones and global positioning system (GPS)-based equipment. Location prediction is of great significance in many location-based services. For example, while a package is being delivered, customers are eager to know where the courier is and which place he will visit next, in order to estimate the arrival time of the package. The same scenario applies to food delivery. Moreover, in public transportation systems, passengers want to know where the nearest taxi or bus will go so as to estimate their waiting time. In real-time advertising systems, the places where customers will go are important because they determine which kinds of ads are posted. Location prediction can be defined as follows: given a series of locations, pre-collected or real-time-dependent, location prediction techniques infer the next location where the object is most likely to go.
Due to the continuity of space and time, a trajectory is not suitable to be directly imported into a prediction model. Before using prediction models, each of the points in a trajectory is first preprocessed in order to convert the real continuous values associated with the geospatial coordinates of latitude and longitude into discrete codes associated with specific regions. Traditional prediction methods usually start by clustering trajectories into frequent regions or stay points, or simply partition trajectories into cells. Trajectories are transformed into clusters or grids with discrete codes, then pattern mining or model building techniques are utilized to find frequent patterns along the clusters. For example, the historical trajectories of a person show that he always goes to the restaurant after the gym. If the person is now in the gym, it is a distinct possibility that the next place he will visit is the restaurant.
However, traditional cluster-based or cell-based methods ignore trajectories between the clusters, which may contain critical information for specific applications. For example, the trajectories in Figure 1a contain 350 track points, but only two frequent regions are grouped. Traditional methods can infer region 2 from region 1, but fail to predict where the courier is between the two regions. Many location-based services, such as delivery-pickup systems and transportation systems, pay significant attention to predicting exactly where the objects are along the road.
Algorithms 2017, 10, 37
Moreover, frequent patterns or location sequences generated by clustering methods are relatively short. Traditional models, such as the hidden Markov model (HMM) and the recurrent neural network (RNN), are good at handling short sequences. However, when considering points along the road, the location sequences generated from raw trajectories become longer, and the performance of traditional models may decline with long sequences. For example, as shown in Figure 2, the courier has two frequent patterns. He always goes to Neighborhood 1 from Restaurant 1 by the route in red, and goes to Neighborhood 2 from Restaurant 2 by the route in blue. One day he started from Restaurant 1 and has traveled to the blue spot. Traditional models that deal with short sequences may suggest that the courier started from Restaurant 2, because it is not far away from the courier. Then the model will predict that the next location of the courier is Neighborhood 2 according to the historical patterns. However, the courier is actually going to Neighborhood 1.
To overcome the above limitations, a spatial-temporal-semantic neural network algorithm for location prediction (STS-LSTM) is proposed in this paper. First, a spatial-temporal-semantic (STS) feature extraction algorithm is put forward to transform the whole trajectory into discrete location sequences with fixed codes that are friendly to the prediction model and that retain points along the road. Second, a long short-term memory (LSTM)-based neural network model is proposed. The location sequences are partitioned into multiple sequences of fixed length by a sliding window. The next locations are used as labels for classification. Then, both the historical and the current trajectories are used to train the model and to make predictions. The algorithm is evaluated on two real-world datasets and compared with several classic algorithms from the aspects of both feature extraction and model building. The main novelties of the proposed algorithm and contributions of this paper are listed as follows:

• Traditional clustering-based prediction algorithms ignore trajectories between the clusters, such as points along the road. The proposed STS feature extraction algorithm transforms the trajectory into location sequences with fixed and discrete road IDs. The method can take points along the road into account, which can meet the demand of specific applications. Moreover, the generated location sequences can be better used in the prediction model, which can achieve better prediction results.

• The location sequences generated by the STS feature extraction algorithm might be very long. Traditional sequential models such as HMM and RNN may not perform well with long sequences. The LSTM-based prediction model is proposed to solve this problem. The model can take advantage of location sequences over a long period of time, which can make better predictions. Evaluation results show that both the STS feature extraction algorithm and the LSTM-based prediction model outperform traditional methods.

Related Work
Studies on location prediction have gained increasing popularity. The most direct approach is to derive the next location from the speed and direction observed at past locations. This may not be an accurate method, because the next location is affected by various factors such as traffic conditions, weather, and the behaviors of the objects. In recent years, many location prediction methods have focused on mining historical track data. The movements of individuals may show regularity. For example, in Figure 1, couriers deliver the packages from one neighborhood to another, following the same routes and visiting orders every day. Buses travel along the same routes every day. The combination of historical movements and the latest few locations of the objects can provide accurate prediction results.
In addition to trajectory data, location prediction methods based on other spatial-temporal data, such as check-in data and event-triggered data, are also valuable as references. Related works in the area of location prediction are summarized in Table 1.

Table 1. Related works in the area of location prediction.

Matrix Factorization
Reference | Data | Approach
Xiong [2] | Event data | Tensor factorization
Zheng [3] | Trajectory | Spatial factorization
Bahadori [4] | Check-in data | Spatial-temporal tensor factorization
Zhuang [5] | Event data | Geo-location, time
Monereale [6] | Trajectory | Using crowds to infer personal behavior

The next location is highly related to the interests and intentions of an object. Recommendation algorithms are used to find the place the user is most likely to be interested in visiting. The matrix factorization-based method by Koren [1] factorizes a user-item rating matrix considering multiple features, including the geo-location. Xiong [2] extended it to tensor factorization (TF) to be time-aware, by treating time bins as another dimension when factorizing. Zheng [3] modeled spatial information in factorization models. Moreover, Bahadori [4] included both temporal and spatial aspects in TF as two separate dimensions and made location more predictable. Zhuang [5] proposed a recommender that leverages rich context signals from mobile devices, such as geo-location and time. However, it is hard for factorization-based models to generate movements that have never or seldom appeared in the training data. Monereale [6] proposed a hybrid method considering both the user's own data and crowds that have similar behaviors. Recommendation-based algorithms do not consider the current locations of users or the order of the movements, which may lead to low prediction precision.
Movement pattern mining techniques find the regularity in the movements of objects and combine current movements with historical data for prediction. By transforming trajectories into cells, Jiang [7] studied trajectories of taxis and found that they exhibit flight behaviors. Jeung [8] forecasted the future locations of a user by predefined motion functions and linear or nonlinear models, after which movement patterns are extracted by an Apriori algorithm. Yavas [9] utilized an improved Apriori algorithm to find association rules with support and confidence. These frequent patterns reveal the co-occurrences of locations. Morzy [10] developed a modified PrefixSpan algorithm to discover both the relevance of locations and the order of location sequences. Moreover, sequential pattern methods can be improved by adding temporal information. Giannotti [11] added travel time to location sequences and generated spatial-temporal patterns. Li [12,13] proposed two kinds of trajectory patterns: the periodic behavior pattern and the swarm pattern. Trajectories are first clustered into reference spots and a Fourier-based algorithm is utilized to detect the period. Then periodic patterns are mined by hierarchical clustering. The core step of pattern-based prediction methods is to cluster frequent places. However, due to the limitations of positioning devices, track points are lost when the satellite signal is weak. For example, when buses travel into tunnels or regions covered with tall buildings, the GPS signal is blocked and no points are collected during that time, which causes a data sparsity problem during clustering. As shown in Figure 1, the courier also frequently visited the place at the bottom-right section of the map, which is not clustered. Only two frequent areas are gathered, and trajectories between the frequent regions are abandoned. Clustering-based algorithms may lose a lot of information, leading to low coverage of prediction.
After extracting the frequent regions from raw trajectories, various kinds of models can be used to make predictions. Lathia [14] proposed a time-aware neighborhood-based model paying more attention to recent locations and less to the past. Cheng [15] proposed a multi-center Gaussian model to calculate the distance between patterns. However, neighborhood-based methods do not consider the sequential factors in users' behaviors. The Markov chain (MC) model can take sequential features into consideration. Rendle [16] extended the MC with factorization of the transition matrix and calculated the probability of each behavior based on the past behaviors. Mathew [17] proposed a hybrid hidden Markov model (HMM) to transform the trajectory into clusters and train the HMM with them. Jeung [18] transformed trajectories into frequent regions by a cell partition algorithm, treating them as hidden states and observable states of the HMM. The accuracy of the prediction depends highly on the granularity of the cells. For example, in Figure 3, partition by large cells is more precise and the clusters are close to the actual boundaries of the neighborhoods. However, partition by the smaller cells loses the geo-information of the neighborhoods.
Semantically-based prediction methods claim that many behaviors of users are semantically triggered. Alvares [19] discovered stops from trajectories and mapped these stops to semantic landmarks. Then a sequential pattern mining algorithm is used to obtain frequent semantic patterns. Bogorny [20] utilized a hierarchical method to obtain geographic and semantic features from trajectories. Ying [21] proposed a geographic-temporal-semantic pattern mining method for prediction, which also transforms the trajectory into stay points. Then trajectory patterns are clustered to build the frequent pattern tree for detecting future movements. Semantically-based methods mainly focus on mining the stay points. The trajectories along the road are abandoned.
Recently, recurrent neural networks (RNNs) have achieved breakthroughs in sequence mining. Mikolov [23] developed RNNs in word embedding for sentence modeling. Multiple hidden layers in an RNN can adjust dynamically with the input of behavioral history; therefore, an RNN is suitable for modeling temporal sequences. Liu [22] extended the traditional RNN with geographical and temporal contexts to handle the prediction problem for spatial-temporal data. An RNN performs well with short location sequences, such as clusters and stay points. However, when considering the whole trajectory, the location sequences become longer and the prediction precision of an RNN may decline.
In conclusion, in order to utilize machine learning models, the trajectory should be transformed to be model-friendly. Existing location prediction algorithms usually cluster the trajectory into cells, regions, or stay points. Track points along the road are abandoned, but these points can be important to specific applications. To solve this problem, a spatial-temporal-semantic neural network algorithm for location prediction is proposed in this paper. First, in order to transform the whole trajectory into location sequences friendly to the prediction model, STS feature extraction is utilized to map the trajectory to reference points while maintaining as much information as possible. Second, the LSTM model is built to handle the long sequences generated in the first step and to make further predictions.

Methodology
The core idea of location prediction algorithms is to model regular patterns hidden in historical data and find the most probable movements based on current observations. Due to the continuity of space and time, a trajectory is not suitable to be directly imported into a prediction model. As shown in Figure 4, the courier travels along the same road every day, but the trajectories are not exactly the same. Simply representing a location by continuous coordinates may lead to high computational cost and may not achieve a better prediction result. Before using prediction models, each of the points in a trajectory is first preprocessed in order to convert the real continuous values associated with the geospatial coordinates of latitude and longitude into discrete codes associated with specific regions. Existing methods use clustering-based algorithms to transform the trajectory into discrete cells, clusters, and stay points. However, the density of points on the road may be relatively low compared to frequent regions, making it hard for a clustering algorithm to identify them. As a result, these methods cannot deal with points along the road.
In Section 3.1, a spatial-temporal-semantic feature extraction algorithm is proposed to overcome these difficulties and to discretize the trajectory into location sequences, namely, $(loc_1, loc_2, \ldots)$. As shown in Equation (1), given an object $obj_i$ and the current time $t$, the problem of location prediction can be formulated as estimating the probability of the next location $loc_{t+1}$ based on the current locations $(loc_{t-k}, \ldots, loc_t)$:

$$P(loc_{t+1} \mid obj_i, (loc_{t-k}, \ldots, loc_t)) \tag{1}$$

Location prediction with discrete locations then resembles a classification problem. During the model building process, the location sequence $(loc_{t-k}, \ldots, loc_t)$ is used as the features and $loc_{t+1}$ is the label. However, the sequence $(loc_{t-k}, \ldots, loc_t)$ generated by the first step might be very long, and the performance of traditional models, such as HMM and RNN, may decline. In Section 3.2, an LSTM-based prediction model is proposed to handle long sequences. The flowchart of the STS-LSTM is illustrated in Figure 5.
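The construction of features and labels described above can be illustrated with a minimal Python sketch. It assumes that a trajectory has already been discretized into a list of integer location codes; the function name `sliding_window_samples`, the parameter `k`, and the sample road IDs are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def sliding_window_samples(
    loc_seq: Sequence[int], k: int
) -> List[Tuple[Tuple[int, ...], int]]:
    """Split one discrete location sequence into (features, label) pairs.

    Each sample uses the k + 1 consecutive locations (loc_{t-k}, ..., loc_t)
    as features and the following location loc_{t+1} as the label.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    window = k + 1
    samples = []
    # The last usable window must leave one location after it for the label.
    for start in range(len(loc_seq) - window):
        features = tuple(loc_seq[start:start + window])
        label = loc_seq[start + window]
        samples.append((features, label))
    return samples


# Hypothetical sequence of discrete location codes (for example, road IDs).
road_ids = [12, 12, 15, 15, 18, 21, 21]
for features, label in sliding_window_samples(road_ids, k=2):
    print(features, "->", label)
```

Each printed pair is one training example for a classifier over location codes. Sequences shorter than k + 2 locations yield no samples.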

Spatial-Temporal-Semantic Feature Extraction
A trajectory is composed of a series of track points, expressed as $T = \{p_i \mid i = 1, 2, 3, \ldots, N\}$, where $N$ is the number of track points. Each track point is composed of spatial information (longitude and latitude) and a time stamp, expressed as $p_i = (lon_i, lat_i, t_i)$, which is continuous in space and time. To discretize the trajectory, both spatial and temporal factors should be added into the model. The feature extraction method introduced in this section aims to transform the trajectory into fixed, discrete location sequences without losing much information.

Temporal Feature Matching
To transform continuous temporal information into discrete time, a proper time interval should be selected. Positioning devices such as the GPS module inside a mobile phone collect track points at a fixed sampling rate, usually one every second. However, due to the cost of network transmission and storage, not all collected points are uploaded to the server. For example, the locations of couriers are sent to the server every five seconds, while in taxi or bus applications they are sent every 30 s. The precision of location prediction is affected by the sampling rate. If the time interval between locations is five minutes, the model will predict the next location five minutes from the current one. The time interval is determined by the demand of specific services. Generally, the time interval should be larger than the average sampling interval.
Trajectories are usually stored and segmented by natural day. However, the recording of tracks may be interrupted because of the GPS blocking problem. First, each trajectory is divided into segments wherever the time interval between two consecutive track points is over 30 min. Intervals under 30 min are regarded as blocking, which does not affect the continuity of the trajectory. After segmentation, track points are allocated to time bins. The time of a day is divided into multiple time bins according to the size of the time bin. For example, if the size of the time bin is 15 min, there are four time bins in an hour and 96 time bins in a day, identified from 0 to 95. Next, the time of each track point is mapped to a time bin by Equation (2), where $t_i$ is the time of $p_i$, $t_{zero}$ is the zero time of a day, and Round() is the integral function. The temporal mapping process is illustrated in Figure 6. After mapping, there might be several points in the same time bin. A representative point is selected by calculating the linear center of the points, with the average longitude and latitude. Trajectory $T = \{p_1, p_2, \ldots \mid p_i = (lon_i, lat_i, t_i)\}$ is transformed into $T_{temporal} = \{p'_1, p'_2, \ldots \mid p'_i = (lon'_i, lat'_i, timebinID_i)\}$, where $p'_i$ is the representative point in each time bin. The sequence of representative points is sorted in ascending order of the time bin ID. After time matching, temporal information is fixed and discretized.
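The temporal matching steps can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: it assumes that Equation (2) has the form timebinID_i = Round((t_i - t_zero) / size), where size is the time bin length, and that Round() keeps the integer part so that bin IDs start at 0 at midnight. The function names, the tuple layout, and the sample coordinates are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

TrackPoint = Tuple[float, float, datetime]  # (lon, lat, t)


def split_on_gaps(
    points: List[TrackPoint], max_gap: timedelta = timedelta(minutes=30)
) -> List[List[TrackPoint]]:
    """Cut a time-sorted trajectory wherever consecutive points are more
    than max_gap apart (30 min in the text)."""
    segments: List[List[TrackPoint]] = []
    for p in points:
        if segments and p[2] - segments[-1][-1][2] <= max_gap:
            segments[-1].append(p)
        else:
            segments.append([p])
    return segments


def time_bin_id(t: datetime, bin_minutes: int = 15) -> int:
    """Bin index of t within its day, counted from midnight (t_zero).
    Assumption: the integer part of the quotient is taken."""
    t_zero = t.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((t - t_zero).total_seconds() // (bin_minutes * 60))


def representative_points(
    segment: List[TrackPoint], bin_minutes: int = 15
) -> List[Tuple[float, float, int]]:
    """Average the lon/lat of the points in each time bin and return
    (lon, lat, timebinID) tuples sorted by ascending bin ID."""
    bins: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for lon, lat, t in segment:
        bins[time_bin_id(t, bin_minutes)].append((lon, lat))
    result = []
    for bin_id in sorted(bins):
        pts = bins[bin_id]
        mean_lon = sum(p[0] for p in pts) / len(pts)
        mean_lat = sum(p[1] for p in pts) / len(pts)
        result.append((mean_lon, mean_lat, bin_id))
    return result


# Hypothetical track: three points in the morning, then a 70-minute gap.
day = datetime(2017, 1, 1)
track = [
    (116.0, 40.0, day + timedelta(hours=8)),
    (116.5, 40.5, day + timedelta(hours=8, minutes=5)),
    (117.0, 41.0, day + timedelta(hours=8, minutes=20)),
    (118.0, 42.0, day + timedelta(hours=9, minutes=30)),
]
for segment in split_on_gaps(track):
    print(representative_points(segment))
```

Averaging raw longitude and latitude is adequate for points a few minutes apart in one city; a different center would be needed for tracks that cross the antimeridian.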

Geographical Feature Matching
Same as time factor, the geographical information of trajectory such as coordinate is continuous in space, which is difficult to use in a model.Trajectory should be converted to fixed reference points.Existing methods utilizing cell-based partitioning, clustering, and stop point detection cannot handle points on the road.Therefore, a new geographical feature mapping method is proposed to transform all the points in a trajectory.Objects always travel along the road in the city, therefore, the city road networks are selected as reference points.Open Street Map (OSM) is a project that creates and distributes free geographic data for the world [24].Map data from OSM is in .osmformat.It contains all elements including points of interest, roads, and regions, as shown in Figure 7. ) , where ′ is the representative point in each time bin.The sequence of representative points is sorted in ascending order of the time bin ID.After time matching, temporal information is fixed and discretized.

Geographical Feature Matching
Same as time factor, the geographical information of trajectory such as coordinate is continuous in space, which is difficult to use in a model.Trajectory should be converted to fixed reference points.Existing methods utilizing cell-based partitioning, clustering, and stop point detection cannot handle points on the road.Therefore, a new geographical feature mapping method is proposed to transform all the points in a trajectory.Objects always travel along the road in the city, therefore, the city road networks are selected as reference points.Open Street Map (OSM) is a project that creates and distributes free geographic data for the world [24].Map data from OSM is in .format.It contains all elements including points of interest, roads, and regions, as shown in Figure 7.Each road can be represented as a line = ( , ), with a start node = ( , ) and an end node = ( , ).First, for each track point , find the nearest line by sorting the distance from to all lines in the network.The searching could be time consuming when the network is very large.However, with the indexing technology used in geo-database such as PostgreSQL, the process could be executed in milliseconds.The projection point = , of on is calculated by Equation ( 3), shown as the red points in Figure 8a, where is the slope of : Each road can be represented as a line l i = (n s , n e ), with a start node n s = (lon s , lat s ) and an end node n e = (lon e , lat e ).First, for each track point p i , find the nearest line l near−i by sorting the distance from p i to all lines in the network.The searching could be time consuming when the network is very large.However, with the indexing technology used in geo-database such as PostgreSQL, the process could be executed in milliseconds.The projection point p j = lon j , lat j of p i on l neari is calculated by Equation (3), shown as the red points in Figure 8a, where k is the slope of l neari : The projection point could be anywhere on the line, which is not 
discrete.Several fixed points on the line are set by the distance parameter Threshold d .According to the precision required by different applications, Threshold d could be 5 m, 10 m, or 15 m, smaller than the length of the road.The length of road l i is calculated by Equation ( 4), where ϕ 1 is the radian of lat s , ϕ 2 is the radian of lat e , ∆ϕ is the radian of lon s − lon e , and ∆φ is the radian of lat s − lat e .R is the radius of the earth, which depends on the mapping implementation, and a good choice for the radius is the mean earth radius, R ≈ 6371 km (for the WGS84 ellipsoid).Then, the line is divided into k = Ceil (L/Threshold d ) segments.The fixed points in line l i can be calculated by n s , n e , and k, as shown in Figure 8b.Finally, each projection point is mapped to the nearest fixed points, so do the original track points, as shown in Figure 8c.
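The nearest-road search and projection step can be illustrated with a small sketch. It is written under stated assumptions: the function and variable names are hypothetical, longitude and latitude are treated as planar coordinates (reasonable only for short roads), the projection is clamped to the road segment, and a brute-force loop stands in for the spatial index of a geo-database. The vector form used here gives the same point as a slope-based projection whenever the slope is finite, and it also handles roads with constant longitude.

```python
def project_onto_road(p, n_s, n_e):
    """Project point p onto the segment n_s -> n_e (all (lon, lat) pairs).

    Assumption: the result is clamped to the segment end points.
    """
    dx, dy = n_e[0] - n_s[0], n_e[1] - n_s[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return n_s  # degenerate road: start and end coincide
    t = ((p[0] - n_s[0]) * dx + (p[1] - n_s[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (n_s[0] + t * dx, n_s[1] + t * dy)


def nearest_road(p, roads):
    """Return (road_id, projection) for the road closest to p.

    roads maps a road ID to its (n_s, n_e) pair. Brute force over all roads.
    """
    best = None
    for road_id, (n_s, n_e) in roads.items():
        q = project_onto_road(p, n_s, n_e)
        d_sq = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
        if best is None or d_sq < best[0]:
            best = (d_sq, road_id, q)
    return best[1], best[2]
```

In a production setting the loop in `nearest_road` would typically be replaced by an indexed nearest-neighbour query in the database.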
After the mapping of temporal and geographic features, the original trajectory is transformed into $T_{ST} = \{l_i \mid l_i = (n_{s1}, n_{s2}, \ldots, n_e)\}$, represented by a series of lines $l_i$ in the road network, where each line consists of several fixed points $n_{si}$.
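The road-length computation, the placement of fixed points, and the final snapping step can be sketched as below. This is an illustrative sketch only: the names are hypothetical, the road length uses the haversine formula with the mean earth radius, and the fixed points are placed by linear interpolation in longitude and latitude, which is an approximation that is acceptable for short road segments.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(a[1]), math.radians(b[1])
    d_phi = phi2 - phi1
    d_lam = math.radians(b[0] - a[0])
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def fixed_points(n_s, n_e, threshold_m):
    """Divide the road into k = ceil(L / threshold) segments and
    return its k + 1 fixed points, from n_s to n_e."""
    k = max(1, math.ceil(haversine_m(n_s, n_e) / threshold_m))
    return [(n_s[0] + (n_e[0] - n_s[0]) * i / k,
             n_s[1] + (n_e[1] - n_s[1]) * i / k) for i in range(k + 1)]


def snap(point, candidates):
    """Index of the fixed point nearest to the given point."""
    return min(range(len(candidates)),
               key=lambda i: haversine_m(point, candidates[i]))
```

For a road of roughly 86 m and a threshold of 10 m, the road is split into nine segments, so ten fixed points are produced and each projection point is assigned to one of them.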

Semantic Feature Matching
Fixed points in the line should be represented in a way that the model can recognize. The OSM map data stores the semantic tags of each road in the network, as shown in Table 2. The unique ID of the road can be used to represent the line. The semantic tag of each fixed point is assigned the ID of its road, i.e., the tag of $n_{si}$ is $roadID_i$. After the spatial-temporal-semantic feature extraction, the trajectory is transformed into a location sequence represented by road IDs, $T_{STS} = \{loc_i \mid loc_i = (roadID_i, timebinID_i)\}$. A running example is shown in Figure 9, where a trajectory from a courier in Beijing, China is used to illustrate the process. The result of STS feature extraction is similar to the result of map matching algorithms [25,26]. However, map matching only focuses on projecting the track points to the nearest road, whereas STS transforms the trajectory into a location sequence that prediction models can recognize.

Model Description
Recurrent neural networks (RNNs) can handle sequential data and have become a hot spot in the field of machine learning [22,27]. They are networks with loops whose hidden layers change along with the position in a sequence, as shown in Figure 10a. This enables an RNN to learn from sequential data, and Liu [23] has utilized an RNN for location prediction with check-in data. Figure 10 shows the structure of the RNN. $x$ is the vector of values of the input layer, $s$ represents the value of the hidden layer, and $o$ is the value of the output layer. $U$ and $V$ are the weight matrices between $x$ and $s$, and between $s$ and $o$, respectively. $W$ is the weight matrix between the value of the hidden layer at time $t - 1$ and at time $t$.
The RNN can be expressed as Equations (5) and (6), where $g$ and $f$ are the activation functions:
$$o_t = g(V s_t) \quad (5)$$
$$s_t = f(U x_t + W s_{t-1}) \quad (6)$$
The output of the RNN, namely $o_t$, is affected by the inputs $x_t, x_{t-1}, \ldots, x_1$. This is why the RNN can take advantage of historical sequential data. Moreover, the parameters of the RNN can be learned with the back propagation through time (BPTT) algorithm [28]. The error term $\delta_k^T$ in the weight gradient computation has the property shown in Equation (7), where $t - k$ is the time interval between the current time and the historical time. The other parameters are introduced in [29]. When $t - k$ is large (locations visited a long time ago), the value of $\delta_k^T$ grows or shrinks very quickly (depending on whether $\beta$ is greater or less than 1). This may cause the learning problems described in [29].
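The recurrence in Equations (5) and (6) can be made concrete with a short forward pass. This is a sketch under assumptions: the names are hypothetical, $f$ is taken to be tanh and $g$ to be softmax (common choices that the text does not fix), and plain Python lists are used instead of a tensor library.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_forward(xs, U, W, V, s0):
    """Apply s_t = f(U x_t + W s_{t-1}) and o_t = g(V s_t) over a sequence.

    Assumption: f = tanh and g = softmax.
    Returns the list of outputs o_t and the final hidden state.
    """
    s, outputs = s0, []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(z) for z in pre]
        logits = matvec(V, s)
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        outputs.append([e / total for e in exps])
    return outputs, s
```

Because the hidden state $s_{t-1}$ feeds into $s_t$, two sequences that end with the same input but differ earlier produce different final outputs, which is the property the text relies on.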
However, location sequences generated by the STS feature extraction could be very long, and the RNN may not perform well with long sequences. Long short-term memory networks (LSTMs) are a special kind of RNN, capable of learning information over long periods of time [30]. Instead of the simple tanh hidden layer in the RNN, the LSTM has a more complicated repeating module, as shown in Figure 11. The LSTM uses gates to control the cell states passed on from a long time ago. A gate is a fully connected layer expressed as $g(x) = \sigma(Wx + b)$, where $W$ is the weight matrix and $b$ is the bias. $\sigma$ is the sigmoid function, so the output of the gate lies between 0 and 1.
LSTM utilizes two gates to control the cell state. The forget gate decides how much information of the last cell state $c_{t-1}$ is kept in the current cell state $c_t$. The other is the input gate, which decides how much information of the current input $x_t$ is kept in the cell state $c_t$. LSTM uses the output gate to control how much information of the cell state $c_t$ is output to $h_t$. The expression of the forget gate is shown in Equation (8), where $W_f$ is the weight matrix and $b_f$ is the bias of the forget gate. Similar calculations are used in the input gate and the output gate, as shown in Equations (9) and (12). The candidate cell state of the current input, $\tilde{c}_t$, is calculated from the last output and the current input, as shown in Equation (10). As shown in Figure 11, the current cell state $c_t$ is calculated by Equation (11). The output of the LSTM, $h_t$, as shown in Equation (13), is decided by $o_t$ and $c_t$:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (8)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (9)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \quad (10)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \quad (11)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (12)$$
$$h_t = o_t \circ \tanh(c_t) \quad (13)$$
The training and parameter derivation process can be found in [30]. In conclusion, the conveyor belt-like structure allows the LSTM to remove or add information from the very beginning of the sequence up to the current state. Therefore, the LSTM is expected to perform well with long sequences.
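A single LSTM step built from the forget, input, and output gates described above can be sketched in a few lines. This is a minimal illustration in the standard LSTM formulation, not the authors' code: the names are hypothetical, each gate acts on the concatenation of the previous output and the current input, and plain Python lists stand in for tensors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step.

    params maps 'f', 'i', 'c', 'o' to a (W, b) pair; each W acts on the
    concatenated vector [h_prev, x_t]. Returns (h_t, c_t).
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]      # forget gate
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]      # input gate
    c_hat = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate state
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]      # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through the step almost unchanged, which is how information from early in a long sequence can survive.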

Model Training and Predicting
The training process. After the STS feature extraction, trajectories are transformed into fixed and discrete sequences of road IDs, expressed as $LocSeq = \{roadID_i \mid i = 1, 2, 3, \ldots\}$. However, the sequences cannot be fed directly into the LSTM model. Each sequence should be divided into segments of fixed length. The length of each segment is determined by the specific application. For example, in the bus system, traces of the past 15 min are used to predict the next location. If 30 locations are collected in 15 min, the length is 30. The length of a location segment is defined as maxLength. Each location sequence is scanned by a sliding window with a fixed width equal to maxLength. The window moves forward by one location at a time until it reaches the end of the sequence. Locations in the window are gathered as training features, and the next location outside the window is used as the label. This process is called feature engineering and is illustrated in Figure 12. Then, each location segment with its features and label is used to train the LSTM model.
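The sliding-window step described above can be sketched as follows. The function name is hypothetical, and the road IDs in the usage note are made-up values used only for illustration.

```python
def sliding_windows(loc_seq, max_length):
    """Scan a location sequence with a window of width max_length.

    Each window becomes one feature vector, and the location immediately
    after the window becomes its label. Sequences that are not longer
    than max_length yield no samples.
    """
    features, labels = [], []
    for start in range(len(loc_seq) - max_length):
        features.append(loc_seq[start:start + max_length])
        labels.append(loc_seq[start + max_length])
    return features, labels
```

For the sequence `[10, 11, 12, 13, 14]` and `max_length = 3`, the windows are `[10, 11, 12]` and `[11, 12, 13]`, with labels `13` and `14`.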

The predicting process. When a new track point is collected, a prediction is not made until all points in its time bin are gathered. Then, a representative point is selected from each time bin by the time matching process. Each representative point is mapped to the nearest road in the city road network. The mapped point is then assigned to the nearest fixed road point along the road. The fixed road points are predefined with the distance threshold, which depends on the application. Each fixed road point has a unique ID. Thus, the new trajectory is transformed into a location sequence represented by road IDs. Then, those IDs are used as the input of the LSTM model for prediction. The predicting process is shown in Figure 13. The model is re-trained each day during idle time with the trajectories collected on that day, in order to learn new movement patterns. The whole training and predicting processes are shown in Figure 14.

Experiments
This section first describes the two datasets prepared for evaluation and several evaluation metrics, then introduces the algorithms used for comparison, including a feature extraction method and three models. Finally, the experiments are conducted and the results are shown.

Datasets
Two real-world trajectory datasets are used to evaluate the proposed method. The courier dataset is a self-collected dataset from a courier of a food delivery platform in Beijing, China. By installing an application that collects track points on the courier's mobile phone, the tracks of the courier on 66 working days from 10 November 2016 to 19 January 2017 were recorded. One of the major concerns of the food delivery system is where the courier will go, so as to estimate the travel time to the customer. The courier is in charge of a certain area, so there exists a regular route. As shown in Figure 15a, the tracks in the red rectangle have high density, which means the courier frequently travels along these routes. Figure 15b shows the trajectories of the courier in the red rectangle on three working days.

The MTA Bus dataset is provided by the Metropolitan Transportation Authority (MTA) and is available online [31]. The dataset contains trajectories of 5725 buses from 315 bus routes across the New York region, from 1 August to 31 October 2014. The 90-day data contains a very large number of track points in total, and the average number of points each day is nearly 4.2 million. Trajectories of a bus from route BX36 are selected. As shown in Figure 16a, the red regions in the heat map represent the frequent routes of the bus. There are several frequent routes over a long period of time because the bus may change its route due to rescheduling [29]. However, the movements of the bus on each route over a short period of time are regular. The trajectories in the blue rectangle in the middle of Figure 16a are frequent, as shown in Figure 16b. Trajectories at the top of Figure 16a are frequent, as shown in Figure 16c.
Table 3 summarizes the characteristics of the two representative datasets. The courier and the MTA bus datasets contain 198 and 149 trajectories, respectively, and the time spans are 66 and 90 days. The courier dataset contains 1746 points per day on average and the sampling rate is 0.5 Hz. The sampling rate of the bus dataset is 0.03 Hz, which leads to a smaller number of points per day. The average daily travel distance of the bus is 58 km, far more than that of the courier. The bus has longer routes than the courier, so the average distance between two points is 23 times that of the courier. Parameters such as the time bin size and the distance threshold should be set according to the characteristics of the datasets.
Trajectories consist of track points that are continuous in space and time. Before using prediction models, each of the points should be transformed into a discrete code associated with a specific region, so that the models can learn more efficiently and the prediction results can be more precise. Traditional cell-based methods such as Geohash convert points into grids. The proposed STS feature extraction uses a more accurate approach that maps each track point to a fixed point on the nearest road. Both methods cause approximation errors between the original points and the representative points. The average spherical distance between the original points and the representative points is called the deviation distance, as shown in Figure 17. The deviation distance of Geohash is the average distance between each track point and the central point of its grid. The deviation distance of STS is the average distance between each track point and the fixed point on the nearest road. The deviation distance is used to evaluate the performance of the STS feature extraction method.
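The deviation distance defined above is a mean of point-to-point spherical distances, which can be sketched as follows. The function names are hypothetical, points are assumed to be (lon, lat) pairs in degrees, and the haversine formula with the mean earth radius is assumed for the spherical distance.

```python
import math


def haversine_m(a, b, radius_m=6371000.0):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(a[1]), math.radians(b[1])
    d_phi = phi2 - phi1
    d_lam = math.radians(b[0] - a[0])
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(h))


def deviation_distance(original, representative):
    """Mean spherical distance between each original track point and the
    representative point (grid centre or fixed road point) it was mapped to."""
    if len(original) != len(representative) or not original:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(haversine_m(p, q) for p, q in zip(original, representative))
    return total / len(original)
```

The same function applies to both methods under comparison: only the list of representative points changes.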
In order to measure the performance of different methods, several evaluation metrics are used, including precision, recall, and F-score. The problem of location prediction can be defined as a classification problem. The prediction model predicts the location ID with the highest probability among all location IDs that the object may visit. For classification problems, the samples can be divided into four categories, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which are shown in Table 4. For a given location, the precision is the fraction of predictions of that location that are correct, and the recall is the fraction of true occurrences of that location that are correctly predicted, as shown in Equations (14) and (15):
$$Precision = \frac{TP}{TP + FP} \quad (14)$$
$$Recall = \frac{TP}{TP + FN} \quad (15)$$
For example, suppose the historical movements of the object show that there are a total of five locations (IDs 1 to 5). Four location sequences are to be predicted and the true next locations are (1, 1, 3, 5). The result given by the model is (1, 2, 1, 5). For location 1, TP = 1, FP = 1, TN = 1, and FN = 1. Thus, the precision and recall are both 0.5. Usually, the average precision and recall over all locations are calculated. Moreover, in order to measure the overall performance of the model, the F-score is considered, which combines the precision and the recall, as shown in Equation (16):
$$F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \quad (16)$$
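The worked example above (true next locations (1, 1, 3, 5), predicted (1, 2, 1, 5)) can be checked with a short sketch. The function name is hypothetical, the F-score is assumed to be the usual harmonic mean of precision and recall, and a score is reported as 0.0 when its denominator is zero, which is a convention chosen here.

```python
def per_location_scores(truth, predicted, location):
    """Precision, recall, and F-score for one location ID,
    treating that location as the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == location and p == location)
    fp = sum(1 for t, p in pairs if t != location and p == location)
    fn = sum(1 for t, p in pairs if t == location and p != location)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Calling `per_location_scores([1, 1, 3, 5], [1, 2, 1, 5], 1)` gives a precision and recall of 0.5 each, matching the example in the text.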

Selection of Compared Algorithms
Feature extraction methods. Partitioning trajectories into self-defined cells or grids is commonly used [8,17,18,21]. However, cell-based methods are easily affected by the granularity of the grid, and it is difficult to draw a grid over the global area. Geohash is a geocoding system invented by Gustavo Niemeyer [32]. It is a hierarchical spatial data structure that subdivides space into grid-shaped buckets and is used as an indexing algorithm by many geospatial databases, such as Elasticsearch and MongoDB. For example, a part of a trajectory ((−25.383, −49.266), (−25.427, −49.315)) can be encoded as (6gkzwgjz, 6gkzmg1w) by Geohash. Similar to cell-based methods, Geohash extracts the geographic feature of a trajectory but may suffer from granularity problems.
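To show how Geohash turns a coordinate into a grid code, the standard encoding can be sketched as below: longitude and latitude intervals are bisected alternately, starting with longitude, and every five bits are written as one base-32 character. The function name is hypothetical and the sketch covers encoding only.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair (degrees) as a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = value * 2 + 1, mid
            else:
                value, lon_hi = value * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = value * 2 + 1, mid
            else:
                value, lat_hi = value * 2, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

Each additional character shrinks the cell, so the precision argument plays the role of the grid granularity: all points inside one cell share the same code, which is the source of the granularity problem mentioned above.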
Prediction models. After feature extraction, the location sequences are fed into different models. Among all of the models described in Table 1, supervised learning models are suitable for fitting the location sequences generated by the feature extraction method. The current locations are used as features and the next location is the label, so location prediction can be regarded as a classification problem. The hidden Markov model (HMM) is widely used in location prediction [16][17][18] and achieves good prediction results. Moreover, XGBoost [33] is a popular classification and regression model. In many machine learning problems and data science challenges, XGBoost often surpasses other traditional classification models. Recently, the recurrent neural network (RNN) has been employed for mining location sequences [23] and has shown promising results. Therefore, HMM, XGBoost, and RNN are used as baselines for comparison with the proposed method. Brief introductions of HMM and XGBoost are provided below. Details of the RNN have been introduced in Section 3.2.1.
The HMM assumes that the sequences are generated by a Markov process with unobserved states [34]. Each state has a probability distribution over the possible locations to be visited, and a probability distribution over the possible transitions to other states. The probability distribution of the hidden variable at time $t$ depends only on the value of the hidden variable at time $t - 1$. This is called the Markov property. The prediction task is to compute the probability of a particular output sequence being observed. The probability of observing a particular sequence of the form $Y = \langle y(1), y(2), \ldots, y(L) \rangle$, where $y(t)$ is the location visited at time $t$, is given by
$$P(Y) = \sum_{X} P(Y \mid X)\, P(X) \quad (17)$$
In Equation (17), the sum runs over all possible hidden-node sequences $X = \langle x(1), x(2), \ldots, x(L) \rangle$, where $x(t)$ is the hidden state at time $t$. This probability can be computed efficiently with the forward algorithm, and the model parameters can be estimated with the Baum-Welch algorithm. For more information about the derivation of the parameters in HMM, please refer to the tutorial by Rabiner [34].
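The sum over all hidden-state sequences grows exponentially with the sequence length, so in practice it is evaluated with the forward algorithm. The sketch below shows both the forward recursion and the direct sum, so that the two can be compared on a small model. The function names are hypothetical, and the observation sequence is assumed to be non-empty with observations given as integer indices.

```python
from itertools import product


def sequence_probability(obs, start, trans, emit):
    """P(Y) by the forward algorithm.

    start[i] is the initial probability of state i, trans[i][j] the
    transition probability from i to j, and emit[i][y] the probability
    that state i emits observation y.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][y]
                 for j in range(n)]
    return sum(alpha)


def brute_force_probability(obs, start, trans, emit):
    """P(Y) by summing the joint probability over every hidden sequence."""
    n, total = len(start), 0.0
    for states in product(range(n), repeat=len(obs)):
        p = start[states[0]] * emit[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
        total += p
    return total
```

The forward recursion costs on the order of $L \cdot N^2$ operations for $N$ hidden states, whereas the direct sum visits $N^L$ sequences.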
XGBoost [33] is an ensemble model based on gradient boosting decision trees (GBDT) [35] and is designed to be highly efficient, flexible, and portable. XGBoost is used for supervised learning problems, where training data with multiple features $x_i$ (the current location sequence) are used to predict a target variable $y_i$ (the next location). For tree ensemble learning, the model is defined as Equation (18),

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F, \qquad (18)$$

where K is the number of trees, $f_k$ is a function in the functional space F, and F is the set of all possible classification and regression trees (CARTs). The objective function can be written as Equation (19),

$$\mathrm{obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad (19)$$

where the first part, with loss function $l$, is the training loss and the second part, with $\Omega$ measuring the complexity of each tree, functions as the regularization. The training method and the parameter derivation process can be found in [33].
Table 5 summarizes the experimental settings. Both the courier dataset and the MTA bus dataset are split into a training set (90% of the data) and a testing set (10% of the data). The courier dataset contains 60 days of data for training and six days for testing. The MTA bus dataset contains 81 days of data for training and nine days for testing. The number of testing samples meets the demand of real-life applications.

Evaluation Based on Internal Comparison
In the STS feature extraction, the time bin size is used to allocate track points into time bins, and a representative point is selected from each time bin. Each representative point is then mapped to a fixed point on the road according to the distance threshold. As mentioned before, these two parameters are determined by the demands of the application. If the trajectories are sampled at a higher rate, smaller time bin sizes and distance thresholds can be selected to achieve more accurate prediction. The sampling interval is 2 s for the courier dataset and 30 s for the MTA bus dataset. Therefore, for the courier dataset, the time bin size is set to 4 s, 8 s, and 16 s, and the distance threshold is set to 5 m, 10 m, and 15 m. For the bus dataset, the time bin size is set to 30 s, 60 s, and 90 s. The average distance between two sampled points is 82.2 m, so the distance threshold for the bus dataset is set to 50 m, 100 m, and 150 m. The features are then fed to the same LSTM model, with fixed maxLength = 40. The resulting prediction precisions are recorded in Table 6 (the courier dataset) and Table 7 (the MTA bus dataset).
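To make the roles of the two parameters concrete, the sketch below bins a toy track by time and then snaps each representative point to fixed points spaced one distance threshold apart. It is a simplified illustration, not the authors' implementation: positions are reduced to a non-negative offset in metres along a single road, the representative point is assumed to be the first point in each bin, and all function names are hypothetical.

```python
from typing import List, Tuple

# (timestamp in seconds, offset along the road in metres); a simplification
TrackPoint = Tuple[float, float]


def representative_points(track: List[TrackPoint], bin_size: float) -> List[TrackPoint]:
    """Keep the first track point that falls into each time bin (assumed rule)."""
    reps = {}
    for t, offset in track:
        bin_index = int(t // bin_size)
        reps.setdefault(bin_index, (t, offset))
    return [reps[k] for k in sorted(reps)]


def snap_to_fixed_points(points: List[TrackPoint], threshold: float) -> List[int]:
    """Map each offset to the index of the nearest fixed point.

    Fixed points are assumed to lie every `threshold` metres along the road,
    and offsets are assumed to be non-negative.
    """
    return [int(offset / threshold + 0.5) for _, offset in points]


# Toy track sampled every 2 s, binned at 4 s and snapped at 5 m.
track = [(0, 0.0), (2, 3.0), (4, 7.0), (6, 12.0), (8, 14.0), (10, 21.0)]
reps = representative_points(track, bin_size=4)
locations = snap_to_fixed_points(reps, threshold=5)
```

A larger time bin keeps fewer representative points, and a larger threshold merges more of them into the same fixed point, which is why both parameters trade sequence detail against sequence length.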

Evaluation Based on External Comparison
In this section, the STS feature extraction method is compared with Geohash, and the LSTM-based model is compared with several classic models. First, the Geohash partitioning algorithm is implemented to transform trajectories into cells. The optimal Geohash parameter is set to 8, indicating that the side length of a cell is 19 m. The time bin size and the distance threshold are set to 16 s and 10 m for the courier dataset, and to 30 s and 50 m for the MTA bus dataset. The maxLength of the location sequence is set to 20, and the other parameters of the model are fixed during the comparison. The result is shown in Table 8.
Furthermore, to evaluate the performance of the LSTM model, several classic models are used for comparison. The HMM model is implemented in Python. Optimal parameters are set as follows: the cost used in the forward part of the forward-backward algorithm is set to 0.1, learningRate = 0.01, and maxIteration = 100. The XGBoost method is implemented in Python using the "XGBoost" package. The tuning parameters in XGBoost include numRound, eta, maxDepth, scalePosWeight, and minChildWeight. Five-fold cross-validation is used to tune the parameters. The parameter tuning method is described in detail in [36]. The optimal parameters are set as follows: numRound = 500, eta = 0.05, maxDepth = 6, scalePosWeight = 5, and minChildWeight = 1.
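For readers reproducing the XGBoost baseline, the reported settings might be written as a parameter dictionary like the one below. The mapping from the names used in the text (numRound, maxDepth, and so on) to the snake_case parameter names of the XGBoost Python package is an assumption made here; the values are the ones stated above.

```python
# Reported settings, expressed with the XGBoost Python package's parameter names.
# The name mapping (e.g. maxDepth -> max_depth) is an assumption of this sketch.
xgb_params = {
    "eta": 0.05,              # eta (learning rate)
    "max_depth": 6,           # maxDepth
    "scale_pos_weight": 5,    # scalePosWeight
    "min_child_weight": 1,    # minChildWeight
}
num_boost_round = 500         # numRound

# With the package installed, these would typically be passed as
# xgboost.train(xgb_params, dtrain, num_boost_round=num_boost_round),
# where dtrain is a DMatrix built from the location sequences.
```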
RNN and LSTM are implemented in Python using the "Keras" package. As introduced in Section 3.2, RNN and LSTM share the idea of a hidden loop layer, but LSTM is more complicated. Their parameter settings are similar. Both use tanh as the activation function and a learning rate of 0.05. To avoid over-fitting, the dropout of each layer is set to 0.2. Categorical cross-entropy is selected as the loss function and the RMSprop algorithm is used to update the weights. The batch size, which defines the number of samples propagated through the network at a time, is set to 100. The models are trained for 20 epochs. The experiments were executed on a macOS Sierra platform with a 2.8 GHz i7 CPU and 4 GB RAM.
Evaluation based on performance. The time bin size and the distance threshold are set to 16 s and 10 m for the courier dataset, and to 30 s and 50 m for the MTA bus dataset. The same location sequences from the STS feature extraction with maxLength = 20 are used as the input. After tuning the optimal parameters for each model, the training data is imported, and the prediction results of the different models are listed in Table 9.
Evaluation based on the length of each sequence. The main difference between RNN and LSTM is the ability to learn long sequences. In the following experiment, all other parameters and the training data are kept the same, except for maxLength. maxLength is varied from 10 to 50, and the precisions of RNN and LSTM on the two datasets are shown in Figure 18. The horizontal axis represents the length of each sequence and the vertical axis is the prediction precision.
Evaluation based on the size of datasets. Experiments are executed to evaluate the influence of data size on the performance of the LSTM-based model. Two datasets are used, and the training data are sampled randomly at 20%, 40%, 60%, 80%, and 100% of the original size. The testing data remain the same. The numbers of samples are listed in Column 1, and the results are shown in Table 10.

Evaluation Based on Visualization
The proposed algorithm is designed to solve practical problems in applications such as food delivery or bus systems. One basic need of these services is to visualize the predicted location on the map. In this section, the result of the proposed algorithm is drawn with the QGIS software, together with the map data stored in PostgreSQL. A trajectory from the courier dataset that was never used in training is used as test data, shown as the blue line in Figure 19a. During the STS feature extraction, the time bin size is set to 2 s, and the distance threshold is set to the maximum length of each road for better visualization. Points mapped to the same road are represented by one road ID, and the location is visualized by the road segment. The location sequence after the feature extraction is (391182785, 391187049, 162362370, 163825162, 121991160, 121657041, 122100440, 177353870, 22799950, 177353867, 22799950, 177353867), represented by the orange lines with road IDs in Figure 19a. For a more concise illustration, the fixed road IDs in the location sequence are replaced by (ID0, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID8, ID9).
The maxLength of the LSTM model is set to 5. Suppose that the current location and the latest four locations are collected and used as test data, as shown in Figure 19b. The rest of the sequence is used as the ground truth. The location predicted by the LSTM model for (ID0, ID1, ID2, ID3, ID4) is ID5. The distance between the actual point and the prediction, measured by the deviation distance, is 11.6 m. Then ID5 is appended to the location sequence and the new sequence (ID1, ID2, ID3, ID4, ID5) is fed to the model. The output of the model is 11.5 m away from the actual point. The process is repeated and the results are listed in Table 11. All seven locations in the ground truth are successfully predicted because the deviation distance is relatively small. The seven predicted locations are drawn step by step with different colors in Figure 19c. The LSTM model continues to predict even after the raw trajectory has ended. The next five locations are shown in Figure 19d.
There exists an overall trend that the precision, recall, and F-score on the courier dataset are higher than those on the bus dataset (Tables 6 and 7). From the comparison between the two datasets, as shown in Table 3, the courier data is more precise because its average sampling interval and average distance between two track points are much smaller than those of the bus data. Furthermore, predictions with a smaller time bin size are better. With a smaller time bin, more locations are extracted from raw trajectories, so the model will learn more and offer better predictions.
... location where the courier is right now. Moreover, the prediction result can be used to estimate the arrival time. For example, location sequences of bus 296 can be generated based on its current location. The time interval between locations is 30 s, and the arrival time at the next stop can be calculated, as shown in Figure 20b.
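The step-by-step prediction used in the visualization example (predict from the last maxLength locations, append the prediction, slide the window) and the arrival-time estimate can be sketched as follows. This is an illustrative sketch only: `toy_predictor` is a hypothetical stand-in for the trained LSTM model, and `eta_seconds` assumes that every predicted step corresponds to one fixed time interval, such as the 30 s of the bus dataset.

```python
from typing import Callable, List, Optional, Sequence


def rollout(
    history: Sequence[int],
    predict_next: Callable[[List[int]], int],
    max_length: int,
    steps: int,
) -> List[int]:
    """Predict `steps` future locations, feeding each prediction back in."""
    window = list(history[-max_length:])
    predictions = []
    for _ in range(steps):
        next_location = predict_next(window)
        predictions.append(next_location)
        # Slide the window: drop the oldest location, keep the newest ones.
        window = (window + [next_location])[-max_length:]
    return predictions


def eta_seconds(
    predictions: Sequence[int], target: int, step_seconds: float
) -> Optional[float]:
    """Time until `target` first appears, assuming one step per interval."""
    if target not in predictions:
        return None
    return (list(predictions).index(target) + 1) * step_seconds


def toy_predictor(window: List[int]) -> int:
    """Hypothetical stand-in for the trained model: next ID = last ID + 1."""
    return window[-1] + 1


predicted = rollout([0, 1, 2, 3, 4], toy_predictor, max_length=5, steps=3)
arrival = eta_seconds(predicted, target=7, step_seconds=30)
```

Because each prediction is fed back as input, errors can accumulate over long rollouts, so estimates for distant stops should be treated with more caution than those for the next location.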

Conclusions
In this paper, a spatial-temporal-semantic neural network algorithm for location prediction is proposed. The algorithm consists of two steps: the spatial-temporal-semantic (STS) feature extraction and the LSTM-based model building. Traditional methods based on cell partitioning or clustering ignore a large amount of information about points along the road, and the STS feature extraction algorithm is proposed to solve this problem. The method transforms the trajectory into location sequences by mapping track points to fixed and discrete points in the road networks. The STS can take advantage of points along the road, which meets the demand of services such as delivery systems and transportation systems. After feature extraction, trajectories are transformed into long location sequences. Traditional models cannot perform well with long sequential data; therefore, an LSTM-based model is proposed to overcome the difficulty. The model can better deal with long location sequences to gain better prediction results. Experiments on two real-world datasets show that the proposed STS feature extraction algorithm has a lower deviation distance. The location sequences generated by the STS can improve the prediction precision by 27%. Moreover, the proposed LSTM-based model can improve the precision by 25% compared with traditional classification models. The STS-LSTM algorithm has stable and better performance on both datasets.
There are several potential extensions of this paper. To solve practical problems, the algorithm should be robust to trajectories of poor quality. More data preprocessing work, such as data compression, noise filtering, and data filling, could be done in the future. Moreover, the LSTM-based model could be further modified to accept more features and dimensions, such as temporal features like month, week, and day, and semantic features like stay points. Finally, LSTM could be combined with or integrated into other classification models, or replaced by a similar but simpler structure called the gated recurrent unit (GRU) [37].

Figure 1. Trajectories of a courier on consecutive working days in Hangzhou, China, which exhibit regularity. (a) Track on 12 August 2016; (b) Track on 13 August 2016; (c) Track on 14 August 2016.

Figure 2. The frequent patterns of the courier.


Figure 3. The accuracy of the prediction is influenced by the granularity of the cell partition. Track points are from a courier in Hangzhou, China. (a) Partitioning into large cells and smaller cells; (b) The actual boundaries of the neighborhoods.


Figure 4. The necessity of transforming trajectories into discrete and fixed locations.


Figure 6. Process of the mapping of time bins.

Figure 7. The road network data of Beijing, used as reference points in the geo-feature mapping.


Figure 8. Process of the geographic feature mapping. (a) Projection of each track point to the nearest line; (b) calculated fixed points on the line by the distance threshold Threshold_d; (c) mapping track points to the nearest fixed points.

Figure 9. A running example of the spatial-temporal-semantic feature extraction. (a) The blue dots are the original trajectory. The red dots are the track points after temporal matching. (b) The red dots are the location sequences after the geographical and semantic feature matching.

Figure 10. The structure of the RNN.


Figure 11. The structure of the repeating module in the LSTM.


Figure 12. The process of feature engineering.


Figure 13. The process of feature engineering.

Figure 14. The processes of the model training and location predicting.


Figure 15. Visualization of the trajectories of a courier from the courier dataset. (a) The heat map of all the tracks in the courier dataset. (b) Trajectories from 10 November to 12 November.

Figure 16. Visualization of the MTA bus dataset. Trajectories of the bus with ID 296 are selected. (a) The heat map of all the tracks in the MTA bus dataset. (b) Trajectories on 2 and 4 August. (c) Trajectories on 12 and 15 August.


Figure 17. The calculation of deviation distance. (a) The deviation distance of the Geohash. (b) The deviation distance of the STS feature extraction.


4.1.4. Experimental Settings
The evaluation consists of internal comparison and external comparison. The internal comparison focuses on the parameter settings in the STS feature extraction. The external comparison aims to compare the performance with different feature selection methods and model building methods.

Figure 18. Comparison between RNN and LSTM in the ability to handle long sequences. (a) Results on the courier dataset; (b) Results on the MTA bus dataset.


Figure 19. Visualization of the prediction results. (a) Blue line represents the raw trajectory, and the orange line is the location sequence after STS feature extraction. (b) Current locations of the courier. (c) The next five predicted locations by the algorithm in five iterations, marked as different colors. (d) Further predictions that exceed the original trajectory.

4.5. Discussion
Performance of the STS feature extraction. During the experiments, the internal parameters of the STS feature extraction, including the time bin size and the distance threshold, are adjusted to observe the variation of the prediction precision. The results are shown in Tables 6 and 7.


Algorithms 2017 ,Figure 20 .
Figure 20. Application scenarios of the STS-LSTM algorithms. (a) In a food delivery system, given the last location (the red circle) and the current location (the blue point), the next location can be predicted (the red point on the left). (b) In a bus system, the location predicted can be used to estimate the arrival time of each bus stop.

Table 1. Summary of location prediction methods.

Table 2. Semantic tags of the road networks.


Table 3. Statistics of two trajectory datasets.


Table 4. Four categories of classification problems.

Table 5. Summary of experimental settings.

Table 6. The evaluation results of prediction with different parameter settings of the courier dataset.

Table 7. The results of prediction with different parameter settings of the MTA bus dataset.

Table 8. The precision of prediction between Geohash and STS feature extraction on two datasets.

Table 9. The performance of different models on two datasets.


Table 10. The influence of data quantity on the performance of the LSTM models on two datasets.

Table 11. Running example of the input and output of the prediction model.