Identifying Travel Mode with GPS Data Using Support Vector Machines and Genetic Algorithm

Abstract: Travel mode identification is one of the essential steps in travel information detection with Global Positioning System (GPS) survey data. This paper presents a Support Vector Classification (SVC) model for travel mode identification with GPS data. A genetic algorithm (GA) is employed to optimize the parameters of the model. The travel modes of walking, bicycle, subway, bus, and car are recognized by the model. The results indicate that the developed model shows a high level of accuracy for mode identification. The estimation results also show GA’s contribution to the optimization of the model. The findings can be used to identify travel mode based on GPS survey data, which will significantly enhance the efficiency and accuracy of travel surveys and data processing. By providing crucial trip information, the results also contribute to the modeling and analysis of travel behavior and are readily applicable to a wide range of transportation practices.


Introduction
Travel surveys are one of the most important ways of obtaining the critical information needed for transportation planning and decision making. Traditionally, travel surveys were conducted using methods such as telephone or face-to-face interviews and computer-based reporting to maintain a diary [1]. These methods have proven burdensome for participants, as well as expensive and time consuming [2]. In addition, respondents in these surveys often omit short trips and report travel mode and travel time only approximately. A GPS-based travel survey can address these problems by recording travelers' GPS traces automatically. With the advantages of reducing respondent burden and increasing data accuracy, GPS-based travel surveys have been conducted in many large cities, such as Beijing and New York. Nevertheless, raw GPS records do not provide trip information that can be applied directly when analyzing and modeling travel behavior. To obtain the needed trip information, several major modeling steps, e.g., trip identification, mode detection, and travel time determination, have to be carried out on the raw GPS data. This paper presents a model for travel mode detection, which is one of the crucial steps in trip information identification. Compared with previous studies, it aims to enhance identification accuracy by employing SVC and by using more GPS records.
The remainder of this paper is organized as follows. Section 2 reviews the identification of travel mode with GPS data in general. Section 3 describes the data available for the study. Section 4 presents SVC and GA. Section 5 presents the construction process and modeling results of the mode detection model. The paper closes with some major conclusions and a discussion of future research directions.

Existing Literature
As a crucial step in GPS-based travel behavior identification, mode recognition has been investigated in many studies. Some researchers set thresholds on parameters to detect travel mode, an approach also called the criteria-based method. For example, Gong et al. [3] detected walking, subway, bus, and car by setting criteria for variables such as speed and acceleration, using GPS data collected in New York. Liu and Zheng [4] used parameters including average speed and median speed to identify walking, bicycle, car, bus, and subway. Besides GPS data, some studies also introduced Geographic Information System (GIS) technology to recognize travel modes, especially bus and subway. For example, GIS and actual road network information were used in Stopher and Greaves's study [1] for bus and subway recognition. Bohte and Maat [5] introduced a method combining GPS data, GIS information, and a web-based validation application to identify walking, bicycle, car, and railway.
Most of the previous studies conducted mode identification by building detection models. Many methods in this category, including neural networks [6,7], decision trees [8-10], Bayesian networks [9,11,12], Support Vector Machines (SVM) [9,13,14], and conditional random fields [9], have been applied to detecting travel mode. Most of the input variables come from the GPS data itself. Among these studies, Zhang et al. [13] presented an SVM model that achieved a relatively higher average accuracy rate, which suggests that SVM performs better than other methodologies in mode detection.
Although the previous studies proposed many methods for mode identification, most of their sample sizes were quite small. For example, Zhang [15] used GPS data collected from 23 volunteers, Gong et al. [3] employed 35 respondents' GPS traces, Vij et al. [16] conducted mode detection using 11 travelers' GPS data, and Du and Aultman-Hall [17] monitored 12 volunteers' GPS traces. An inadequate sample size may limit model construction and estimation, and in turn affect the performance of mode identification. Moreover, mode identification in big cities is more complicated due to complex traffic conditions and infrastructure, such as urban canyons, bus routes, and subway networks. A large sample size benefits data processing by providing more records for data analysis and model construction. Bolbol et al. [18] took advantage of the GPS data of a large-scale study conducted in the Netherlands in 2007, which collected one-week survey data from 1104 respondents. Their results indicated that, for mode detection in a one-day GPS-based travel survey, the sample sizes needed are 289 and 271 for bus and car trips, respectively. By increasing the sample size of GPS data, the accuracy rate of mode detection can be further enhanced. Therefore, this paper constructs the mode detection model using GPS survey data collected from 900 respondents as part of a large-scale OD survey in Beijing in 2010. To enhance the performance of mode identification, SVC is employed to establish the detection model.

Data
This section introduces the survey methods and data processing. Beijing is chosen as the study area of this paper. As the capital city of China, it has an integrated urban land use and a composite transportation network, which includes a complex road network and one of the largest public transit systems in the world. The composite land use and transportation conditions make the data survey and mode detection very complicated. The problem of signal loss due to urban canyons and subway trips also has to be addressed in data processing.

Data Survey
This study takes advantage of GPS survey data collected as part of a large-scale OD survey conducted in Beijing in 2010. The data sample consists of more than one million travel records from 900 respondents, collected during the survey period from 20 October 2010 to 26 November 2010. Each respondent has an average of about 1100 records. The survey area was 16,410.54 km² and the resident population of Beijing in 2010 was about 19.6 million. During the survey period, in addition to one-day GPS records, the 900 respondents also reported their travel diaries and socio-demographic information by filling in a paper form. With records containing missing values eliminated, our final sample consists of 1,872,431 GPS records.
Each record in the dataset represents a GPS signal captured consecutively at 5-s intervals by the GPS device (i-gotU GT-600) and contains information on index, date and time (universal coordinated time, UTC), latitude, longitude, altitude (m), speed (m/h), course (°), distance (m), EHPE (estimated horizontal position error, cm), and satellite ID.

Data Processing
For each GPS record, the first step of data processing is to convert the geodetic coordinates (latitude and longitude) into geographical coordinates (X and Y) and the UTC date and time into local time. As crucial measures of the quality of GPS records, satellite ID, EHPE, speed, and position (altitude, longitude, and latitude) are used to remove invalid records from the dataset. GPS points are then combined into trips and activities. Zong et al. [19] describe the detailed process of identifying trips and activities with GPS data. Consisting of four sub-steps, namely, dividing status segments, identifying activities, recognizing trips, and determining intermediate stops, the identification process can determine trips and activities from GPS logs with a high level of accuracy [19].
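The record-level cleaning described above can be sketched as follows. This is a minimal illustration only: the field names, the `satellites` count, and the two cut-off constants are assumptions made for the example and are not the values used in the survey. The coordinate projection step is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

BEIJING_TZ = timezone(timedelta(hours=8))  # Beijing local time is UTC+8


@dataclass
class GpsRecord:
    utc_time: datetime   # timezone-aware UTC timestamp
    latitude: float      # degrees
    longitude: float     # degrees
    speed: float         # in the device's speed unit
    ehpe_cm: float       # estimated horizontal position error, cm
    satellites: int      # satellites used for the fix (assumed field)


# Illustrative cut-offs only; NOT the thresholds used in the paper.
MAX_EHPE_CM = 5000.0
MIN_SATELLITES = 3


def to_local_time(record: GpsRecord) -> datetime:
    """Convert the UTC timestamp of a record to Beijing local time."""
    return record.utc_time.astimezone(BEIJING_TZ)


def is_valid(record: GpsRecord) -> bool:
    """Keep a record only if its quality indicators look plausible."""
    return (
        record.satellites >= MIN_SATELLITES
        and record.ehpe_cm <= MAX_EHPE_CM
        and record.speed >= 0.0
        and -90.0 <= record.latitude <= 90.0
        and -180.0 <= record.longitude <= 180.0
    )


def clean(records):
    """Return only the records that pass the validity check."""
    return [r for r in records if is_valid(r)]
```

In practice the cut-offs would be tuned to the device and the local signal conditions, for example the urban canyons mentioned earlier.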
Eight typical identification parameters, as shown in Table 1, concerning the speed, acceleration, travel time, and trip distance of each trip, are then employed to represent the mode characteristics, based on previous studies [7]. The threshold ranges for each identification parameter are then defined according to the traffic conditions and complex built environment in Beijing as well as the results of previous studies [20]. The thresholds of the identification parameters (shown in Table 1) are then used to filter the data samples. After data filtering, the final sample consists of 85,120 GPS trips. The daily activity and travel pattern of respondent #010 are shown in Figure 1. The major trip characteristics as well as the starting and ending time of each trip are also presented.
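As a rough illustration of how trip-level parameters of this kind can be derived from the 5-s records, the sketch below computes six descriptive values for one trip. The function names, the choice of features, and the assumption that speeds are given in m/s are hypothetical; the eight parameters actually used are those listed in Table 1.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def trip_features(speeds, interval_s=5.0):
    """Summarize one trip from speeds (assumed m/s) sampled every interval_s seconds."""
    if len(speeds) < 2:
        raise ValueError("a trip needs at least two records")
    pairs = list(zip(speeds, speeds[1:]))
    # Absolute speed change per second between consecutive records.
    accels = [abs(b - a) / interval_s for a, b in pairs]
    travel_time = (len(speeds) - 1) * interval_s
    # Trapezoidal estimate of the distance covered.
    distance = sum((a + b) / 2.0 * interval_s for a, b in pairs)
    return {
        "avg_speed": distance / travel_time,
        "max_speed": max(speeds),
        "p75_speed": percentile(speeds, 75),
        "max_accel": max(accels),
        "travel_time": travel_time,
        "distance": distance,
    }
```

A trip whose features fall outside the threshold ranges would then be dropped before model estimation.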

Methodologies
This section discusses the major methods used in mode detection. As one type of SVM, SVC is used to identify the travel mode with GPS data. Compared to the multinomial logit (MNL) model, which is limited in distinguishing alternatives that are significantly correlated (such as bus and car), SVC can provide higher prediction accuracy. In addition, GA is employed to optimize the major parameters in SVC in order to enhance the solution quality and calculation efficiency of the model.

SVC
As a popular disaggregate model, the MNL model has been widely used in mode prediction and identification. However, it imposes the restriction that the random error terms are independently and identically distributed over alternatives. This restriction rules out random taste variation and implies the independence of irrelevant alternatives, which forces the cross-elasticities among all pairs of alternatives to be identical [21]. Compared to the MNL model, SVC has been applied more frequently in modeling disaggregate choices in recent years due to its ability to enhance prediction accuracy and calculation efficiency [22]. Therefore, this paper introduces SVC for the estimation of travel mode with GPS data.
SVC is one of the SVM (i.e., machine learning) methods for analyzing data and recognizing patterns. Given a set of input-output data pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$ (where l is the number of training samples) that are randomly and independently generated from an unknown function, SVM estimates the function using the following equation [23]:
$$f(x) = w^{T}\phi(x) + b \tag{1}$$
where $\phi(x)$ represents the high-dimensional feature space that is nonlinearly mapped from the input space x, w denotes a parameter vector, and b is the threshold [24]. If the output y only takes category values, i.e., −1 and +1, the learning problem is SVC. Otherwise, if the domain of the output space y contains continuous real values, the learning problem is Support Vector Regression (SVR) [25].
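For classification on the training data, SVM's linear soft-margin algorithm minimizes the following regularized risk function:
$$\frac{1}{2}\|w\|^{2} + C\,R_{emp}[f] \tag{2}$$
The first term, $\frac{1}{2}\|w\|^{2}$, is called the regularization term and is used as a measure of the flatness of the function. The second term, $R_{emp}[f]$, is the so-called loss function, which measures the empirical error. C is a regularization constant that determines the trade-off between the training error and the generalization performance.
Here, the ε-insensitive loss function is employed to measure the empirical error:
$$L_{\varepsilon}(y, f(x)) = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \tag{3}$$
Equation (3) defines an ε tube (shown in Figure 2). The loss is zero if the predicted value is within the tube. If it is outside the tube, the loss is the amount by which the absolute difference between the predicted and observed values exceeds the radius ε of the tube. Both C and ε are user-determined parameters. Two positive slack variables, $\xi_i$ and $\xi_i^{*}$, are used to cope with the constraints of the optimization problem. To estimate w and b, Equation (2) can be transformed into the primal objective function, Equation (4):
$$\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - w^{T}\phi(x_i) - b \le \varepsilon + \xi_i \\ w^{T}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \ldots, l \end{cases} \tag{4}$$
This constrained optimization problem is solved by using the following primal Lagrangian form:
$$L = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\left(\xi_i + \xi_i^{*}\right) - \sum_{i=1}^{l}\alpha_i\left(\varepsilon + \xi_i - y_i + w^{T}\phi(x_i) + b\right) - \sum_{i=1}^{l}\alpha_i^{*}\left(\varepsilon + \xi_i^{*} + y_i - w^{T}\phi(x_i) - b\right) - \sum_{i=1}^{l}\left(\eta_i\xi_i + \eta_i^{*}\xi_i^{*}\right) \tag{5}$$
where L is the Lagrangian, and $\eta_i$, $\eta_i^{*}$, $\alpha_i$, and $\alpha_i^{*}$ are Lagrange multipliers. Hence the dual variables in Equation (5) have to satisfy the positivity constraints:
$$\eta_i,\ \eta_i^{*},\ \alpha_i,\ \alpha_i^{*} \ge 0$$
The above problem can be converted into a dual problem where the task is to optimize the Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$. The dual problem contains a quadratic objective function of $\alpha_i$ and $\alpha_i^{*}$ with one linear constraint:
$$\max_{\alpha,\,\alpha^{*}} \; -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\left(\alpha_i - \alpha_i^{*}\right)\left(\alpha_j - \alpha_j^{*}\right)\phi(x_i)^{T}\phi(x_j) - \varepsilon\sum_{i=1}^{l}\left(\alpha_i + \alpha_i^{*}\right) + \sum_{i=1}^{l} y_i\left(\alpha_i - \alpha_i^{*}\right) \quad \text{s.t.} \quad \sum_{i=1}^{l}\left(\alpha_i - \alpha_i^{*}\right) = 0, \quad 0 \le \alpha_i,\ \alpha_i^{*} \le C \tag{9}$$
By introducing the kernel function $K(x_i, x_j)$, Equation (9) can be rewritten as follows:
$$\max_{\alpha,\,\alpha^{*}} \; -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\left(\alpha_i - \alpha_i^{*}\right)\left(\alpha_j - \alpha_j^{*}\right)K(x_i, x_j) - \varepsilon\sum_{i=1}^{l}\left(\alpha_i + \alpha_i^{*}\right) + \sum_{i=1}^{l} y_i\left(\alpha_i - \alpha_i^{*}\right)$$
where $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is the so-called kernel function, which is equal to the inner product of the two vectors $x_i$ and $x_j$ in the feature space. Using the kernel function avoids the explicit computation of $\phi(x)$ and thus simplifies the calculation. Some popular kernel functions are the linear kernel, polynomial kernel, and radial-basis function (RBF) kernel. Using different kernel functions, one can construct different learning machines with arbitrary types of decision surfaces. In general, the RBF kernel, as a nonlinear kernel function, is a reasonable first choice [26]. Thus, the RBF kernel is chosen in this work:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right)$$
where σ is a parameter that determines the area of influence that a support vector has over the data space.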
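The building blocks of this formulation are small enough to write out directly. The sketch below is illustrative only: it assumes the common Gaussian form of the RBF kernel, exp(−‖x_i − x_j‖² / (2σ²)), and the function names are hypothetical. It shows the ε-insensitive loss, the kernel, and a kernel-expanded decision value in which each coefficient stands for the difference between the two Lagrange multipliers of a support vector.

```python
import math


def rbf_kernel(x_i, x_j, sigma):
    """Gaussian RBF kernel (assumed form with 2 * sigma**2 in the denominator)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(y - f_x) - eps)


def decision_value(x, support_vectors, coeffs, b, sigma):
    """f(x) = sum_i coeff_i * K(x_i, x) + b, with coeff_i = alpha_i - alpha_i*."""
    return sum(
        c * rbf_kernel(sv, x, sigma) for sv, c in zip(support_vectors, coeffs)
    ) + b
```

A smaller σ makes each support vector influence only nearby points, which is why σ has to be tuned together with C and ε.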
As user-determined parameters in SVC and the RBF kernel, C, ε, and σ greatly influence the estimation efficiency and prediction accuracy of the model, especially in large-scale or real-time practical applications. This paper optimizes these parameters by employing GA. GA is part of evolutionary computing, a rapidly growing area of artificial intelligence. The process of GA is presented as follows.

Parameter Optimization with GA
As the key elements in SVC, the parameters C, ε, and σ directly determine the prediction performance of the model. Therefore, parameter optimization is an important factor in improving the prediction accuracy of SVC. In this paper, GA is applied to optimize C, ε, and σ in SVC. GA is inspired by evolutionary biology processes such as inheritance, selection, crossover, and mutation. Based on a fitness function, GA attempts to retain relatively good genetic information from generation to generation. The process of GA can be divided into six steps, which are described briefly in the following subsections.

Encoding of Chromosomes
GA starts with a set of candidate solutions called the population; the individuals comprising the population are known as chromosomes. In most GA applications, the chromosomes are encoded as a series of zeroes and ones, i.e., a binary bit string. For the mode detection model with SVC, real-valued encoding was adopted since the parameters C, ε, and σ are continuous-valued. To represent the parameters in SVC, each chromosome consists of $gen_1^n$, $gen_2^n$, and $gen_3^n$ (n refers to the current generation), which represent the three parameters, respectively.
To reduce the search space, and following previous literature using SVC, the three parameters are restricted to a fixed range [27]. An example of the encoding of a chromosome is shown in Table 2.
Table 2. An example of chromosome encoding.
$gen_1^n$ | $gen_2^n$ | $gen_3^n$
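A real-valued encoding of this kind can be sketched as a list of three floats per chromosome. The bounds below are hypothetical placeholders chosen for illustration; they are not the ranges taken from the literature cited above.

```python
import random

# Hypothetical search bounds for (C, epsilon, sigma); illustration only.
BOUNDS = [(0.1, 100.0), (0.0001, 0.1), (0.01, 10.0)]


def random_chromosome(rng, bounds=BOUNDS):
    """One chromosome: [gen_1, gen_2, gen_3] = [C, epsilon, sigma]."""
    return [rng.uniform(lo, hi) for lo, hi in bounds]


def init_population(size, rng, bounds=BOUNDS):
    """Initial population of real-encoded chromosomes."""
    return [random_chromosome(rng, bounds) for _ in range(size)]
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing several optimization runs.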

Fitness Function
The fitness function is used to evaluate the quality of the solution represented by a chromosome. For parameter optimization in SVC, the best solution is the one that maximizes the accuracy rate of prediction. Since GA searches for the individual chromosome with the maximum fitness, the Hit ratio is adopted as the fitness function in this paper. Here, the Hit ratio (HitR) refers to the degree to which the identification results fit the observed samples:
$$HitR = \frac{N_1}{N_2} \times 100\%$$
where $N_1$ is the number of records whose travel mode is correctly predicted by the model (hits) and $N_2$ is the total number of observations.
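The Hit ratio is simply the share of trips whose predicted mode matches the observed one. A minimal sketch, with a hypothetical function name and the ratio returned as a fraction rather than a percentage:

```python
def hit_ratio(predicted, observed):
    """Fraction of observations whose predicted mode equals the observed mode."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)
```

When used as a GA fitness, this value would be computed on held-out trips after training an SVC with the parameters encoded in the chromosome.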

Crossover Operator
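Crossover is a reproduction technique that takes two parent chromosomes and produces two child chromosomes. In this paper, an arithmetic crossover is used to create new offspring [28]:
$$gen_{k,I}^{n} = \alpha_k\, gen_{k,I}^{n-1} + (1 - \alpha_k)\, gen_{k,II}^{n-1}, \qquad gen_{k,II}^{n} = \alpha_k\, gen_{k,II}^{n-1} + (1 - \alpha_k)\, gen_{k,I}^{n-1}$$
where ($gen_{k,I}^{n-1}$, $gen_{k,II}^{n-1}$) is a pair of "parent" chromosomes; ($gen_{k,I}^{n}$, $gen_{k,II}^{n}$) is a pair of "children" chromosomes; $\alpha_k$ is a random number in (0, 1); and k denotes the gene involved in the crossover operation. Table 3 shows the parents selected for crossover. When k = 1 and $\alpha_k$ = 0.4, the children chromosomes after crossover are shown in Table 4.
Table 3. An example of chromosome encoding.
Table 4. The children chromosomes after crossover (k = 1, $\alpha_k$ = 0.4).
Children I | $0.4\,gen_{1,I}^{n-1} + 0.6\,gen_{1,II}^{n-1}$ | $gen_{2,I}^{n-1}$ | $gen_{3,I}^{n-1}$
Children II | $0.4\,gen_{1,II}^{n-1} + 0.6\,gen_{1,I}^{n-1}$ | $gen_{2,II}^{n-1}$ | $gen_{3,II}^{n-1}$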
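The worked example with k = 1 and a weight of 0.4 can be reproduced with a few lines of code. This is a sketch under the assumption that only the selected gene is blended and the other genes are copied unchanged; the function name is hypothetical and `k` is a zero-based index, so `k=0` corresponds to the first gene.

```python
def arithmetic_crossover(parent_i, parent_ii, k, alpha):
    """Blend gene k of two parents; all other genes are copied unchanged."""
    child_i, child_ii = list(parent_i), list(parent_ii)
    child_i[k] = alpha * parent_i[k] + (1.0 - alpha) * parent_ii[k]
    child_ii[k] = alpha * parent_ii[k] + (1.0 - alpha) * parent_i[k]
    return child_i, child_ii
```

Because each child gene is a convex combination of the parent genes, it stays within the parameter range whenever both parents do.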

Mutation Operator
Mutation is a common reproduction operator used to find new points in the search space to evaluate. A genetic mutation operation is used in this paper [27].
Assuming a chromosome is $(gen_1^n, gen_2^n, gen_3^n)$, if $gen_1^n$ is selected for mutation, the mutated gene is given by Equation (14), where r is a random number in [0, 1]; $T_{max}$ is the maximum number of generations; and λ = 3. This form causes the operator to search the space almost uniformly when n is small and very locally in later stages.
To deal with the problem that a mutation may violate the parameters' constraints, this paper assigns a relatively high penalty weight to violating chromosomes, which reduces their probability of being selected in the subsequent search [27,29].
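A mutation with the behavior described here (wide steps early, very local steps late, governed by r, the maximum number of generations, and λ = 3) matches the widely used non-uniform mutation operator. The sketch below assumes that standard form; it is not confirmed to be the exact expression of Equation (14). It also keeps the mutated gene inside explicit bounds by construction, which is a simplification relative to the penalty weighting mentioned above. All names are hypothetical.

```python
import random


def nonuniform_mutation(value, lower, upper, n, t_max, rng, lam=3.0):
    """Assumed non-uniform mutation: step size shrinks as generation n nears t_max."""

    def delta(room):
        # Lies in [0, room]; tends to 0 as n approaches t_max.
        r = rng.random()
        return room * (1.0 - r ** ((1.0 - n / t_max) ** lam))

    if rng.random() < 0.5:
        return value + delta(upper - value)  # move toward the upper bound
    return value - delta(value - lower)      # move toward the lower bound
```

At n = t_max the exponent becomes zero, the step vanishes, and the gene is returned unchanged.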

Termination
The search continues until $HitR_n - HitR_{n-1} < 0.001\%$ or the number of generations reaches the maximum number of generations $T_{max}$, which is set to 5000 [30].
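Read literally, the stopping rule compares the best Hit ratio of two consecutive generations. A minimal sketch of that literal reading, with Hit ratios stored as fractions (so 0.001% becomes 1e-5) and hypothetical names:

```python
def should_stop(hit_history, generation, t_max=5000, tol=0.001 / 100.0):
    """Stop at t_max, or when the latest improvement in Hit ratio is below tol."""
    if generation >= t_max:
        return True
    if len(hit_history) < 2:
        return False
    return hit_history[-1] - hit_history[-2] < tol
```

In practice such a rule is often applied over a window of several generations so that a single stalled generation does not end the search prematurely; that variation is a design choice and is not stated in the text.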

The Procedure of GA
The major steps of GA are shown in Figure 3.

Identification Model
This section introduces the detailed parameter settings of SVC and GA as well as the modeling procedure of mode detection. The models are estimated with the survey data from Beijing. Compared to the observed records provided by the survey data, success rates of 100%, 100%, 88.9%, 92.7%, and 80.0% were obtained for the detection of walking, bicycle, subway, bus, and car, respectively. This indicates that the developed model achieves a high accuracy rate for mode identification. As a case study, the identified and observed mode choices of respondent #010 are presented.

Alternatives
According to the travel survey data, the real mode share in Beijing in 2010 was calculated. The results are shown in Table 5. They reveal that car and bus are the two most popular travel modes in Beijing. For mode identification using GPS data, bus and car are strongly correlated due to their similar characteristics, especially in speed and acceleration under almost the same traffic conditions. Therefore, as mentioned above, SVC, instead of the MNL model, is employed to establish the mode identification model in this paper [31,32].

Variables
Based on a preliminary correlation test, eight variables related to the characteristics of travel modes were selected, as shown in Table 6.

Estimation Results
There are three major GA parameters, namely the crossover probability pc, the mutation probability pm, and the population size psize. In general, pc varies from 0.3 to 0.9 and pm varies from 0.01 to 0.1, while psize is set according to the size of the samples. Considering the features of mode identification, the condition of the survey data, and previous studies related to GA [27,28,33,34], pc, pm, and psize are set at 0.6, 0.06, and 80, respectively. The convergence of the GA calculation indicates that the Hit ratio increases slowly after the 3000th generation. The highest Hit ratio appears at about the 3500th generation and remains almost unchanged after that. The three parameters, i.e., C, ε, and σ, were then optimized as 84.54, 0.0009, and 0.7367, respectively, with the best optimization value among the 10 results adopted for the practical mode detection model.
The estimation results of the SVC model are shown in Table 7. The results indicate that the factors related to acceleration, travel speed, and travel time are the major ones that should be considered in travel mode detection, while trip distance does not have a significant impact on mode determination. One of the reasons for this is the correlation between travel time and trip distance. In detecting walking trips, travel time and maximum speed are important factors; that is, the longer a trip takes or the higher its maximum speed, the lower the probability that its mode is walking. The results also show that three major factors, i.e., the 75th percentile of speed, maximum speed, and acceleration, affect the detection of bicycle trips. In detail, a trip with low speed and acceleration tends to be by bicycle. On the contrary, a trip with high speed and acceleration is more likely to be on the subway. The determination of bus is similar to that of subway, except that the coefficients of maximum speed and acceleration are smaller than those for subway. This reveals that the higher a trip's speed and acceleration, the higher the probability that its mode is subway.
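To show how the reported settings (pc = 0.6, pm = 0.06, psize = 80) fit into one search loop, the sketch below runs a small GA on a toy fitness function. Several parts are assumptions made for illustration: the parameter bounds, tournament selection, elitism, the non-uniform mutation form, and `toy_fitness`, which stands in for the SVC Hit ratio and merely peaks at the reported optimum so that the example has a known answer.

```python
import random

P_C, P_M, P_SIZE = 0.6, 0.06, 80  # settings reported in the text
BOUNDS = [(0.1, 100.0), (0.0001, 0.1), (0.01, 10.0)]  # hypothetical ranges


def toy_fitness(chrom):
    """Stand-in for the SVC Hit ratio; equals 1.0 only at the target point."""
    target = (84.54, 0.0009, 0.7367)
    err = sum(
        ((g - t) / (hi - lo)) ** 2 for g, t, (lo, hi) in zip(chrom, target, BOUNDS)
    )
    return 1.0 / (1.0 + err)


def tournament(pop, scores, rng):
    """Binary tournament selection (an assumed selection scheme)."""
    a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
    return pop[a] if scores[a] >= scores[b] else pop[b]


def run_ga(fitness, generations, rng):
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(P_SIZE)]
    best = max(pop, key=fitness)
    for n in range(1, generations + 1):
        scores = [fitness(c) for c in pop]
        children = [list(best)]  # elitism: carry the best chromosome forward
        while len(children) < P_SIZE:
            c1 = list(tournament(pop, scores, rng))
            c2 = list(tournament(pop, scores, rng))
            if rng.random() < P_C:  # arithmetic crossover on one random gene
                k, a = rng.randrange(3), rng.random()
                c1[k], c2[k] = (
                    a * c1[k] + (1 - a) * c2[k],
                    a * c2[k] + (1 - a) * c1[k],
                )
            for c in (c1, c2):
                if rng.random() < P_M:  # non-uniform mutation on one random gene
                    k = rng.randrange(3)
                    lo, hi = BOUNDS[k]
                    shrink = 1.0 - rng.random() ** ((1.0 - n / generations) ** 3)
                    if rng.random() < 0.5:
                        c[k] += (hi - c[k]) * shrink
                    else:
                        c[k] -= (c[k] - lo) * shrink
            children.extend([c1, c2])
        pop = children[:P_SIZE]
        best = max(pop + [best], key=fitness)
    return best, fitness(best)
```

Calling `run_ga(toy_fitness, 200, random.Random(0))` returns the best chromosome found and its fitness. With a real model, `fitness` would train an SVC with the encoded C, ε, and σ and return its Hit ratio, which makes each generation far more expensive than in this toy setting.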

Mode Identification
With the developed model, the travel mode of every trip in the GPS sample data is determined. By comparing the predicted results with the observed records, the Hit ratios of the model are obtained, as shown in Table 8. The estimation results show high accuracy for the detection of walking, bicycle, subway, and bus, and the accuracy for car is also acceptable. Further analysis reveals that the relatively lower accuracy of car and subway determination arises because the model sometimes confuses car and subway trips. Furthermore, compared to the detection results presented by Gong et al. [3] (92.4%, 65.5%, 62.5%, and 84.1% for walking, subway, bus, and car, respectively), the detection results of this work are better. This also demonstrates the good performance of the SVC model in mode detection.
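Per-mode Hit ratios and the kind of confusion analysis mentioned here (for example, car trips labeled as subway) can be tabulated as follows. This is an illustrative sketch; the function names and mode labels are hypothetical.

```python
from collections import Counter, defaultdict

MODES = ["walking", "bicycle", "subway", "bus", "car"]


def per_mode_hit_ratio(predicted, observed):
    """Hit ratio for each observed mode that occurs at least once."""
    totals = Counter(observed)
    hits = Counter(o for p, o in zip(predicted, observed) if p == o)
    return {m: hits[m] / totals[m] for m in MODES if totals[m] > 0}


def confusion_counts(predicted, observed):
    """table[observed_mode][predicted_mode] = number of trips."""
    table = defaultdict(Counter)
    for p, o in zip(predicted, observed):
        table[o][p] += 1
    return table
```

Inspecting the off-diagonal counts of the confusion table shows directly which pairs of modes the model mixes up.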

Conclusions
In this paper, an SVC model was constructed for mode detection with GPS survey data. GA was used to optimize the parameters in SVC. The results indicate that the developed model shows a high level of accuracy for mode identification. Our findings can significantly enhance the efficiency and accuracy of travel surveys and data processing. They also serve as a foundation for a future model system of full-scale travel information identification with GPS data. Moreover, by providing crucial travel information, the results contribute to the modeling and analysis of travel behavior and are readily applicable to a wide range of transportation practices.
The estimation results also reveal that further study should be conducted on the high-accuracy detection of subway, bus, and car. One potential method is to distinguish subway and bus from car based on GIS information. Therefore, a future study could make progress by combining GPS data with GIS technology to determine travel modes. Moreover, although the sample size is relatively large compared to some previous studies, it was not representative enough of the general population. Further study should pay more attention to the process of sampling and data collection to eliminate sampling bias.

Figure 1. GPS traces and daily activity travel pattern of respondent #010.

Figure 2. The parameters for SVC.

Figure 3. The basic procedure of GA.
Figure 4 shows respondent #010's daily mode choices identified by the model.

Figure 4. Identified and observed mode choices of respondent #010.

Table 1. The thresholds of identification parameters.

Table 6. Variables in the mode identification model.

Table 7. Estimation results of the mode detection model.

Table 8. Verification results of the mode detection model.