Trip purpose imputation using GPS trajectories with machine learning

Abstract: We studied trip purpose imputation using data mining and machine learning techniques based on a dataset of GPS-based trajectories gathered in Switzerland. With a large number of labeled activities in 8 categories, we explored location information using hierarchical clustering and achieved a classification accuracy of 86.7% using a random forest approach as a baseline. The contribution of this study is summarized below. Firstly, using information from GPS trajectories exclusively, without personal information, shows a negligible decrease in accuracy (0.9%), which indicates the good performance of our data mining steps and the wide applicability of our imputation scheme in cases of limited information availability. Secondly, the dependence of model performance on the geographical location, the number of participants, and the duration of the survey is investigated to provide a reference when comparing classification accuracy. Furthermore, we show the ensemble filter to be an excellent tool in this research field, not only because of the increased accuracy (93.6%), especially for minority classes, but also because of the reduced uncertainty in blindly trusting the labeling of activities by participants, which is vulnerable to class noise due to the large survey response burden. Finally, the trip purpose derivation accuracy across participants reaches 74.8%, which is significant and suggests the possibility of effectively applying a model trained on the GPS trajectories of a small subset of citizens to a larger GPS trajectory sample.


Introduction
Trip purpose imputation is an important part of constructing travel diaries of individuals and has attracted the attention of many researchers due to its significance for imputation precision. Generally, the socio-demographic characteristics of participants are gathered together with GPS trajectories and are taken to be important supplementary information [1]. Land use data and POI can be used to indicate possible activities for a stopping point on GPS trajectories [8]. In addition, the popularity of POI inferred from social media data (e.g., Twitter) [5], travel and tourism statistics [9], and mobile phone billing data [10] has also been utilized to derive travel purpose.

Data pre-processing, which has been intensively investigated in the data mining field [11], receives much less discussion than it deserves in trip purpose imputation research. Therefore, we discuss the issue in depth below. García et al. [12] summarized the three most influential data pre-processing requirements to improve data mining efficiency and performance: imperfect data handling, data reduction, and imbalanced data pre-processing.

An important aspect of imperfect data handling is noise filtering [13], which aims at detecting attribute noise and the more harmful class noise [14]. For class noise removal, ensemble filters proposed by Brodley and Friedl [15,16] have been widely applied as an excellent tool. Ensemble filters adopt an ensemble of classifiers to eliminate the mislabeled training data that cannot be correctly classified by all or part of the classifiers using n-fold cross-validation. To avoid treating an exception that is specific to one algorithm as noise, multiple algorithms are used. Basically, there are two strategies for implementing ensemble filters: majority vote filters, in which instances that cannot be correctly classified by more than half of the algorithms are treated as mislabeled; and conservative consensus filters, in which only the instances that none of the algorithms classify correctly are treated as noise. Majority vote filters are sometimes preferred to conservative consensus filters, as retaining bad data is more harmful than discarding good data, especially when there are ample training data [16]. Nevertheless, we chose conservative consensus filters, with the results of the two strategies being similar.
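A conservative consensus filter as described above can be sketched as follows; this is a simplified illustration, not the study's exact setup, and the classifier choice and synthetic data are our own assumptions:

```python
# Consensus ensemble filter: an instance is flagged as class noise only when
# EVERY classifier in the ensemble misclassifies it under n-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def consensus_filter(X, y, classifiers, n_folds=5):
    """Return a boolean mask of instances to KEEP."""
    wrong = np.ones(len(y), dtype=bool)
    for clf in classifiers:
        pred = cross_val_predict(clf, X, y, cv=n_folds)
        wrong &= (pred != y)   # still misclassified by every classifier so far
    return ~wrong              # keep instances at least one model got right

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
ensemble = [RandomForestClassifier(random_state=0),
            KNeighborsClassifier(),
            LogisticRegression(max_iter=1000)]
keep = consensus_filter(X, y, ensemble)
X_clean, y_clean = X[keep], y[keep]
print(f"removed {np.sum(~keep)} suspected mislabeled instances")
```

A majority vote filter would only change the aggregation step: count misclassifications per instance and drop those misclassified by more than half of the ensemble.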

Missing data is another typical problem in transport research, which normally involves survey processes. The first step in handling missing data should be understanding the sources of "unknownness" [17]: values might be lost, uncollected, or unidentifiable within existing categories. Besides omitting the instances or features with missing values, which is usually not recommended, approaches for missing data inference can be classified into two groups [18]: data-driven, e.g., mean or mode substitution; and model-based, e.g., k-nearest neighbors (kNN). kNN has gained popularity because of its simplicity and good performance in dealing with both numerical and nominal values [19].
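As a concrete illustration of model-based missing data inference, a kNN imputer can fill a gap from a respondent's nearest neighbors; the sketch below uses scikit-learn's `KNNImputer` on invented numeric columns, not the survey's actual variables:

```python
# kNN missing value imputation: a minimal sketch with toy data.
import numpy as np
from sklearn.impute import KNNImputer

# rows: respondents; columns: e.g. age, household size (illustrative only)
X = np.array([[25.0, 2.0],
              [30.0, np.nan],   # missing household size
              [28.0, 3.0],
              [65.0, 1.0]])

# weights="distance" gives closer neighbors a larger vote (weighted kNN)
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # imputed from the two rows nearest in age
```

Note that `KNNImputer` handles numeric features only; mixed categorical/continuous data would need a Gower-style distance, as discussed later in the text.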

Attribute selection, as a classic part of data reduction, is conducive to generating a simpler and more accurate model and avoids over-fitting risks [12,20]. For feature selection, feature importance measured by the mean decrease in the Gini coefficient of the random forest approach can be used as a reference [21]. However, such a rank-based measure cannot take feature interactions into account and might suffer from stochastic effects [22]. Conventionally, feature selection techniques can be grouped into two categories: filter methods, i.e., variable ranking techniques; and wrapper methods, which involve classifiers and constitute an NP-hard problem [20]. One of the most popular algorithms for feature selection is minimum redundancy maximum relevance based on mutual information [23], which was initially designed as a filter and later developed into a wrapper as well [12]. Another popular wrapper algorithm, designed for the random forest, is provided in the R package Boruta [22]; it aims at identifying all relevant features rather than an optimal subset and is employed for our analysis.
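The mean-decrease-in-Gini ranking mentioned above can be read directly from a fitted random forest. A minimal sketch with synthetic data follows; the feature names are placeholders, not the paper's feature set:

```python
# Mean-decrease-in-Gini (impurity-based) feature importance from a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=42)
names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
# feature_importances_ holds the impurity (Gini) based scores, normalized to 1
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

As the text notes, this ranking ignores feature interactions, which is one motivation for complementing it with a wrapper such as Boruta.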

An imbalanced distribution of categories might result in unbalanced classification accuracies. This problem has also troubled the machine learning community, where Ling and Li [24] suggested duplicating minority classes and Kubat and Matwin [25] tried downsizing majority classes. One of the most prevalent ways to cope with imbalanced data is the Synthetic Minority Over-sampling Technique (SMOTE) introduced by Chawla et al. [26], which formulates new samples as randomized interpolations of minority class samples. SMOTE is widely used because of its simplicity, good performance, and compatibility with any machine learning algorithm [12].

The methods used to derive trip purposes can be divided into two main categories [28]: rule-based systems with an accuracy of around 70% [29], which rely predominantly on land use and personal information, as well as the timing, duration, and sequence of activities; and machine learning approaches, which focus more on activities than position and show accuracies varying between 70% and 96% depending on the algorithm, dataset, activity categories, and so on [8]. Although manual trip purpose derivation approaches using rules give satisfactory results, there is no standard set of accepted rules for mining travel information, so they rely on researchers' experience. Compared to conventional deterministic approaches, machine learning algorithms like random forests and dynamic Bayesian network models can even rank possible activities, which is particularly helpful when activities are ambiguous [5]. Consequently, we opt for machine learning approaches that have already been widely applied in this area, such as decision trees [30], random forests [28], artificial neural networks [31], and dynamic Bayesian network models [5]. Because of the good performance of random forests compared to other methods demonstrated by numerous studies [32-34], we employed them as a starting point for the analysis. An introduction to random forests is given in Section 3.2.

As a classification method, kNN [42] has also been shown to be a good missing value imputation technique [12,19]; a short introduction to the kNN algorithm is given below.

To define the "similarity" between two clusters, [45] summarized six strategies, from which we selected the "group-average" strategy as it is more reasonable and conservative than its alternatives; in our case, the similarity between two activities is defined by the Euclidean distance between their geographical locations. Through the process of hierarchical clustering, the inter-cluster distance d_XY increases gradually. Therefore, we can define an appropriate threshold to stop the process and obtain intermediate clustering results. In our study, a threshold of 30 meters is chosen to restrict the size of each cluster, considering GPS accuracy [37]; this results in a radius of fewer than 30 meters for each cluster.

A random forest is an ensemble of classification and regression trees [46]. Since its introduction, the classification and regression tree (CART) has been an important tool and has received much attention in different research fields [42]. A detailed description of CART can be found in Song and Ying [47]. As a further development of CART, Breiman [46] introduced the random forest.

As a baseline, we achieved an overall accuracy rate of 86.7% for eight activity categories using the highly heterogeneous data (3689 participants) with random forests. An advantage of the random forest is that it provides an inherent measure of feature importance using Gini impurity, as shown in Fig. 1, which provides an important reference for feature selection. Among the 21 features, the six most important features are clearly more useful in classification, whereas the person-based attributes are less relevant: except for "Age", all personal information belongs to the seven least relevant features. To assess the importance of the three sets of features grouped in Table 2, we conduct three additional experiments, each leaving one set of features out, and present the results in Fig. 2. When leaving all personal information unused, the overall accuracy decreases by around 0.9%. Although the Boruta method [22] shows that all features are relevant, which indicates a good result of our preliminary feature selection, we omit the personal information from further analysis for the following reasons: it indicates the strength and applicability of our method even when no personal information is available, i.e., we can undertake trip information enrichment at high accuracy using only GPS trajectories;



As a variation of SMOTE, the Adaptive Synthetic Sampling Approach (ADASYN) proposed by He et al. [27] puts more weight on minority samples that are harder to learn when selecting samples for interpolation.
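The minority-class interpolation shared by SMOTE and ADASYN can be sketched as follows; this is a simplified illustration of the idea, not the implementation used in the study:

```python
# SMOTE-style oversampling: each synthetic sample is a randomized interpolation
# between a minority-class sample and one of its k nearest minority neighbors.
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(0).normal(size=(20, 3))  # toy minority class
X_new = smote_like(X_min, n_new=40, rng=1)
print(X_new.shape)  # (40, 3)
```

ADASYN would modify only the sampling of `i`: instead of drawing minority samples uniformly, it draws harder-to-learn samples (those with many majority-class neighbors) more often.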

2.4. Model Performance Assessment

Model performance can be assessed in various ways, which act as an important component of model development. Although reported trip information might suffer from memory recall errors or other issues, it is probably the best candidate for ground truth in model validation and assessment [35]. Innovatively, Li et al. [36] used the visualized spatial distribution of recognized trip purposes to validate simulation outputs. Albeit classification models might be used to generate travel diaries for citizens that are not in the training dataset, Montini et al. [32] found that the accuracy of trip purpose detection is participant-dependent. As the proportion and categories of trip purposes have a significant influence on the accuracy of classification [9], high-frequency activities should be treated with special care.
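Since overall accuracy can mask weak performance on minority activity classes, per-class metrics are worth reporting alongside it. A minimal sketch follows; the activity labels are illustrative, not the study's exact categories:

```python
# Overall accuracy vs. per-class recall on a toy 3-class labeling.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array(["Home"] * 6 + ["Work"] * 3 + ["Leisure"] * 3)
y_pred = np.array(["Home"] * 6
                  + ["Work", "Work", "Home"]
                  + ["Leisure", "Home", "Leisure"])

labels = ["Home", "Work", "Leisure"]
print("overall accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=labels))
# per-class recall, i.e. the diagonal of the row-normalized confusion matrix
print(recall_score(y_true, y_pred, labels=labels, average=None))
```

Here the frequent "Home" class is recognized perfectly while both smaller classes lose a third of their instances, a pattern the overall accuracy of 83% does not reveal.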

In this study, we analyzed GPS trajectories collected from 3689 Swiss participants from September 2019 to September 2020 through the "Catch-my-day" GPS tracking app, developed by Motion Tag. Considering solely the 91% of all activities that are within Switzerland, the data amount to 1.82 million activities above a time threshold of 5 minutes, of which 43% are labeled by participants. Although a threshold of 5 minutes for extracting activities from GPS trajectories might ignore some short activities, we use it as a simplification in the current study. As a GPS-integrated mobile phone has a position error of 1 to 50 meters with a mean of 6.5 meters, as shown by Garnett and Stewart [37], this is taken into account when conducting the spatial clustering of activities. More details about the study design and research scope can be found in Molloy et al. [38] and Molloy et al. [39].

Based on the "Mobility and Transport Microcensus 2015" in Switzerland, we grouped activities into eight categories as shown in

Given a training set T = {U, V}, where U are predictors and V are labels, we can estimate the distance between a test object w_0 = {u_0, v_0} and all training objects w = {u, v} ∈ {U, V} to find its k nearest neighbors. The label v_0 for this test object w_0 is then determined as the median of the v values of its k nearest neighbors in the case of numerical variables, and the mode in the case of categorical variables. The Gower distance computation between u_0 and u, which is applicable to both categorical and continuous variables, is described in Kowarik and Templ [43]. Two issues might affect the performance of kNN: one is the choice of k, where a small value of k can be noise-sensitive and a large value of k might include redundant information; the other is that an arithmetic average ignores the distance-dependent characteristics, whereby closer objects have higher similarities. Both issues can be addressed by weighting the vote of each nearest neighbor by its distance, i.e., weighted kNN. Missing value imputation for person-related information in this work is conducted using the R package "VIM" developed by Kowarik and Templ [43], which also provides weighted kNN methods for better performance.

To explore the implicit information contained in the data, data mining techniques like clustering can be employed [28]. Using the hierarchical clustering method introduced by Ward Jr [44], we grouped the spatial locations of activities for each participant to make use of the repetitive patterns of human behavior. Hierarchical clustering optimizes the route by which groups are obtained [45], so it might not give the best clustering result for a specified number of groups [44]. However, compared to the widely known k-means clustering technique, hierarchical clustering allows us to define the distance used for grouping rather than the number of groups. The basic steps of hierarchical clustering are as follows: 1) treat the initial x objects as individual clusters; 2) group the pair of the most "similar" clusters; 3) repeat step 2 until a single cluster containing all objects is obtained.
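The clustering steps above, with group-average linkage and a distance cutoff, can be sketched with SciPy; the coordinates below are toy values in meters, not survey data:

```python
# Hierarchical clustering of activity locations with a 30 m distance cutoff.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# toy activity locations for one participant, projected to meters
points = np.array([[0.0, 0.0], [5.0, 3.0], [8.0, 1.0],  # near each other
                   [500.0, 500.0], [505.0, 498.0],      # a second place
                   [2000.0, 100.0]])                    # a third place

# method="average" is group-average linkage, i.e. mean pairwise distance
Z = linkage(points, method="average")
# stop merging once the group-average distance exceeds 30 m
labels = fcluster(Z, t=30, criterion="distance")
print(labels)  # three clusters, one per visited place
```

Setting the cutoff instead of the number of clusters mirrors the argument in the text: the 30 m threshold reflects GPS accuracy, while the number of places a participant visits is unknown in advance.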

Next, we use two general activity clusters X and Y to illustrate the estimation of their average distance. Assume there are m and l activities in clusters X and Y, respectively, and let i and j index single elements of the m and l activities, respectively. We use d_ij to represent the distance between activities i and j, and d_XY the distance between clusters X and Y. Then we can calculate d_XY as:

d_XY = (1 / (m * l)) * Σ_{i=1}^{m} Σ_{j=1}^{l} d_ij

and the inclusion of socio-demographic data might lead to overfitting of models to the current participants and limit the applicability of the models to GPS trajectories of other users. While the elimination of activity information gives similar results, the removal of cluster-based information leads to a dramatic decrease in model performance, which strongly suggests the effectiveness of our use of hierarchical clustering algorithms.

Figure 1. Feature importance in trip purpose imputation, measured by the mean decrease in Gini in random forests.

Figure 2. Model performance for each activity category and the overall accuracy in four experiments, where we use all features or leave one set of features unused to measure the significance of each feature set.

Figure 3 shows the spatial distribution of labeled activities and the accuracy rate using grids with an area of 4 km² in Switzerland.

Figure 3. The spatial distribution of the number of labeled activities (a) and the accuracy rate (b), using grids with an area of 4 km² in Switzerland. The exponential scale in (a) is used to account for the unevenly distributed activities.

To investigate the dependence of classification performance on the number of participants

Figure 4. The impact of the number of participants and the duration of the survey on model performance.

Figure 5. The model performance on the original data and the ensemble-filtered data using four classification algorithms.

5. Discussion and Conclusions

Through feature importance analysis using the inherent measure of the mean decrease in Gini of random forests and the Boruta method, we verified that the current features are of high relevance and that the features extracted with hierarchical clustering are crucial for model performance. Additional experiments that leave out the set of person-related features reveal the possibility of trip purpose imputation with only GPS trajectories. Thanks to the innovative application of hierarchical clustering in extracting relevant features, the answer to the first research question becomes obvious: the required data sources for a satisfactory model performance are minimized to GPS trajectories. Although many researchers managed to achieve better performance by incorporating various data sources, we advocate considering limited data availability on a larger scale, where collecting personal information along with GPS trajectories is impossible or the quality of data sources varies considerably; this is vital to generalize our results.

In this context, it is important to note that it is misleading to compare accuracy rates among papers due to the different sample sizes (persons and length of observation).

Table 2. Selected features for trip purpose imputation. Categorical features are indicated by *, while m() and std() denote "mean of" and "standard deviation of", respectively.
Moreover, POI information from the Google Places API, as adopted in Ermagun et al. [3], was investigated in a pilot study and not considered further due to the large monetary cost for large datasets such as the one used here and its comparatively minor benefits. Residential zoning information in Switzerland as land use information was also tested, with very little effect on trip purpose derivation accuracy, and hence excluded from the final models.

3.2. Methods

Table 3. Confusion matrix of labeled versus predicted trip purposes using random forests (overall accuracy: 86.7%).

Table 4. Classification accuracy of multiple algorithms with the ensemble filter and for across-participants imputation.

The approach could still be improved for wider applicability in transport management, where the possibility might exist of including other data sources. While the division of activity categories is primarily subject to practical applications, its effects on model performance could be quantified in further analysis. In addition, the complexity of specific activities like
4.2. Ensemble Filter with Multiple Classification Algorithms

A large dataset is more vulnerable to class noise than a smaller one because of the heavier and longer survey response burden on participants. This is a challenging topic that has not been considered in the context of trip purpose imputation.