Smart City Mobility Application—Gradient Boosting Trees for Mobility Prediction and Analysis Based on Crowdsourced Data

Mobility management is one of the most important parts of the smart city concept. The way we travel, at what time of day, for what purposes and with what transportation modes has a considerable impact on the overall quality of life in cities. To manage this process, detailed and comprehensive information on individuals' behaviour is needed, as well as effective feedback and communication channels. In this article, we explore the applicability of crowdsourced data for this purpose. We apply a gradient boosting trees algorithm to model individuals' mobility decision making processes (particularly which transportation mode they are likely to use). To accomplish this, we rely on data collected over a period of six months from three sources: a dedicated smartphone application, a geographic information system-based web interface and weather forecast data. We see the developed model as a potential platform for personalized mobility management in smart cities and as a communication tool between the city (which can steer users towards more sustainable behaviour by additionally weighting preferred suggestions) and users (who can give feedback on the provided suggestions by accepting or rejecting them, which provides additional input to the learning process).


Introduction
The development of information and communication technologies (ICT) has a big impact on people's everyday lives. This is evident not just in the way we communicate with each other, but also in the amount of information we produce daily, intentionally or unintentionally, and in the potential these data bring to managing the urban environment. This potential is of particular research interest within the smart cities topic [1][2][3][4][5][6][7][8][9], and in this context, location acquisition technologies form an important basis for smart city applications [10,11].
When it comes to the mobility aspect of smart cities, location information acquisition is often supported by mobile phone data and, in the literature, there are some interesting examples of their use for extraction of origin-destination (OD) matrices [12][13][14][15][16] or derivation of travel behaviour information for model validation purposes [17,18]. The first attempts to provide personalized travel information services were made based on the analysis of data from public transport fare collection systems [19,20].
Nevertheless, little is known about the potential of crowdsourced data for smart city mobility management, especially in the context of personalized mobility services and the interactions between a city and the users of its transportation system. In this article, we address this gap by using crowdsourced data from multiple sensor sources and a gradient boosting trees algorithm to model the personal decision making process regarding transportation mode selection under a set of given conditions (location, trip purpose, weather conditions, time of day, etc.). We see this as a potential platform for a city to steer the mobility of its inhabitants towards more sustainable behaviour, by using the proposed model to provide personalised route suggestions to users via a dedicated smartphone application. The suggested approach not only enables communication from the city to the individual, but also captures users' feedback: by accepting or rejecting a personalised route suggestion, the user evaluates the provided option.

Gradient Boosting Trees
To model the users' decision making process regarding the transportation mode selection, we applied the gradient boosted trees (GBT) method. GBT is one of the most effective machine learning models for predictive analytics [21]. In general, it belongs to the family of decision tree learning methods which map observations about an item to conclusions about the item's target value in a tree structure. Depending on the characteristics of the target value they can be used for regression (when the target variable is continuous) or classification (categorical target variable) purposes. In this context, further on we will focus just on the GBT classifier, as our target value is categorical (transportation mode).

Predictive Learning
The predictive learning problem consists of random explanatory variables (predictors) $x = \{x_1, \ldots, x_n\}$ and a random response variable $y$. By using a training sample $\{y_i, x_i\}_1^N$ of known $(y, x)$ pairs of values, the goal is to obtain an estimate $\hat{F}(x)$ of the function $F^*(x)$ mapping $x$ to $y$ that minimizes the expected value of the loss function $L(y, F(x))$ over the joint distribution of all $(y, x)$ pairs (Equation (1)):

$$F^* = \arg\min_{F} E_{y,x}\, L(y, F(x)) \qquad (1)$$

Restricting $F(x)$ to be a member of a parameterized class of functions $F(x; P)$, where $P = \{P_1, P_2, \ldots\}$ is a finite set of parameters whose joint values identify individual class members, changes the function optimization problem into a parameter optimization problem (Equation (2)):

$$P^* = \arg\min_{P} E_{y,x}\, L(y, F(x; P)) \qquad (2)$$

where the value of the parameter $P^*$ is calculated as the sum of an initial guess $p_0$ and all successive increments ("boosts") $\{p_m\}_1^M$, each based on the sequence of preceding steps (Equation (3)):

$$P^* = \sum_{m=0}^{M} p_m \qquad (3)$$

In general, boosting is used to increase the stability of the model [22]: the weights of misclassified training events are increased ("boosted") and a new tree is formed. To measure the success of the prediction, a separate testing data set is used. This procedure is repeated for new trees, and the final score of the $m$th tree is the weighted sum of the scores of the individual leaves.
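The idea of building an estimate as an initial guess plus a sum of successive "boosts" can be illustrated with a minimal sketch. The example below is not the authors' implementation: it assumes a squared-error loss, a single one-dimensional predictor and single-split regression trees (stumps), all chosen only for brevity, and the data values are made up. The function names `fit_stump`, `boost` and `mse` and the `learning_rate` parameter are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump (least squares) on 1-D inputs."""
    best = None
    # Candidate thresholds exclude the largest x so both sides stay non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def boost(xs, ys, n_steps=50, learning_rate=0.1):
    """Additive model: initial guess plus a sum of shrunken stump increments."""
    initial = sum(ys) / len(ys)  # initial guess: the mean response
    stumps = []
    predict = lambda x: initial + learning_rate * sum(s(x) for s in stumps)
    for _ in range(n_steps):
        # Each new stump is fitted to the residuals left by all preceding steps.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Average squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up illustrative data (not from the article).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

baseline = lambda x: sum(ys) / len(ys)  # the initial guess alone
model = boost(xs, ys)
```

Comparing `mse(baseline, xs, ys)` with `mse(model, xs, ys)` shows how the training error shrinks as increments are added. The shrinkage factor `learning_rate` is a common practical addition and is not part of Equation (3).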

GBT Algorithm
To model the users' decision making regarding the selection of the transportation mode, we use the GBT algorithm originally developed by Jerome H. Friedman [23]. In this algorithm, the loss function for the $K$-class problem is described as (Equation (4)):

$$L\left(\{y_k, F_k(x)\}_1^K\right) = -\sum_{k=1}^{K} y_k \log p_k(x) \qquad (4)$$

where $y_k = 1(\text{class} = k) \in \{0, 1\}$ and $p_k(x) = \Pr(y_k = 1 \mid x)$. In addition, the logistic transformation is applied to the predicted values before computing residuals, scaling them to a probability scale, where each tree has $J$ terminal nodes with corresponding regions $\{R_{jkm}\}_{j=1}^{J}$, and is also used to compute the final classifications (Equation (5)):

$$p_k(x) = \frac{\exp(F_k(x))}{\sum_{l=1}^{K} \exp(F_l(x))} \qquad (5)$$

The steps of the GBT algorithm used are given below; for more details we refer the reader to the source publications [23,24], whereas a more general overview of decision trees and GBT can be found in the literature [25][26][27].
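As a small numerical illustration, the sketch below assumes the multinomial deviance loss and the softmax ("logistic") transformation of Friedman's K-class formulation. It computes class probabilities from per-class scores, the loss for one observation, and the residuals (observed indicator minus predicted probability) to which the next trees would be fitted. The function names and the score values are hypothetical and do not come from the article.

```python
import math


def class_probabilities(scores):
    """Map per-class scores F_k(x) to probabilities p_k(x) with a softmax."""
    shift = max(scores.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(f - shift) for k, f in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def multinomial_deviance(true_class, scores):
    """Loss -sum_k y_k log p_k(x); only the true class has y_k = 1."""
    return -math.log(class_probabilities(scores)[true_class])


def pseudo_residuals(true_class, scores):
    """Negative gradient of the loss with respect to F_k(x): y_k - p_k(x)."""
    probs = class_probabilities(scores)
    return {k: (1.0 if k == true_class else 0.0) - p for k, p in probs.items()}


# Made-up scores for one trip (illustrative only).
scores = {"car": 0.4, "foot": 1.1, "bike": 0.2}
probs = class_probabilities(scores)
predicted = max(probs, key=probs.get)  # class with the highest probability
loss = multinomial_deviance("car", scores)
residuals = pseudo_residuals("car", scores)
```

Here the observed mode is assumed to be "car" while the scores favour "foot", so the residual for "car" is positive and the next tree for that class would push its score upwards.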

Data Collection
As input values for the GBT algorithm, we rely on three data sources: (a) a dedicated smartphone application [28], which either logs actively (users track their routes and define the trip's purpose and the transportation mode used) or runs in a passive mode (tracks are logged in the background and automatically segmented into trips with separate IDs, and the transportation mode is detected with the Google Activity Recognition API [29]); (b) a dedicated geographic information system (GIS) web interface [30], where users can register and give basic information about their mobility behaviour, such as frequently used routes, trip purposes, etc.; and (c) a weather forecast API [31] that provides information about the weather conditions at the requested location.
Based on these data sources, over 4000 trips were recorded during a period of six months (Table 1). The fewest trips were made during the evening hours (after 19:00), whereas overall the most kilometres were travelled by car, followed by foot and bike (Figure 1).
From these three data sources, a set of predictor variables was created in order to model the users' decision making process when selecting a transportation mode for a given trip under given circumstances. Table 2 shows the full list of variables with their acronyms and descriptions.

Modelling the Individual's Mobility Decision Making from the Crowdsourced Data
To model the individual's mobility decision making process, we selected one individual (ID = 23), who logged 311 trips. These trips were made by three transportation modes: car, foot and bike. The goal of the GBT algorithm was to learn which transportation mode the user is most likely to use for a given purpose, weather conditions, origin and destination location pair, trip distance and starting time of the trip, and depending on whether the day is a working day or a holiday/weekend. The learning process is based on the previous behaviour of the user (training data set), and the results are then compared to the test data set (a separate data set that also contains information on the user's behaviour but was not used for the learning process) in order to evaluate the success of the learning.
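The separation into a training and a test data set can be sketched as follows. The article does not state how the split was made, so the random shuffling, the 30% test share and the record layout used here are assumptions for illustration only. The function `split_trips` and the field `trip_id` are hypothetical names.

```python
import random


def split_trips(trips, test_share=0.3, seed=0):
    """Randomly split trip records into (training, test) lists."""
    shuffled = trips[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the 311 logged trips of one user.
trips = [{"trip_id": i} for i in range(311)]
train, test = split_trips(trips)
```

The essential property is that no trip appears in both sets, so the test set measures how well the model generalizes to behaviour it has not seen.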

Optimal Number of Trees
The first step in building the model was to compute a sequence of (very) simple decision trees, where each successive tree was built for the prediction residuals of the preceding tree. Figure 2 shows examples of some of the simple decision trees used in the building process. As more and more trees were added to the model, the average squared error for the training data (from which the respective trees were estimated) decreased. This shows the improvement of the learning process: the model learned from the errors of previous trees and made more accurate predictions of the transportation mode that the individual will use. The average squared error for the testing data was used to estimate the optimal number of trees, namely the point at which the smallest testing error occurred. Table 3 shows the values of the standard errors for both the training and the test data sets.
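Selecting the number of trees from the test error curve can be sketched in a few lines. The error values below are made up for illustration and are not the values reported in Table 3. The function name `optimal_number_of_trees` is hypothetical.

```python
def optimal_number_of_trees(test_errors):
    """Return the 1-based tree count with the smallest test error."""
    best_index = min(range(len(test_errors)), key=test_errors.__getitem__)
    return best_index + 1


# Made-up error curves: entry i is the error after i + 1 trees.
train_errors = [0.40, 0.31, 0.25, 0.21, 0.18, 0.16]  # keeps decreasing
test_errors = [0.42, 0.35, 0.30, 0.29, 0.31, 0.34]   # reaches a minimum, then rises

best = optimal_number_of_trees(test_errors)
```

In this made-up example the training error keeps falling while the test error starts to rise after four trees, which is the typical sign that further trees only fit noise in the training data.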

Predictors Importance
Besides the standard error values, which indicate the overall quality of the model, another pertinent insight into the decision making process is the calculated importance of the predictors (Figure 3). The predictor importance shows which predictors influenced the choice of transportation mode the most. For the selected individual, the choice of transportation mode for any given trip is based mainly on the location information, followed by the starting time of the trip. Correlation analysis gave the highest (and statistically significant) value (0.328828) for evening hours, meaning that, among the examined factors, this one has the strongest association with the choice of transportation mode. For the city, this can indicate that the reduced number of public transportation lines serving certain locations in the evening hours limits the mobility options and therefore results in less sustainable mobility behaviour (car use).
Regarding the weather-related predictors, humidity and dew point have the highest importance in the decision making process of the given individual. The decision is least influenced by the information on whether the day is a working or non-working day and by the trip's purpose.

Classification Matrix
The classification matrix gives an overview of correctly classified and misclassified values, that is, of when the built model did and did not successfully predict which transportation mode the user would select. Figure 4 shows the histogram of the classification matrix, where the highest values on the diagonal mean that these transportation modes were correctly classified, i.e., that the boosting trees were able to model the user's decision making process from the given data set. Table 4 gives a more detailed overview of the classification results. The overall success of the boosted trees model in recognizing the transportation mode that the user will select in given circumstances is 73%. The highest success was obtained for the transportation mode walk, followed by car and bike, although the user made most of the trips by bike. The highest misclassification occurred between car and bike (28 trips, which corresponds to 20% of all bike trips) and the lowest between walk and bike (4 trips, or 6% of all walk trips).
Table 5 shows a sample of the boosting trees predictions for the test data set. Besides a more detailed insight into the misclassified values, the predictions give the score that each transportation mode obtained for a given trip. This way, one can evaluate which transportation mode would be the user's second option for a given trip if the first option were temporarily unavailable. The scores can also indicate which option is the least favoured by the user for a given trip under the given circumstances (apart from the transportation modes that this individual did not use at all during the data collection period). Here lies the highest potential for the city to influence the user's transportation mode selection.
For example, when the scores of two transportation modes are quite close, the city can favour the more sustainable one and give it priority. Alternatively, based on the starting and ending locations, the city can add an additional score/weight to the transportation mode that it wants to promote in that area (e.g., if there is bike highway infrastructure that corresponds to the user's trip location and the city wants to promote its usage). This can be communicated to the user through the order in which route suggestions appear when requested. Based on the knowledge gained from the decision making process model, the city can provide automated and personalized route guidance for each user in order to steer their behaviour towards a more sustainable one.
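The weighting idea described above can be sketched as a simple re-ranking of the per-mode scores. This is only one possible scheme: the multiplicative weights, the function name `rank_modes` and the score values are assumptions for illustration, and the article does not prescribe a specific weighting rule.

```python
def rank_modes(scores, policy_weights=None):
    """Order modes by model score multiplied by an optional city weight."""
    policy_weights = policy_weights or {}
    # Modes without an explicit weight keep their original score (weight 1.0).
    adjusted = {m: s * policy_weights.get(m, 1.0) for m, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Made-up model scores for one trip (illustrative only).
scores = {"car": 0.46, "bike": 0.41, "foot": 0.13}

user_order = rank_modes(scores)                  # order based on the model alone
city_order = rank_modes(scores, {"bike": 1.2})   # city promotes cycling in this area
```

With these made-up numbers, the car and bike scores are close, so a modest weight on the bike is enough to move it to the top of the suggestion list. For a trip where the car score clearly dominates, the same weight would leave the order unchanged, and the second entry in the list gives the user's likely fallback option.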

Conclusions
In this article, we modelled, based on detailed behavioural data, which transportation mode an individual is likely to use in a given context. To do this, we applied gradient boosted trees to crowdsourced data from three sources (a smartphone application, a GIS-based web interface and weather forecast data). These data were divided into two data sets (training and test), and the overall success of the suggested model was 73%. It should be noted that the expected results depend on the quality of the input data. Advances in location detection precision, in the automation of transportation mode detection for passively collected data and in trip segmentation can therefore improve the quality of the decision making model. With this in mind, our future research will focus on trip chaining and on the detection of multimodal combinations that can be suggested to the user.
The potential applications of the developed model include the provision of personalized services as well as personalized routing suggestions to users via a dedicated smartphone application. This is particularly interesting in the context of steering users' mobility behaviour towards a more sustainable one, as the city is able to weight the preferred options when ranking or scoring the suggestions. Also, by using personalised routing, the city can gradually influence the overall mobility behaviour of its citizens by making it more synchronized. For example, a "one-for-all" solution, such as redirecting all users towards less crowded streets, solves the local problem but, at the level of the connected network, usually just relocates the traffic jam to a new location. Instead, the combined small changes made by individuals could be managed to have a considerable and balanced joint impact. Furthermore, the suggested model can be used as a smart city mobility management platform and as a communication tool between policy makers and citizens: through the suggestions, the city can communicate preferable routes, and users can, by accepting or rejecting the suggestions, provide feedback on the personalized results and route preferences. In addition, the city can gain insights into the usage of its network infrastructure and manage traffic flows in incident situations by providing personalized alternative routes in such cases. For a user, the developed model can be seen as a filtering tool: for a given trip, they will not need to consult several search engines (e.g., the national rail company for train routes, the local public transportation company for bus routes, Google Maps for pedestrian routes, etc.), as route suggestions can be searched and displayed in one place in line with their personal preferences.