Finding Location Visiting Preference from Personal Features with Ensemble Machine Learning Techniques and Hyperparameter Optimization

For the question of how personal factors relate to location selection, many studies support an effect of personal features on personal location preference. However, it has also been found that not all personal factors are effective for location selection. In this research, only distinguishing personal features, excluding meaningless ones, are used to predict the visiting ratio of specific location categories with three different machine learning techniques: Random Forest, XGBoost, and Stacking. Through our research, the accuracy of predicting the visiting ratio to a specific location from personal features is analyzed. Personal features and visited-location data were collected from tens of volunteers for this research. The different machine learning methods showed very similar tendencies in prediction accuracy. In addition, prediction precision is improved by applying hyperparameter optimization, which is a part of AutoML. Applications such as location-based services can utilize our results, for example for location recommendation.


Introduction
Prior research shows that human personality and favorite visiting places have a considerable relationship. The coefficient of determination has been used to quantify the relation between personality and favored locations [1]. Using probability models such as the Poisson distribution, the relation between personality and favored locations is identified and a personal mobility model is predicted in [2,3]. These are traditional methods of analysis based on statistics. An attempt to apply machine learning to such analyses can be found in [4], in the form of a back-propagation network. Nowadays, many new methods, including machine learning technologies, can be adopted for this sort of analysis. In this research, we show the relation between personal factors and favorite locations using various machine learning techniques, especially ensemble techniques, and verify the consensus among the results of these methods.
Ensemble techniques combine the results of independent models and thus achieve higher precision than a single model alone. To apply up-to-date machine learning, ensemble techniques over several models are used in this research. Two representative ensemble techniques are employed: bagging and boosting. For bagging, Random Forest is used since it is widely adopted. For boosting, we use XGBoost since it offers high performance and fast training and is also widely used. Both Random Forest and XGBoost have decision trees as their base model. We also use Stacking, as described in Section 2.4, to verify that regression models other than decision trees are also effective in our setting, in the form of meta learning. Different from previous research, our focus is to confirm the common belief about the relationship between personality and location selection with state-of-the-art technologies. Random Forest prevents overfitting through its bagging technique. XGBoost uses boosting, which applies repeated random sampling for weighted sequential learning; boosting also makes it possible to reduce bias. Stacking combines multiple machine learning methods in multistage learning, exploiting the strengths of multiple models while complementing their weaknesses. Stacking shows better performance than the other models but requires high computational cost. With these three ensemble methods of differing characteristics, the results of the three techniques can be cross-verified to show the consensus among the three result sets. In Section 2, we present the related techniques; in addition to the machine learning techniques used in this research, considerations of personality factors are discussed. Section 3 presents the details of the data and the experiment, including the handling of personal factors and location categories.
SMAPE, which measures prediction error, is also introduced, along with the search space for hyperparameter optimization. Section 4 presents the results of the analysis by Random Forest, XGBoost, and Stacking and evaluates them; the results of all three techniques are cross-verified and discussed, as are the results of feature selection and hyperparameter optimization. For Random Forest and XGBoost we can apply both feature selection and hyperparameter optimization; for Stacking, hyperparameter optimization was omitted due to its high computational cost. However, feature selection alone improved prediction accuracy for most location categories. There is high similarity among the three result sets, and thus a consensus can be established for the relationship between location categories and personal features. Section 5 concludes this research with future work.

Ensemble Techniques
An ensemble model is a technique that builds one strong prediction model by combining several existing machine learning models, and it usually shows better prediction performance than a single-model approach. Three distinguished ensemble methods are bagging, boosting, and stacking. Bagging reduces variance, boosting reduces bias, and stacking improves prediction performance overall. Ensemble methods are highly ranked in machine learning competitions such as the Netflix competition, KDD Cup 2009, and Kaggle [11]. Random Forest is a typical model using bagging. The boosting algorithm has evolved into XGBoost, which is nowadays a common choice with reliable performance. Stacking combines different models, whereas bagging and boosting depend on a single base model; in this way, the strengths of different algorithms are exploited while the weakness of each algorithm is compensated. We use all three methods, Random Forest, XGBoost, and Stacking, to show the relationship between personal features and favorite location categories [10].

Random Forest
Random Forest was suggested by Leo Breiman in 2001 [12]. A Random Forest is a combination of different decision trees. Each decision tree can predict well on its own but carries a risk of overfitting to the training data. A combination of various independent decision trees, with their results averaged, yields good prediction performance without the overfitting effect. Bootstrap sampling is usually used to generate multiple independent trees, and each node utilizes only a subset of the input features, so each branch of a decision tree uses a different feature subset. This learning process keeps the decision trees in a Random Forest uncorrelated. The final result of a regression analysis is the average of the results from the individual decision trees; we use the average in this research, while for a classification problem the final value can be decided by voting among the trees. Random Forest is resistant to noise. In addition, the degree of effect of each input feature can be represented numerically as an importance value; with feature importance, important and effective input features can be selected. It also works well on very large datasets, its training can be parallelized simply, and it is well suited to many input features [13]. Random Forest is one of the most widely used machine learning algorithms with excellent performance. It works well even without much hyperparameter tuning and does not require data scaling; nevertheless, some hyperparameters are optimized here in order to examine their effect. Due to these advantages and this performance, Random Forest was used for this study.
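As an illustrative sketch (not the study's actual pipeline; the data, feature count, and hyperparameter values here are synthetic stand-ins), a Random Forest regressor with bootstrap sampling exposes the per-feature importance values referred to above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 8 hypothetical personal features for 200 people.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
# Hypothetical target: a visiting ratio driven mostly by features 0 and 3.
y = 0.6 * X[:, 0] - 0.3 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Bagging of decision trees: bootstrap resampling keeps the trees decorrelated,
# and the regression output is the average over the individual trees.
rf = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=42)
rf.fit(X, y)

# Importance values sum to 1; larger means more influence on the prediction.
importances = rf.feature_importances_
```

The importance vector is what later drives feature selection: the informative features (0 and 3 in this synthetic setup) receive the largest values.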

XGBoost
XGBoost uses boosting instead of bagging as its ensemble strategy. Boosting is a technique that binds weak learners together to build a powerful predictor. Unlike bagging, which aggregates the results of independent models, boosting generates boosters (base models) sequentially. Bagging aims at a model with good general performance, while boosting concentrates on solving difficult cases: it assigns high weight to incorrectly predicted samples and low weight to correctly predicted ones. Boosting tends to be vulnerable to outliers even though it shows high prediction accuracy.
XGBoost stands for eXtreme Gradient Boosting and was developed by Tianqi Chen [14]. XGBoost is a performance-upgraded version of the Gradient Boosting Machine (GBM). XGBoost is widely used in various challenges; for example, 17 of 29 prize-winning Kaggle solutions in 2015 were implemented with XGBoost [11]. XGBoost is a supervised learning algorithm and an ensemble method like Random Forest, suitable for regression, classification, and so on. XGBoost aims to be a scalable, portable, and accurate library: it utilizes parallel processing and offers high flexibility. It has automatic pruning based on a greedy algorithm and thus shows less chance of overfitting. XGBoost has various hyperparameters; the learning rate, the max depth of the booster, the choice of booster, and the number of boosters are particularly effective for performance. The available boosters of XGBoost are gbtree, gblinear, and dart. Gbtree uses regression trees (decision trees with continuous real-valued targets) as weak learners; gblinear uses linear regression models as weak learners; and dart uses regression trees with dropout, a technique usually found in neural networks.

Stacked Generalization (Stacking)
Stacking is an ensemble method of machine learning proposed by D. H. Wolpert in 1992 [15]. The key idea of Stacking is to train various machine learning models independently and then let a meta model learn with the results of those models as its inputs; thus, Stacking has a stack of two or more learning phases. Stacking combines various machine learning models so that it addresses the high-variance problem, fulfilling the basic purpose of ensemble methods, differently from bagging, boosting, and voting. In addition, the combinations are made so as to obtain the strengths of the individual models and compensate their weaknesses.
The training stage is composed of two phases. At level 1, the sub-models are trained with the training data, as in other methods; usually various sub-models are utilized in order to generate diverse predictions. At level 2, the predictions generated at level 1 are regarded as input features, and then the meta learner (or blender), which is the final predictor, generates the final prediction. In this stage, overfitting and bias are reduced since level 2 uses training data different from level 1 [16]. Since Stacking itself does not provide feature selection, the input features selected by the Random Forest model were used. In our research, ExtraTreesRegressor, RandomForestRegressor, XGBRegressor, LinearRegressor, and KNeighborsRegressor are used at level 1, and the final result was selected, per location category, as the lower-error result between XGBoost and Random Forest.
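The two-phase training above can be sketched with scikit-learn's StackingRegressor. This is a hedged illustration: the data are synthetic, GradientBoostingRegressor stands in for XGBRegressor so the sketch needs only scikit-learn, and the meta learner is the library default (RidgeCV), since the paper does not name its level-2 model.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))
y = X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=150)

# Level-1 sub-models, mirroring the five regressors listed in the text
# (GradientBoostingRegressor substitutes for XGBRegressor here).
level1 = [
    ("extra_trees", ExtraTreesRegressor(n_estimators=50, random_state=1)),
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=1)),
    ("boosting", GradientBoostingRegressor(random_state=1)),
    ("linear", LinearRegression()),
    ("knn", KNeighborsRegressor()),
]

# Level 2: cross-validated level-1 predictions become the meta learner's
# inputs, which is what reduces overfitting and bias in this stage.
stack = StackingRegressor(estimators=level1, cv=5)
stack.fit(X, y)
r2 = stack.score(X, y)
```

The `cv=5` argument is what gives level 2 training data different from level 1: each level-1 prediction fed to the meta learner is made on held-out folds.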

Feature Selection
A machine learning model is usually a linear or non-linear function of the input data, so the selection of training data is not negligible for building a better model. A larger volume of data does not guarantee a better model; on the contrary, it can mislead toward incorrect results. For example, a linear function with a large number of independent variables does not guarantee a good prediction of the expected value of the dependent variable. In other words, the performance of machine learning is highly dependent on the input data set, as meaningless input features hinder the learning result. Therefore, it is necessary to select meaningful features from the collected data before model training, which is called feature selection. Subsets of the data are used to test the validity of features. In tree models, for example, the importance of a feature is higher the closer it sits to the root of the tree, so the importance of a specific feature can be inferred. For regression models, feature selection can be done with forward selection and/or backward elimination. Besides improving model performance, feature selection has the benefit of reducing the dimension of the input data. Through feature selection, we obtain an input data set that is smaller than the real observation space yet has better explainability, which is useful for very big data or under restricted resources or time [7].
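The importance-based selection described above can be sketched with scikit-learn's SelectFromModel (synthetic data; the median threshold is an illustrative choice, not the paper's criterion):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
# Only three of the ten hypothetical features actually drive the target.
y = X[:, 0] + 0.7 * X[:, 4] - 0.5 * X[:, 7] + rng.normal(scale=0.1, size=200)

# Keep features whose Random Forest importance exceeds the median importance.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=3),
    threshold="median")
selector.fit(X, y)

mask = selector.get_support()      # boolean mask over the 10 input features
X_reduced = selector.transform(X)  # lower-dimensional input for training
```

The reduced matrix `X_reduced` is what would be fed back into model training, realizing the dimension reduction discussed above.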

Hyperparameter Optimization
Unlike parameters learned from the data, hyperparameters are set for the machine learning model itself; the learning rate for deep learning or the max depth of a decision tree are prominent examples. It is very important to find an optimal value for each hyperparameter since hyperparameters significantly affect model performance. Manual search for hyperparameters usually requires the intuition of skilled researchers and luck, and consumes too much time. Two well-known systematic methodologies are grid search and random search [8]. Grid search tries hyperparameter values at predefined regular intervals and picks the values with the highest performance, although human decisions (the grid itself) are still required. The uniform and global nature of grid search makes it better than manual search, but it also requires much time as the number of hyperparameter combinations increases. Random search, a random-sampling variant of grid search, speeds up the hyperparameter search. However, both methods perform repetitive, unnecessary evaluations since they do not utilize prior knowledge gathered during the search. Bayesian optimization is another useful method that makes systematic use of prior knowledge [9]. Bayesian optimization assumes an objective function f of an input value x, where f is a black box that takes time to compute; the main purpose is to find the optimum x* of f with as few evaluations as possible. A surrogate model and an acquisition function are required. The surrogate model probabilistically estimates the unknown objective function based on the known pairs of input values and function values (x_1, f(x_1)), ..., (x_t, f(x_t)). Probability models used as surrogates include Gaussian Processes (GP), Tree-structured Parzen Estimators (TPE), deep neural networks, and so on. The acquisition function recommends the next useful input value x_{t+1}, based on the current estimate of the objective function, in order to search for the optimal input value x*.
Two strategies are well known: exploration and exploitation. Exploration recommends points with high standard deviation, since x* may exist in an uncertain region; exploitation recommends points around the point with the highest function value so far. Both strategies matter for finding the optimal input value x*, but calibrating the ratio between them is important since they trade off against each other. Expected Improvement (EI) is the most frequently used acquisition function since it contains both exploration and exploitation. Probability of Improvement (PI) is the probability that the function value at a given input x is larger than the best value observed so far, f(x+); EI then evaluates the effect of x by considering both PI and the magnitude of the difference between f(x) and f(x+). Besides EI and PI, other acquisition strategies such as Upper Confidence Bound (UCB) and Entropy Search (ES) also exist.
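The grid search and random search strategies above can be sketched with scikit-learn (Bayesian optimization would require an extra library such as scikit-optimize or Optuna, so only the first two are shown; the search space is an illustrative one, not Table 5's actual space):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 5))
y = X[:, 0] + rng.normal(scale=0.2, size=120)

# Predefined search space: 2 x 3 = 6 candidate combinations.
space = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

# Grid search: evaluates every combination by cross-validation.
grid = GridSearchCV(RandomForestRegressor(random_state=7), space, cv=3)
grid.fit(X, y)

# Random search: samples only n_iter of those combinations, trading
# exhaustiveness for speed.
rand = RandomizedSearchCV(RandomForestRegressor(random_state=7),
                          param_distributions=space, n_iter=4, cv=3,
                          random_state=7)
rand.fit(X, y)
```

Here grid search fits 6 x 3 = 18 models while random search fits 4 x 3 = 12, which is the execution-time trade-off revisited in Section 4.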

Big Five Factors (BFF)
BFF is a model of personality factors suggested by P. T. Costa and R. R. McCrae in 1992 [17]. It has five factors: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Participants answer a set of questionnaires, and each of the five factors is scored from 1 to 5 points. Since BFF can represent conceptual human personality numerically, much research adopts BFF [18][19][20][21][22][23].

Preparation of Experiments
Previous research results showed that various personal factors affect the favorite visiting place [5,6]. In addition, the effective personal factors vary widely according to each location category [5]. In this research, more precise experiments were designed, including feature selection and hyperparameter optimization with three different machine learning methodologies. More than 60 volunteers collected their own data for this research; however, the data from some volunteers were too sparse, so only meaningful data sets survived. From the data of 34 volunteers, the personal factors used in the experiments were: To compare the experimental results, the Symmetric Mean Absolute Percentage Error (SMAPE) discussed in Section 3.4 is used. SMAPE usually ranges from 0 to 200%, but we normalized it to a range of 0 to 100% by revising the formula for intuitive comparison. As a result, prediction accuracy is the difference between 100% and the SMAPE value.

Personal Factors
BFF stands for Big Five Factors, where the five factors are Openness (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and Neuroticism (N). Each factor is measured numerically so that it can be easily applied to the training process. Table 1 shows the BFF of the participants; we can figure out the personality of a person from these values. A person with high Openness is creative, emotional, and interested in the arts. A person with high Conscientiousness is responsible, achieving, and restrained. A person with high Agreeableness is agreeable to other people, altruistic, thoughtful, and modest, while a person with high Neuroticism is sensitive to stress, impulsive, hostile, and depressed. For example, as shown in Table 1, person 4 is creative, emotional, responsible, and restrained; considering person 4's Neuroticism score, person 4 is also not impulsive and is resistant to stress. The personalities shown in Table 1 are used as our experimental basis together with the other personal factors.
In Table 2, the numbers corresponding to the responses are encoded per factor; for example, a value of 2 for "The highest level of education" denotes a high school graduate. In the case of Person 1: high school graduate, no religion, income of USD 500 to 1000, public transport, commute of 1 to 2 h, two or three travels per year, 1 to 3 h spent on social media per day, and both dynamic and static hobbies. Static activities include watching movies and plays, reading a book, and so on, while dynamic activities include sports, food tours, and so on.

Location Category Data
The SWARM application, installed on smartphones, was used to collect geo-positioning data [24]. Users actively check in at visited places with SWARM, and these actively collected location data are used as part of our analysis. The location data consist of a location type (such as restaurant, home, or bus stop) and a timestamp for a specific person. Volunteers collected their own location-visiting data with their own smartphones. The location category data were used as the labels (target data) for supervised learning with Random Forest, XGBoost, and Stacking. The location category data come from check-ins at the visited places through the SWARM application; afterwards, the number of visits and the visited places were identified from the SWARM web page. Part of the location data of person 16 is shown in Table 3. The collected data were classified into 10 categories; Table 4 shows the classification of person 16's location data into these categories. To input the categorized location data to the machine learning models, the visiting ratio of each location category is used as the label, defined by Formula (1) as follows.
Visiting_Ratio = count_of_visits_to_location / total_count_of_visits (1)

Table 5 shows the hyperparameter search space for this research. For example, the booster of XGBoost, which is its base model, is either 'gblinear', 'gbtree', or 'dart'. The tree-based boosters are gbtree and dart, while gblinear is based on a linear function; dart adds dropout, borrowed from deep learning models. For the tree-based models, additional parameters such as min_samples_leaf and min_samples_split are also available. These hyperparameters would normally be set to adequate values in order to prune against overfitting; in this research, they were left at their default values, since we have a relatively small amount of data and these minor hyperparameters were found to be less effective for accuracy.
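Formula (1) amounts to a simple frequency count over a person's check-in log; a sketch with a hypothetical log:

```python
from collections import Counter

# Hypothetical check-in log for one person (timestamps omitted).
checkins = ["Restaurant", "Pub", "Restaurant", "Beverage Store",
            "Restaurant", "Pub"]

counts = Counter(checkins)
total = len(checkins)
# Formula (1): visits to a category divided by all visits.
visiting_ratio = {category: n / total for category, n in counts.items()}
# e.g. visiting_ratio["Restaurant"] == 3 / 6 == 0.5
```

One such ratio per location category, per person, serves as the regression label.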

Symmetric Mean Absolute Percentage Error
We used SMAPE as the accuracy measure. SMAPE is an abbreviation for Symmetric Mean Absolute Percentage Error. It is an alternative to the Mean Absolute Percentage Error when there is zero or near-zero demand for items [25][26][27]. SMAPE itself is limited to an error rate of 200%, reducing the influence of such low-volume items. It is usually defined as Formula (2), where A_t is the actual value and F_t is the forecast value. Formula (2) yields a result between 0% and 200%. However, Formula (3) is often used in practice since a percentage error between 0% and 100% is much easier to interpret, and we also use that formula.
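In its usual formulation, SMAPE averages |F_t - A_t| / ((|A_t| + |F_t|) / 2) over all t, which ranges from 0% to 200%; dropping the factor 1/2 in the denominator gives the normalized 0-100% variant. A sketch of the normalized form (the zero-denominator handling is our assumption, not specified in the text):

```python
import numpy as np

def smape(actual, forecast):
    """Normalized SMAPE in [0, 100]%: mean of |F - A| / (|A| + |F|), scaled.

    Terms where both actual and forecast are zero are treated as zero error.
    """
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    denom = np.abs(a) + np.abs(f)
    terms = np.divide(np.abs(f - a), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * terms.mean()

# Prediction accuracy as used in this paper: 100% minus SMAPE.
accuracy = 100.0 - smape([0.2, 0.4, 0.1], [0.2, 0.4, 0.1])  # perfect -> 100.0
```

A forecast of 3 against an actual of 1, for instance, gives |3 - 1| / (1 + 3) = 0.5, i.e., a 50% error under this normalization.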

Analysis of Results
We present the experimental results in this section, mainly in the form of tables and graphs. Table 6 shows the selected features for each learning model and the corresponding prediction accuracy for each location category. Prediction accuracy is represented as 100% minus SMAPE. Random Forest and Stacking use the same feature set, since feature selection was done with Random Forest. The abbreviations for the machine learning algorithms are as follows:
• RF: Random Forest
• XGB: XGBoost
• STK: Stacking
As expected, the features selected by Random Forest and by XGBoost overlap, but not completely; this is due to the difference in learning strategy between the two models. From a big-data perspective, feature selection can reduce noise and lessen the effect of overfitting, increasing accuracy. However, Figures 1 and 2 show that the accuracy is slightly degraded here, which is probably due to the restricted size and nature of the data used in these experiments. In the case of XGBoost, prediction accuracy increased for location categories such as foreign institutions and hospital, and for location categories with various subcategories. We found that foreign institutions and hospital have inherently few data points, and that the categories with many subcategories aggregate various unrelated subcategories. In addition, Figure 3 shows that Stacking with feature selection resulted in high prediction accuracy; the aggregation of five different models in Stacking may reduce the noise of the data. For foreign institutions and hospital, which have very few visits, the five models may each reach low accuracy at level 1 of Stacking, and the aggregated result is then also of low accuracy. It is notable that several BFF factors were always included among the selected features, supporting the claim that personality is highly related to visiting places. Table 7 shows the results numerically: hyperparameter optimization and prediction accuracy.
In Table 7, the hyperparameter values shown for RF are n_estimators, max_depth, and bootstrap, respectively, and for XGB they are n_estimators, max_depth, learning_rate, and booster. It is notable that feature selection leads to a decrease in accuracy here, whereas prediction accuracy can be increased with hyperparameter optimization. For most location categories, Random Forest optimization selects bootstrap, which is useful for smaller amounts of input data. In addition, different optimization methods for the same location category lead to similar values of max_depth or n_estimators. Interestingly, different hyperparameters can still lead to similar accuracy; perhaps the large structures generated in the learning process by Random Forest and XGBoost enable the accuracy to converge. XGB, in particular, is highly dependent on the selection of the booster, so it is important to select an adequate booster for XGB. The number of iterations for Bayesian optimization is also important. In the case of theater and concert hall, a low number of Bayesian optimization iterations leads to the 'gblinear' booster, and the accuracy is reduced by 20% compared with grid search or random search; the linear function of 'gblinear' shows a big accuracy gap compared with the tree-based 'gbtree' and 'dart'. On the other hand, too many iterations lead to low prediction accuracy due to overfitting. We concluded that the appropriate number of iterations is in the range of 50 to 60.
As expected, the execution times of the three optimization methods are quite different. Table 8 shows the execution times with hyperparameter optimization; Figure 4 is for RF and Figure 5 for XGB, respectively. Even though grid search and random search show similar prediction performance, there is a big difference in execution time. Bayesian optimization is a little slower than random search but considerably faster than grid search. We consider Bayesian optimization the best choice due to its balance of execution time and performance, and because prior knowledge is reflected in the search. The overall accuracy of the three models can be found in Tables 9-11, and the aforementioned Figures 1-3 show the accuracy under each experimental condition. For Stacking, hyperparameter optimization could not be applied due to the nature of Stacking as meta learning: once hyperparameter optimization is applied to Stacking, every model in level 1 must undergo hyperparameter optimization, resulting in a drastic increase in execution time. In fact, Stacking with feature selection shows prediction accuracy as high as RF and XGB with hyperparameter optimization. Figure 6 shows the prediction accuracy for each location category. The results for foreign institutions and hospital are unreliable due to low accuracy, caused by the small amount of raw data. For the other location categories, prediction accuracy is in the range of 50% to 80%. As mentioned above, categories with too many varied subcategories show somewhat low accuracy, while splitting such subcategories apart could lead to higher prediction accuracy; distinct location categories such as restaurant, pub, beverage store, and theater and concert hall show high accuracy.

Conclusions and Future Works
Location-Based Services (LBS) and recommendation systems are typical examples of personalized services. For example, content providers such as Netflix have opened competitions for recommendation system development. However, recommendation systems suffer from the cold start problem, which makes recommendation difficult for new users or new content, and the protection of personal history data is another problem. Both could be mitigated if we could predict user preference from basic user features, regardless of history. Various studies have discussed that human personality and location preference are highly related, and personal features other than personality are also related to location-visiting preference. Naturally, some personal features are meaningful and distinguishing for personal location preference.
In this research, using three different machine learning methods, we figured out the effects of distinguishing personal features on personal location preference. As a result, eight location categories out of ten showed meaningful prediction accuracy: Retail Business; Service Industry; Restaurant; Pub; Beverage Store; Theater and Concert Hall; Institutions of Education; and Museum, Gallery, Historical Sites, Tourist Spots. The three algorithms showed very similar tendencies in their analysis results, so the prediction accuracy seems reliable. In addition, the input features that affect location category selection were selected with Random Forest and XGBoost; naturally, these input features depend on the location category. Based on our research, visiting preference for such location categories is highly predictable from personal features. Furthermore, hyperparameter optimization, a kind of AutoML technology, was introduced in order to increase prediction accuracy: grid search, random search, and Bayesian optimization were applied and their results compared.
In our research, we demonstrated a method for visiting-place prediction. For a large amount of input data, feature selection is applied in order to reduce the dimension of the data and increase the quality of the input; in such cases, Stacking can be one of the best solutions even without hyperparameter optimization. On the contrary, for smaller amounts of input data, bagging or boosting together with hyperparameter optimization can be a better solution, since Stacking may show poor prediction accuracy. Further research is needed, especially for location categories such as service industry and retail business, since too many subcategories make the categorization vague. In addition, less sensitive personal features for the prediction of visiting location must exist, and such features should be identified. Regarding the volunteers' data, we need to expand the collection to more data over a wider span of participants, since our data have the clear limitation of the volunteer pool: engineering students in their twenties.

Conflicts of Interest:
The authors declare no conflict of interest.