Article

Finding Location Visiting Preference from Personal Features with Ensemble Machine Learning Techniques and Hyperparameter Optimization

1 Lotte Data Communication Company, Seoul 08500, Korea
2 Department of Computer Engineering, Hongik University, Seoul 04066, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(13), 6001; https://doi.org/10.3390/app11136001
Submission received: 6 May 2021 / Revised: 20 June 2021 / Accepted: 23 June 2021 / Published: 28 June 2021
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Regarding the relationship between personal factors and location selection, many studies support the effect of personal features on personal location preference. However, it has also been found that not all personal factors are effective for location selection. In this research, only distinguishing personal features, excluding meaningless ones, are used to predict the visiting ratio of specific location categories with three different machine learning techniques: Random Forest, XGBoost, and Stacking. We analyze how accurately the visiting ratio of a specific location category can be predicted from personal features. Personal features and visited-location data were collected from tens of volunteers for this research. The different machine learning methods showed very similar tendencies in prediction accuracy. In addition, prediction precision was improved by applying hyperparameter optimization, which is a part of AutoML. Applications such as location-based services can utilize our results, for example for location recommendation.

1. Introduction

Prior research shows that human personality and favored visiting places have a considerable relationship. The coefficient of determination has been used to relate personality to favored locations [1]. Using probability models such as the Poisson distribution, the relationship between personality and favored locations was identified and a personal mobility model was predicted in [2,3]. These are traditional methods of analysis based on statistics. An attempt to use machine learning for this kind of analysis can be found in [4], in the form of a back-propagation network. Nowadays, many new methods, including machine learning technologies, can be adopted for this sort of analysis. In this research, we show the relationship between personal factors and favorite locations using various machine learning techniques, especially ensemble techniques, and verify the consensus among the results of these methods.
Ensemble techniques combine the results of several independent models and are therefore usually more precise than a single model. In this research, ensemble techniques built from up-to-date machine learning models are used. Two representative ensemble techniques are applied: bagging and boosting. For bagging, Random Forest is used since it is widely adopted. For boosting, we used XGBoost since it offers high performance and fast training time and is also widely used. Both Random Forest and XGBoost use the decision tree as their base model. We also used Stacking, as described in Section 2.4, in order to verify that regression models other than decision trees are also effective in our research in the form of meta learning. Different from previous research, our focus is to confirm, with state-of-the-art technologies, the common belief in a relationship between personality and location selection. Personal features other than personality, such as age, religion, method of transportation, salary, and so on, are also used for this relationship analysis. In addition, the results of the ensemble methods are presented numerically.
As inputs of the analysis, besides personality, other personal factors such as salary, method of transportation, and religion have been found to be related to favorite locations [5,6]. However, not all of these personal factors are meaningful for location preference: meaningless input features degrade the prediction accuracy of the relationship. Therefore, feature selection [7] was executed for each location category. Prediction accuracy was then improved through hyperparameter optimization, which was done in three different ways, following the current advancement of AutoML: grid search, random search [8], and Bayesian optimization [9]. Grid search and random search are two representative methods of hyperparameter optimization. Grid search takes a long time since it checks the performance of every possible candidate combination of parameters. Random search is faster than grid search since it checks several samples randomly, but it is less precise. These two methods share the shortcoming that information from the current search cannot be transferred to the next step. Bayesian optimization overcomes this shortcoming: it utilizes prior knowledge to search for the optimal value with shorter search time and higher precision. In this research, all three hyperparameter optimization methods are used.
The Big Five Factors (BFF) of personality, namely Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, are used together with the highest level of education, religion, salary, method of transportation, commute time, the frequency of journeys in one year, social media usage status, time spent on social media per day, and the category of personal hobby. BFF is a taxonomy of personality characteristics presented by Costa and McCrae in 1992, and it has been found useful for personality-related research. In this research, the numerical BFF scores are used as part of the input data. With these selected inputs, we apply machine learning techniques to the relationship analysis. Three machine learning methods were used: Random Forest, XGBoost, and Stacking, each of which is a kind of ensemble technique.
From various research perspectives, ensemble methods have been proven to compensate for the weakness of a single model and to improve generalization performance [10]. Random Forest, in particular, prevents overfitting by using bagging. XGBoost uses boosting, which applies repeated random sampling for weighted sequential learning; boosting can also reduce bias. Stacking combines multiple machine learning methods in order to exploit the strengths of multiple models and complement their weaknesses in a form of multistage learning. Stacking tends to perform better than other models while requiring high computational cost. With these three ensemble methods of different characteristics, the results of the three techniques are cross-checked in order to show the consensus of the three result sets.
In Section 2, we review related techniques; in addition to the machine learning techniques used in this research, considerations of personality factors are discussed. Section 3 gives details of the data and the experiments: the handling of personal factors and location categories is discussed, and SMAPE, the prediction error measure, is introduced along with the search space for hyperparameter optimization. Section 4 presents and evaluates the results of the analysis by Random Forest, XGBoost, and Stacking, including the results of feature selection and hyperparameter optimization. For Random Forest and XGBoost we apply both feature selection and hyperparameter optimization; for Stacking, hyperparameter optimization was omitted due to its high computational cost, although feature selection alone improved prediction accuracy for most location categories. There is high similarity among the three result sets, so a consensus can be established for the relationship between location categories and personal features. Section 5 concludes this research with future works.

2. Related Works

2.1. Ensemble Techniques

An ensemble model is a technique that builds one strong prediction model by combining several existing machine learning models, and it usually shows better prediction performance than a single-model approach. Three distinguished ensemble methods are bagging, boosting, and Stacking. Bagging reduces variance, boosting reduces bias, and Stacking improves prediction performance by combining heterogeneous models. Ensemble methods have been highly ranked in machine learning competitions such as the Netflix competition, KDD Cup 2009, and Kaggle [11]. Random Forest is a typical model using bagging. Boosting algorithms have evolved into XGBoost, which is nowadays a usual choice with reliable performance. Stacking combines different kinds of models, while bagging and boosting depend on a single type of base model; in this way, the strengths of different algorithms are exploited while the weaknesses of each can be compensated. We use all three methods, Random Forest, XGBoost, and Stacking, in order to show the relationship between personal features and favorite location categories [10].

2.2. Random Forest

Random Forest was proposed by Leo Breiman in 2001 [12]. A Random Forest is a combination of different decision trees. A single decision tree can predict well, but there is a possibility of overfitting to the training data. A combination of various independent decision trees, with the results averaged, can deliver prediction performance without the overfitting effect. Bootstrap sampling is usually used to generate multiple independent trees, each node utilizes only part of the input features, and each branch of a decision tree uses a different subset of features. This learning process makes the decision trees in a Random Forest uncorrelated. The final result of regression analysis is the average of the results from each decision tree; in this research we use the average, while for classification problems the final value can be voted from the results of each tree. Random Forest is resistant to noise. In addition, the degree of effect of each input feature can be represented numerically as an importance value, and with feature importance, effective input features can be selected. It also works well on very large datasets, parallelizes training easily, and is appropriate when there are many input features [13]. Random Forest is one of the most widely used machine learning algorithms with excellent performance. It works well even without much hyperparameter tuning and does not need scaled data; nevertheless, some hyperparameters were optimized in order to figure out their effect. Due to these advantages and its performance, Random Forest was used for this study.
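As an illustration only (not the authors' code), the following sketch shows how a Random Forest regressor could be trained on a personal-feature matrix to predict the visiting ratio of one location category and how feature importances could be read out. The feature names and data here are placeholders.

```python
# Minimal sketch: Random Forest regression on personal features with feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["O", "C", "E", "A", "N", "Salary", "Religion"]  # hypothetical subset
X = np.random.rand(34, len(feature_names))  # 34 participants x features (placeholder data)
y = np.random.rand(34)                       # visiting ratio of one location category (placeholder)

model = RandomForestRegressor(n_estimators=300, max_depth=10, bootstrap=True, random_state=0)
model.fit(X, y)

# Importance values indicate how strongly each personal feature affects the prediction.
for name, imp in sorted(zip(feature_names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```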

2.3. XGBoost

XGBoost uses boosting instead of bagging as its ensemble strategy. Boosting is a technique that binds weak learners in order to build a powerful predictor. Unlike bagging, which aggregates the results of independent models, boosting generates boosters, i.e., base models, sequentially. Bagging aims to produce a model with general performance, while boosting concentrates on solving difficult cases: it assigns high weight to incorrectly answered samples and low weight to correctly answered ones. Boosting tends to be weak against outliers even though it shows high prediction accuracy.
XGBoost stands for Extreme Gradient Boosting and was developed by Tianqi Chen [14]. XGBoost is a performance-upgraded version of the Gradient Boosting Machine (GBM). XGBoost has been widely used in machine learning challenges; for example, 17 of the 29 prize-winning Kaggle solutions in 2015 were implemented with XGBoost [11]. XGBoost is a supervised learning algorithm and an ensemble method like Random Forest, and it is suitable for regression, classification, and so on. XGBoost aims to be a scalable, portable, and accurate library; in other words, it utilizes parallel processing and has high flexibility. It has an automatic pruning facility based on a greedy algorithm and thus shows less chance of overfitting. XGBoost has various hyperparameters, among which the learning rate, the max depth of a booster, the selection of the booster, and the number of boosters strongly affect performance. The available boosters of XGBoost are gbtree, gblinear, and dart. Gbtree uses a regression tree, i.e., a decision tree with continuous real-valued targets, as its weak learner. Gblinear uses a linear regression model as its weak learner, and dart uses a regression tree with dropout, a technique usually used in neural networks.
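A minimal sketch of how an XGBoost regressor with a chosen booster might be set up for this kind of visiting-ratio regression is shown below; the parameter values are illustrative, not those used in the experiments.

```python
# Minimal sketch: XGBoost regression with an explicit booster choice.
import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(34, 10)  # placeholder personal-feature matrix
y = np.random.rand(34)      # placeholder visiting ratios

model = XGBRegressor(
    booster="gbtree",       # or "gblinear" / "dart", as described in the text
    n_estimators=300,
    max_depth=5,
    learning_rate=0.1,
)
model.fit(X, y)
pred = model.predict(X)
```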

2.4. Stacked Generalization (Stacking)

Stacking is an ensemble method of machine learning proposed by D. H. Wolpert in 1992 [15]. The key idea of Stacking is to train various machine learning models independently and then let a meta model learn from the outputs of those models as its inputs; thus Stacking has a stack of two or more learning phases. Stacking combines various machine learning models so that it addresses the high-variance problem, fulfilling the basic purpose of ensemble methods, in a way different from bagging, boosting, and voting. In addition, the combination is made so as to obtain the strengths of each model and compensate for the weaknesses of each model.
The training stage is composed of two phases. At level 1, sub models are trained with the training data, similar to other methods; usually various sub models are utilized in order to generate diverse predictions. At level 2, the predictions generated at level 1 are regarded as input features, and a meta learner or blender, which is the final predictor, generates the final prediction. In this stage, overfitting and bias are reduced since level 2 uses different training data from level 1 [16]. Since Stacking does not provide feature selection by itself, the input features selected by the Random Forest model were used. In our research, ExtraTreesRegressor, RandomForestRegressor, XGBRegressor, LinearRegressor, and KNeighborsRegressor are used at level 1, and for each location category the final result is taken from XGBoost or Random Forest at level 2, whichever gives the lower error rate.
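The following sketch shows one possible realization of this two-level scheme with scikit-learn's StackingRegressor; the level-1 models follow the list above (using scikit-learn's LinearRegression for the linear model), while the meta learner shown here is XGBoost, one of the two final predictors mentioned. It is an assumed arrangement, not the authors' exact pipeline.

```python
# Minimal sketch: two-level stacking with the level-1 regressors named in the text.
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

level1 = [
    ("extra_trees", ExtraTreesRegressor(n_estimators=200, random_state=0)),
    ("random_forest", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("xgb", XGBRegressor(n_estimators=200)),
    ("linear", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
]

# Level-2 meta learner: the text selects XGBoost or Random Forest per location
# category, whichever yields the lower error; XGBoost is shown here as an example.
stack = StackingRegressor(estimators=level1, final_estimator=XGBRegressor(n_estimators=100), cv=5)
# stack.fit(X_train, y_train); stack.predict(X_test)
```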

2.5. Feature Selection

A machine learning model is usually a linear or non-linear function of its input data, and the selection of training data for learning such a function is not negligible for building a better model. A large volume of data does not guarantee a better model; on the contrary, it may mislead the model into incorrect results. For example, a linear function with a large number of independent variables does not guarantee a good prediction of the expected value of the dependent variable. In other words, the performance of machine learning is highly dependent on the input data set, since meaningless input data hinders the learning result from improving. Therefore, it is necessary to select meaningful features of the collected data prior to model learning; this is called feature selection. Subsets of the data are used to test the validity of features. For example, the importance of a feature is higher when the feature sits near the root of a tree, which makes it possible to infer the importance of a specific feature. For regression models, feature selection can also be done with forward selection and/or backward elimination. Besides improving model performance, feature selection has the additional benefit of reducing the dimension of the input data. Through feature selection we obtain an input data set that is smaller than the real observation space but has better explainability, which is useful for very big data or for restricted resources or time [7].
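As a hedged example of importance-based selection, the sketch below uses scikit-learn's SelectFromModel with a Random Forest; the threshold and data are illustrative, and this is only one possible way to realize the selection described above.

```python
# Minimal sketch: keep only features whose Random Forest importance exceeds a threshold.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=300, random_state=0),
    threshold="median",   # keep features above the median importance (illustrative choice)
)
# X: personal-feature matrix, y: visiting ratio of one location category
# X_selected = selector.fit_transform(X, y)
# selected_mask = selector.get_support()   # boolean mask of the retained features
```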

2.6. Hyperparameter Optimization

Unlike parameters, which are learned from the data, hyperparameters are settings of the machine learning model itself. The learning rate in deep learning or the max depth of a decision tree are prominent examples. It is very important to find the optimal value of each hyperparameter, since hyperparameters significantly affect the performance of the model. Manual search of hyperparameters usually requires the intuition of a skillful researcher and luck, and it consumes too much time. Two well-known systematic methods exist, called grid search and random search [8]. Grid search evaluates hyperparameter values at predefined regular intervals and picks the values with the highest performance, although human choices are still required to define the grid. The uniform and global nature of grid search makes it better than manual search; however, it also requires much time as the number of hyperparameter combinations increases. Random search, a random-sampling variant of grid search, speeds up the hyperparameter search. However, both methods perform repetitive, unnecessary searches since they do not utilize prior knowledge from earlier evaluations.
Bayesian optimization is another useful method that makes systematic use of prior knowledge [9]. Bayesian optimization assumes an objective function $f$ of an input value $x$, where $f$ is a black box that takes time to compute. The main purpose is to find the optimum $x^{*}$ of $f$ with as few evaluations as possible. A surrogate model and an acquisition function are required. The surrogate model makes a probabilistic estimate of the unknown objective function based on the known pairs of input values and function values $(x_1, f(x_1)), \ldots, (x_t, f(x_t))$. Probability models used as surrogates include Gaussian Processes (GP), Tree-structured Parzen Estimators (TPE), deep neural networks, and so on. The acquisition function recommends the next useful input value $x_{t+1}$, based on the current estimate of the objective function, in order to find the optimal input value $x^{*}$. Two strategies are common: exploration and exploitation. Exploration recommends points with high standard deviation, since the optimum may exist in an uncertain region; exploitation recommends points around the point with the highest estimated function value. Both strategies are important for finding the optimal input value, but calibrating the ratio between them is critical since they are in a trade-off. Expected Improvement (EI) is the most frequently used acquisition function, since it contains both exploration and exploitation. The Probability of Improvement (PI) is the probability that a candidate input value $x$ has a function value larger than the current best $f(x^{+}) = \max_{i} f(x_i)$ among the evaluations $f(x_1), \ldots, f(x_t)$ so far. EI then evaluates the effect of $x$ by considering both PI and the magnitude of the difference between $f(x)$ and $f(x^{+})$. Other acquisition strategies, such as the Upper Confidence Bound (UCB) and Entropy Search (ES), also exist.
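A minimal sketch of the three search strategies is given below. GridSearchCV and RandomizedSearchCV come from scikit-learn; the Bayesian variant is shown with the separate scikit-optimize package as an assumption about tooling, not necessarily what was used in this research.

```python
# Minimal sketch: grid, random, and (optionally) Bayesian hyperparameter search for a Random Forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_space = {
    "n_estimators": [50, 100, 250, 500, 1000],
    "max_depth": [2, 5, 10, 15],
    "bootstrap": [True, False],
}

grid = GridSearchCV(RandomForestRegressor(random_state=0), param_space, cv=3)
rand = RandomizedSearchCV(RandomForestRegressor(random_state=0), param_space,
                          n_iter=20, cv=3, random_state=0)

# Bayesian optimization (requires the scikit-optimize package; assumed tooling):
# from skopt import BayesSearchCV
# bayes = BayesSearchCV(RandomForestRegressor(random_state=0), param_space, n_iter=50, cv=3)

# grid.fit(X, y); rand.fit(X, y)
# print(grid.best_params_, grid.best_score_)
```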

2.7. Big Five Factors (BFF)

BFF is a set of personality factors suggested by P. T. Costa and R. R. McCrae in 1992 [17]. It has five factors: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. A set of questionnaires is answered by participants, and each of the five factors is scored from 1 to 5 points. Since BFF can numerically represent conceptual human personality, many studies adopt BFF [18,19,20,21,22,23].

3. Preparation of Experiments

Previous research results showed that various personal factors affect the favorite visiting place [5,6]. In addition, the effective personal factors vary widely across location categories [5]. In this research, more precise experiments were designed, including feature selection and hyperparameter optimization, with three different machine learning methodologies. More than 60 volunteers collected their own data for this research; however, some of the volunteers' data sets were too small, so only meaningful data sets were retained. From the data of 34 volunteers, the personal factors used in the experiments were:
  • The highest level of education
  • Religion
  • Salary
  • Method of transportation
  • Commute time
  • The frequency of journey in one year
  • Social media usage status
  • Time spent on social media per day
  • Category of personal hobby
  • BFF
These input features and the location visiting data from the SWARM application [24] are the inputs for Random Forest, XGBoost, and Stacking. The primary output is the ratio of visits to specific location categories. For Stacking, ExtraTreesRegressor, RandomForestRegressor, XGBRegressor, LinearRegressor, and KNeighborsRegressor are used at level 1, XGBoost and Random Forest are used at level 2, and the result with the smaller error rate is selected. In order to compare experimental results, the Symmetric Mean Absolute Percentage Error (SMAPE), discussed in Section 3.4, is used. SMAPE is usually expressed in a range of 0 to 200%, but we normalized the value to a range of 0 to 100% by revising the formula for intuitive comparison. As a result, prediction accuracy is the difference between 100% and the SMAPE value.

3.1. Personal Factors

BFF stands for Big Five Factors, where the five factors are Openness (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and Neuroticism (N). Each factor is measured as a numerical score so that the factors can be easily applied to the training process. Table 1 shows the BFF scores of the participants. We can figure out the personality of a person through these values. A person with high Openness is creative, emotional, and interested in arts. A person with high Conscientiousness is responsible, achieving, and restrained. A person with high Agreeableness is agreeable to other people, altruistic, thoughtful, and modest, while a person with high Neuroticism is sensitive to stress, impulsive, hostile, and depressed. For example, as shown in Table 1, person 4 is creative, emotional, responsible, and restrained; also, considering person 4's Neuroticism, person 4 is not impulsive and is resistant to stress. The personalities shown in Table 1 are used as our experimental basis together with the other personal factors.
In Table 2, the numbers corresponding to the responses are as follows:
The highest level of education
  • Middle school graduate
  • High school graduate
  • College graduate
  • Master
  • Doctor
Religion
  • Atheism
  • Christianity
  • Catholic
  • Buddhism
Salary
  • Less than USD 500
  • USD 500 to 1000
  • USD 1000 to 2000
  • USD 2000 to 3000
  • over USD 3000
Method of transportation
  • Walking
  • Bicycle
  • Car
  • Public transport
Commute time
  • Less than 30 min
  • 30 min to 1 h
  • 1 h to 2 h
  • Over 2 h
The frequency of journey in one year
  • Less than one time
  • 2 to 3 times
  • 4 to 5 times
  • Over six times
Social media usage status (SNS1)
  • Use
  • Not use
Time spent on social media per day (SNS2)
  • Less than 30 min
  • 30 min to 1 h
  • 1 h to 3 h
  • Over 3 h
Category of personal hobby
  • Static activity
  • Dynamic activity
  • Both
In the case of Person 1, the responses indicate a high school graduate, no religion, income of USD 500 to 1000, public transport, a commute of 1 to 2 h, two or three journeys per year, use of social media with 1 to 3 h spent per day, and both dynamic and static hobbies. Static activities include watching movies and plays, reading books, and so on, while dynamic activities include sports, food tours, and so on.

3.2. Location Category Data

The SWARM application, installed on smartphones, is used to collect geo-positioning data [24]. Users actively check in to visited places with SWARM, and these actively collected location data are used as part of our analysis. The location data consist of a location type, such as restaurant, home, or bus stop, and a timestamp for a specific person. The volunteers collected their own location visiting data on their own smartphones. The location category data were used as the labels (target data) for the supervised learning with Random Forest, XGBoost, and Stacking. After the visits were checked in through the SWARM application, the number of visits and the visited places were identified from the SWARM web page. Part of the location data of person 16 is shown in Table 3.
The data collected were classified into 10 categories. Table 4 shows the classification of person 16's location data into these categories.
To input the categorized location data to the machine learning models, the visiting ratios of the location categories are used as labels, as defined in Formula (1).
$$\mathrm{Visiting\_Ratio} = \frac{\mathrm{count\_of\_visit\_to\_location}}{\mathrm{total\_count\_of\_visits}} \tag{1}$$
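For illustration, Formula (1) can be computed directly from the visit counts; the sketch below uses the counts of person 16 from Table 4.

```python
# Minimal sketch of Formula (1): visiting ratio per location category (counts from Table 4, person 16).
visit_counts = {
    "Foreign Institutions": 0, "Retail Business": 6, "Service Industry": 6,
    "Restaurant": 29, "Pub": 2, "Beverage Store": 26, "Theater and Concert Hall": 4,
    "Institutions of Education": 62, "Hospital": 6,
    "Museum, Gallery, Historical site, Tourist spots": 9,
}
total_visits = sum(visit_counts.values())  # 150 for this example
visiting_ratio = {cat: count / total_visits for cat, count in visit_counts.items()}
print(visiting_ratio["Institutions of Education"])  # 0.4133...
```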

3.3. Hyperparameter Search Space

Table 5 shows the hyperparameter search space for this research. For example, the booster of XGBoost is the base model, which is one of 'gblinear', 'gbtree', or 'dart'. The tree-based boosters are dart and gbtree, while gblinear is based on a linear function; 'dart' additionally applies dropout, a technique from deep learning models. For tree-based models, additional hyperparameters such as min_samples_leaf and min_samples_split also exist, and such values would normally be set appropriately in order to prune against overfitting. In this research, these values were not tuned, since we have a relatively small amount of data, and minor hyperparameters that were found to be less effective for accuracy were left at their default values.
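For reference, the search space of Table 5 could be written as parameter grids as sketched below; the step sizes within the stated ranges are illustrative choices, not taken from the paper.

```python
# Minimal sketch: the Table 5 search space expressed as parameter grids.
rf_space = {
    "n_estimators": list(range(50, 1001, 50)),  # 50 <= n_estimators <= 1000
    "max_depth": list(range(2, 16)),            # 2 <= max_depth <= 15
    "bootstrap": [True, False],
}
xgb_space = {
    "booster": ["gbtree", "gblinear", "dart"],
    "n_estimators": list(range(50, 1001, 50)),
    "max_depth": list(range(2, 16)),
    "learning_rate": [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],  # 0.05 <= learning_rate <= 0.3
}
```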

3.4. Symmetric Mean Absolute Percentage Error

We used SMAPE, the Symmetric Mean Absolute Percentage Error, as an accuracy measure. It is an alternative to the Mean Absolute Percentage Error when there are zero or near-zero values [25,26,27]. SMAPE by itself is limited to an error rate of 200%, reducing the influence of such low-volume items. It is usually defined as Formula (2), where $A_t$ is the actual value and $F_t$ is the forecast value. Formula (2) provides a result between 0% and 200%. However, Formula (3) is often used in practice since a percentage error between 0% and 100% is much easier to interpret, and we also use this formula.
$$\mathrm{SMAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{|F_t - A_t|}{(A_t + F_t)/2} \tag{2}$$
$$\mathrm{SMAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\frac{|F_t - A_t|}{|A_t| + |F_t|} \tag{3}$$
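A small sketch of both formulas is given below (values multiplied by 100 so they read as percentages); the second, 0 to 100% form is the one used in this paper, and prediction accuracy is reported as 100 minus this value.

```python
# Minimal sketch of Formulas (2) and (3); assumes A_t and F_t are not both zero for any t.
import numpy as np

def smape_200(actual, forecast):
    a, f = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(f - a) / ((a + f) / 2.0))   # Formula (2), 0-200%

def smape_100(actual, forecast):
    a, f = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs(f - a) / (np.abs(a) + np.abs(f)))   # Formula (3), 0-100%

# accuracy = 100 - smape_100(y_true, y_pred)
```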

4. Analysis of Results

We present the experimental results in this section, mainly in the form of tables and graphs. Table 6 shows the selected features for each learning model and the corresponding prediction accuracy for each location category. Prediction accuracy is represented as 100% minus SMAPE. Random Forest and Stacking use the same set of features, since feature selection was done with Random Forest. The abbreviations for the machine learning algorithms are as follows:
  • RF: Random Forest
  • XGB: XGBoost
  • STK: Stacking
From the results, the features selected by Random Forest and by XGBoost overlap, but not completely; this is due to the different learning mechanisms of the two models. From the viewpoint of big data, feature selection can reduce noise and the effect of overfitting, with increased accuracy.
However, Figure 1 and Figure 2 show that the accuracy is slightly degraded in our case, possibly due to the restricted size and nature of the data used in these experiments. For XGBoost, prediction accuracy increased for location categories such as foreign institutions and hospital, and for location categories with many subcategories. Foreign institutions and hospital inherently have a small amount of data, and the location categories with many subcategories aggregate various, unrelated kinds of subcategories.
In addition, Figure 3 shows that Stacking with feature selection resulted in high prediction accuracy; presumably the aggregation of the five different level-1 models of Stacking reduces the noise in the data. In the cases of foreign institutions and hospital, which have very small numbers of visits, the five models may each reach low accuracy at level 1, and the aggregated result remains inaccurate. It is notable that several of the BFF factors were always included among the selected features, which supports the claim that personality is highly related to visiting places.
Table 7 shows the results of hyperparameter optimization and the corresponding prediction accuracy numerically. In Table 7, the hyperparameter values of RF are listed as n_estimators, max_depth, and bootstrap, respectively, and those of XGB as n_estimators, max_depth, learning_rate, and booster. It is notable that, although feature selection alone could decrease accuracy, prediction accuracy can be increased with hyperparameter optimization. For Random Forest, bootstrap is selected for most of the location categories; bootstrap is useful for smaller amounts of input data. In addition, different optimization methods for the same location category lead to similar values of max_depth or n_estimators. Interestingly, different hyperparameters can lead to similar accuracy; presumably the large structures generated in the learning processes of Random Forest and XGBoost enable the accuracy to converge. XGB in particular is highly dependent on the selection of the booster, so it is important to select an adequate booster for XGB. In addition, the selection of the number of iterations for Bayesian optimization is also important. In the case of theater and concert hall, a low number of Bayesian optimization iterations led to the 'gblinear' booster, and the accuracy was about 20% lower than that of grid search or random search; the linear function of 'gblinear' shows a big gap in accuracy compared to the tree-based 'gbtree' and 'dart'. On the other hand, too many iterations led to low prediction accuracy due to overfitting. We concluded that an appropriate number of iterations lies in the range of 50 to 60.
As expected, the execution times of the three optimization methods differ considerably. Table 8 shows the execution time of hyperparameter optimization; Figure 4 is for RF and Figure 5 is for XGB, respectively. Although grid search and random search show similar accuracy, there is a big difference in execution time. Bayesian optimization is somewhat slower than random search but much faster than grid search. We consider Bayesian optimization the preferred choice because of its balance of execution time and performance, and because prior knowledge is reflected in the search.
The overall accuracy of the three models can be found in Table 9, Table 10 and Table 11, and the aforementioned Figure 1, Figure 2 and Figure 3 show the accuracy under each experimental condition. For Stacking, hyperparameter optimization was not performed due to the nature of Stacking as meta learning: if hyperparameter optimization were applied to Stacking, every model at level 1 would have to be optimized, which would result in a drastic increase in execution time. In fact, Stacking with feature selection shows prediction accuracy as high as RF and XGB with hyperparameter optimization. Figure 6 shows the prediction accuracy for each location category. The results for foreign institutions and hospital are unreliable due to their low accuracy, which stems from the small amount of raw data. For the other location categories, prediction accuracy is in the range of 50% to 80%. As mentioned earlier, categories with too many diverse subcategories show somewhat low accuracy, and splitting such subcategories could lead to higher prediction accuracy. For instance, distinct location categories such as restaurant, pub, beverage store, and theater and concert hall show high accuracy.

5. Conclusions and Future Works

Location-Based Services (LBS) and recommendation systems are typical examples of personalized services. For example, content providers such as Netflix have opened competitions for recommendation system development. However, recommendation systems suffer from the cold start problem, which makes recommendation difficult for new users or new contents, and the protection of personal history data is another problem. Both could be mitigated if we could predict user preference from basic user features regardless of history. Various studies have discussed that human personality and location preference are highly related, and personal features other than personality are also related to the preference of visiting locations. Of course, only some distinguished personal features are meaningful for personal location preference.
In this research, using three different machine learning methods, we figured out the effects of the distinguished personal features on personal location preference. As a result, eight location categories out of ten showed meaningful prediction accuracy: Retail Business, Service Industry, Restaurant, Pub, Beverage Store, Theater and Concert Hall, Institutions of Education, and Museum, Gallery, Historical sites, Tourist spots. For all three algorithms, the prediction accuracy appears reliable, with very similar tendencies in the analysis results. The input features that affect location category selection were selected with Random Forest and XGBoost, and, of course, the selected features depend on the location category. Based on our research, visiting preference for such location categories is highly predictable from personal features. In addition, hyperparameter optimization, a kind of AutoML technology, was introduced in order to increase prediction accuracy; grid search, random search, and Bayesian optimization were applied and their results compared.
In our research, we demonstrated a method for visiting place prediction. For large amounts of input data, feature selection can be applied in order to reduce the dimension of the data and increase the quality of the input data; in such cases, Stacking could be one of the best solutions even without hyperparameter optimization. On the contrary, for smaller amounts of input data, bagging or boosting together with hyperparameter optimization could be a better solution, since Stacking may show poor prediction accuracy. We need to research further, especially for location categories such as service industry and retail business, since too many subcategories make the categorization vague. In addition, less sensitive personal features may exist for the prediction of visiting location, and such features should be identified. Regarding the volunteers' data, we need to expand the data collection toward a larger number of data points as well as a wider span of participants, since our data have the clear limitation of the volunteer pool: engineering students in their twenties.

Author Contributions

Conceptualization, H.Y.S.; methodology, Y.M.K.; software, Y.M.K.; validation, H.Y.S.; formal analysis, Y.M.K.; investigation, H.Y.S.; resources, H.Y.S.; data curation, Y.M.K.; writing—original draft preparation, Y.M.K.; writing—review and editing, H.Y.S.; visualization, Y.M.K.; supervision, H.Y.S.; project administration, H.Y.S.; funding acquisition, H.Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (NRF-2019R1F1A1056123).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Song, H.Y.; Lee, E.B. An analysis of the relationship between human personality and favored location. In Proceedings of the AFIN 2015, The Seventh International Conference on Advances in Future Internet, Venice, Italy, 23–28 August 2015; p. 12. [Google Scholar]
  2. Song, H.Y.; Kang, H.B. Analysis of Relationship Between Personality and Favorite Places with Poisson Regression Analysis. ITM Web Conf. 2018, 16, 02001. [Google Scholar] [CrossRef] [Green Version]
  3. Kim, S.Y.; Song, H.Y. Determination coefficient analysis between personality and location using regression. In Proceedings of the International Conference on Sciences, Engineering and Technology Innovations, ICSETI, Bali, Indonesia, 22 May 2015; pp. 265–274. [Google Scholar]
  4. Kim, S.Y.; Song, H.Y. Predicting Human Location Based on Human Personality. In International Conference on Next Generation Wired/Wireless Networking, Proceedings of the NEW2AN 2014: Internet of Things, Smart Spaces, and Next Generation Networks and Systems, St. Petersburg, Russia, 27–29 August 2014; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 70–81. [Google Scholar] [CrossRef]
  5. Kim, Y.M.; Song, H.Y. Analysis of Relationship between Personal Factors and Visiting Places using Random Forest Technique. In Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS), Leipzig, Germany, 1–4 September 2019; pp. 725–732. [Google Scholar]
  6. Song, H.Y.; Yun, J. Analysis of the Correlation Between Personal Factors and Visiting Locations With Boosting Technique. In Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS), Leipzig, Germany, 1–4 September 2019; pp. 743–746. [Google Scholar]
  7. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) 2017, 50, 1–45. [Google Scholar] [CrossRef] [Green Version]
  8. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2015, 104, 148–175. [Google Scholar] [CrossRef] [Green Version]
  9. Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. arXiv 2012, arXiv:1206.2944. [Google Scholar]
  10. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; Chapman and Hall/CRC: Boca Raton, FL, USA, 2012. [Google Scholar]
  11. Bennett, J.; Lanning, S. The netflix prize. In Proceedings of the KDD Cup and Workshop; Citeseer: New York, NY, USA, 2007; Volume 2007, p. 35. [Google Scholar]
  12. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  13. Segal, M.R. Machine Learning Benchmarks and Random Forest Regression. UCSF: Center for Bioinformatics and Molecular Biostatistics. 2004. Available online: https://escholarship.org/uc/item/35x3v9t4 (accessed on 27 June 2021).
  14. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  15. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
  16. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; O’Reilly Media: Newton, MA, USA, 2019. [Google Scholar]
  17. Costa, P.T.; McCrae, R.R. Four ways five factors are basic. Personal. Individ. Differ. 1992, 13, 653–665. [Google Scholar] [CrossRef]
  18. Hoseinifar, J.; Siedkalan, M.M.; Zirak, S.R.; Nowrozi, M.; Shaker, A.; Meamar, E.; Ghaderi, E. An Investigation of The Relation Between Creativity and Five Factors of Personality In Students. Procedia Soc. Behav. Sci. 2011, 30, 2037–2041. [Google Scholar] [CrossRef] [Green Version]
  19. Jani, D.; Jang, J.H.; Hwang, Y.H. Big five factors of personality and tourists’ Internet search behavior. Asia Pac. J. Tour. Res. 2014, 19, 600–615. [Google Scholar] [CrossRef]
  20. Jani, D.; Han, H. Personality, social comparison, consumption emotions, satisfaction, and behavioral intentions. Int. J. Contemp. Hosp. Manag. 2013, 25, 970–993. [Google Scholar] [CrossRef] [Green Version]
  21. John, O.P.; Srivastava, S. The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In Handbook of Personality: Theory and Research; University of California: Berkeley, CA, USA, 1999; Volume 2, pp. 102–138. [Google Scholar]
  22. Amichai-Hamburger, Y.; Vinitzky, G. Social network use and personality. Comput. Hum. Behav. 2010, 26, 1289–1295. [Google Scholar] [CrossRef]
  23. Chorley, M.J.; Whitaker, R.M.; Allen, S.M. Personality and location-based social networks. Comput. Hum. Behav. 2015, 46, 45–56. [Google Scholar] [CrossRef] [Green Version]
  24. Foursquare Labs, Inc. Swarm App. 2019. Available online: https://www.swarmapp.com/ (accessed on 27 June 2021).
  25. Armstrong, J.S. Long-Range Forecasting; Wiley: New York, NY, USA, 1985. [Google Scholar]
  26. Flores, B.E. A pragmatic view of accuracy measurement in forecasting. Omega 1986, 14, 93–98. [Google Scholar] [CrossRef]
  27. Tofallis, C. A better measure of relative prediction accuracy for model selection and model estimation. J. Oper. Res. Soc. 2015, 66, 1352–1362. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Accuracy Graph of Random Forest.
Figure 2. Accuracy Graph of XGBoost.
Figure 3. Accuracy Graph of Stacking.
Figure 4. Execution Time Graph of Random Forest.
Figure 5. Execution Time Graph of XGBoost.
Figure 6. Prediction Accuracy Comparison of all methods.
Table 1. BFF of Participants.
Person | O | C | E | A | N
Person 1 | 3.3 | 3.9 | 3.3 | 3.7 | 2.6
Person 2 | 2.7 | 3.2 | 3.2 | 2.7 | 2.8
Person 3 | 4.3 | 3.1 | 2.3 | 3.2 | 2.9
Person 4 | 4.2 | 4.3 | 3.5 | 3.6 | 2.6
Person 5 | 4.0 | 3.7 | 4.0 | 3.9 | 2.8
Person 6 | 3.8 | 4.0 | 3.1 | 3.8 | 2.3
Person 7 | 3.2 | 3.2 | 3.5 | 3.3 | 3.5
Person 8 | 2.8 | 3.8 | 3.8 | 3.3 | 2.3
Person 9 | 3.4 | 3.6 | 3.5 | 3.6 | 3.1
Person 10 | 3.0 | 3.6 | 2.5 | 3.0 | 3.0
Person 11 | 4.1 | 3.8 | 3.8 | 2.8 | 3.0
Person 12 | 3.1 | 3.0 | 2.8 | 3.0 | 2.8
Person 13 | 3.3 | 3.2 | 3.5 | 2.6 | 2.6
Person 14 | 3.7 | 3.3 | 3.6 | 3.8 | 3.5
Person 15 | 2.4 | 3.7 | 3.0 | 2.8 | 2.6
Person 16 | 3.4 | 3.2 | 3.0 | 3.0 | 2.6
Person 17 | 3.9 | 3.3 | 3.5 | 2.9 | 2.8
Person 18 | 3.0 | 3.3 | 3.3 | 3.1 | 3.0
Person 19 | 3.3 | 3.6 | 3.1 | 3.1 | 3.5
Person 20 | 3.2 | 2.9 | 3.0 | 3.1 | 3.4
Person 21 | 2.4 | 3.9 | 3.3 | 3.0 | 2.8
Person 22 | 2.5 | 3.0 | 3.4 | 2.8 | 2.5
Person 23 | 4.0 | 3.4 | 3.3 | 2.3 | 2.9
Person 24 | 3.3 | 4.0 | 4.3 | 3.7 | 2.3
Person 25 | 3.1 | 3.6 | 3.5 | 3.2 | 3.3
Person 26 | 3.6 | 3.3 | 3.0 | 3.3 | 3.1
Person 27 | 3.4 | 3.1 | 2.9 | 3.3 | 3.1
Person 28 | 3.6 | 3.6 | 3.1 | 3.0 | 3.4
Person 29 | 2.7 | 3.7 | 2.9 | 3.0 | 3.0
Person 30 | 3.5 | 3.8 | 3.4 | 3.2 | 3.0
Person 31 | 3.6 | 3.3 | 2.8 | 3.2 | 2.8
Person 32 | 2.2 | 3.2 | 3.1 | 2.8 | 3.4
Person 33 | 3.3 | 2.9 | 3.1 | 3.1 | 3.3
Person 34 | 3.4 | 3.2 | 3.4 | 3.3 | 3.1
Table 2. Personal Factors: Person 1.
Personal Factor | Value
The highest level of education | 2
Religion | 1
Salary | 2
Method of transportation | 4
Commute time | 3
The frequency of journey in one year | 2
Social media usage status | 1
Time spent on social media per day | 3
Hobby | 3
Openness | 3.3
Conscientiousness | 3.9
Extraversion | 3.3
Agreeableness | 3.7
Neuroticism | 2.6
Table 3. Sample Location Data: Person 16.
Location | Count of Visit
Hongik Univ. Wowkwan | 19
Hongik Univ. IT Center | 7
Kanemaya noodle Restaurant | 3
Starbucks | 3
Hongik Univ. Central Library | 8
Coffeesmith | 2
Daiso | 3
Table 4. Sample Categorized Location Data: Person 16.
Location Category | Count of Visit | Visiting Ratio
Foreign Institutions | 0 | 0.0000
Retail Business | 6 | 0.0400
Service Industry | 6 | 0.0400
Restaurant | 29 | 0.1933
Pub | 2 | 0.0133
Beverage Store | 26 | 0.1733
Theater and Concert Hall | 4 | 0.0267
Institutions of Education | 62 | 0.4133
Hospital | 6 | 0.0400
Museum, Gallery, Historical site, Tourist spots | 9 | 0.0600
Table 5. Hyperparameter Search Space.
Learning Model | Hyperparameter | Meaning | Range of Values
Random Forest | n_estimators | Number of decision trees | 50 ≤ n_estimators ≤ 1000
Random Forest | max_depth | Max depth of each decision tree | 2 ≤ max_depth ≤ 15
Random Forest | bootstrap | Whether to use bootstrap | {True, False}
XGBoost | booster | Type of booster to use | {'gbtree', 'gblinear', 'dart'}
XGBoost | n_estimators | Number of gradient boosted trees or boosting rounds | 50 ≤ n_estimators ≤ 1000
XGBoost | max_depth | Max depth of each gradient boosted tree | 2 ≤ max_depth ≤ 15
XGBoost | learning_rate | Learning rate | 0.05 ≤ learning_rate ≤ 0.3
Table 6. Results of Feature Selection. (RF and STK share the same selected features, since feature selection was done with Random Forest.)
Location Category | Learning Algorithm | Selected Features | Accuracy (%) (100-SMAPE)
Foreign Institutions | RF | O, C, A | 3.07
Foreign Institutions | STK | O, C, A | 1.77
Foreign Institutions | XGB | O, C, Salary | 2.72
Retail Business | RF | O, A, N | 50.89
Retail Business | STK | O, A, N | 58.50
Retail Business | XGB | O, A, Salary, Journey, SNS1, Hobby | 43.76
Service Industry | RF | O, C, A, Edu, Salary | 58.16
Service Industry | STK | O, C, A, Edu, Salary | 57.19
Service Industry | XGB | Edu | 61.50
Restaurant | RF | O, C, E, A, Religion, Hobby | 75.60
Restaurant | STK | O, C, E, A, Religion, Hobby | 84.19
Restaurant | XGB | C, E, N, Religion, C.time | 76.15
Pub | RF | O, A, Edu, Salary, SNS2 | 63.62
Pub | STK | O, A, Edu, Salary, SNS2 | 66.43
Pub | XGB | O, E, A, Edu, Transport, Journey, SNS2 | 58.84
Beverage Store | RF | E, N, Religion, Salary, SNS2, Hobby | 69.59
Beverage Store | STK | E, N, Religion, Salary, SNS2, Hobby | 78.75
Beverage Store | XGB | E, A, N, Hobby | 52.30
Theater and Concert Hall | RF | E, N, Religion, Salary | 63.62
Theater and Concert Hall | STK | E, N, Religion, Salary | 57.25
Theater and Concert Hall | XGB | E, N, Religion, Salary, Transport | 69.39
Institutions of Education | RF | O, C, E, A, N, Religion | 75.79
Institutions of Education | STK | O, C, E, A, N, Religion | 80.90
Institutions of Education | XGB | E, N, Journey, SNS2 | 67.14
Hospital | RF | C, E, A, Edu, Salary | 10.44
Hospital | STK | C, E, A, Edu, Salary | 14.48
Hospital | XGB | Edu, Journey | 15.15
Museum, Gallery, Historical site, Tourist spots | RF | O, A, Journey | 41.18
Museum, Gallery, Historical site, Tourist spots | STK | O, A, Journey | 45.35
Museum, Gallery, Historical site, Tourist spots | XGB | O, A, Journey | 43.57
Table 7. Optimized Hyperparameters. (RF values are n_estimators, max_depth, bootstrap; XGB values are n_estimators, max_depth, learning_rate, booster.)
Location Category | Optimization Algorithm | Learning Algorithm | Optimal Hyperparameter Values Searched | Accuracy (%) (100-SMAPE)
Foreign Institutions | Grid | RF | 50, 15, True | 3.98
Foreign Institutions | Grid | XGB | 1000, 2, 0.3, 'gblinear' | 3.88
Foreign Institutions | Random | RF | 250, 4, True | 2.90
Foreign Institutions | Random | XGB | 1000, 5, 0.3, 'gblinear' | 3.88
Foreign Institutions | Bayesian | RF | 150, 15, True | 4.03
Foreign Institutions | Bayesian | XGB | 1000, 15, 0.3, 'gblinear' | 3.88
Retail Business | Grid | RF | 50, 9, True | 51.67
Retail Business | Grid | XGB | 200, 2, 0.05, 'gblinear' | 49.20
Retail Business | Random | RF | 350, 9, True | 52.57
Retail Business | Random | XGB | 500, 12, 0.1, 'gblinear' | 50.78
Retail Business | Bayesian | RF | 450, 5, False | 55.78
Retail Business | Bayesian | XGB | 350, 3, 0.3, 'dart' | 46.25
Service Industry | Grid | RF | 250, 2, True | 59.59
Service Industry | Grid | XGB | 50, 2, 0.1, 'dart' | 61.55
Service Industry | Random | RF | 950, 2, True | 59.18
Service Industry | Random | XGB | 550, 13, 0.1, 'dart' | 61.50
Service Industry | Bayesian | RF | 900, 2, True | 59.09
Service Industry | Bayesian | XGB | 1000, 4, 0.05, 'gbtree' | 61.50
Restaurant | Grid | RF | 50, 11, True | 77.50
Restaurant | Grid | XGB | 50, 2, 0.05, 'gblinear' | 78.70
Restaurant | Random | RF | 200, 9, True | 77.45
Restaurant | Random | XGB | 650, 4, 0.05, 'gblinear' | 78.94
Restaurant | Bayesian | RF | 250, 5, True | 77.95
Restaurant | Bayesian | XGB | 100, 10, 0.05, 'gblinear' | 78.76
Pub | Grid | RF | 50, 3, True | 68.83
Pub | Grid | XGB | 50, 2, 0.25, 'gbtree' | 63.19
Pub | Random | RF | 100, 2, True | 69.09
Pub | Random | XGB | 900, 7, 0.2, 'gblinear' | 65.30
Pub | Bayesian | RF | 250, 14, True | 69.82
Pub | Bayesian | XGB | 900, 2, 0.25, 'dart' | 63.86
Beverage Store | Grid | RF | 50, 12, True | 66.02
Beverage Store | Grid | XGB | 1000, 2, 0.3, 'gblinear' | 74.25
Beverage Store | Random | RF | 50, 9, True | 69.66
Beverage Store | Random | XGB | 800, 9, 0.3, 'gblinear' | 74.25
Beverage Store | Bayesian | RF | 50, 10, True | 70.40
Beverage Store | Bayesian | XGB | 1000, 4, 0.3, 'gblinear' | 74.25
Theater and Concert Hall | Grid | RF | 150, 6, True | 64.76
Theater and Concert Hall | Grid | XGB | 50, 2, 0.15, 'dart' | 68.32
Theater and Concert Hall | Random | RF | 100, 13, True | 66.60
Theater and Concert Hall | Random | XGB | 500, 5, 0.1, 'gbtree' | 70.49
Theater and Concert Hall | Bayesian | RF | 950, 2, True | 62.26
Theater and Concert Hall | Bayesian | XGB | 733, 2, 0.15, 'gbtree' | 69.64
Institutions of Education | Grid | RF | 50, 6, True | 77.05
Institutions of Education | Grid | XGB | 1000, 5, 0.3, 'gblinear' | 77.45
Institutions of Education | Random | RF | 400, 3, True | 76.20
Institutions of Education | Random | XGB | 900, 12, 0.2, 'gblinear' | 77.45
Institutions of Education | Bayesian | RF | 100, 15, True | 76.22
Institutions of Education | Bayesian | XGB | 944, 14, 0.55, 'gblinear' | 77.45
Hospital | Grid | RF | 100, 10, True | 10.62
Hospital | Grid | XGB | 50, 5, 0.1, 'gbtree' | 16.97
Hospital | Random | RF | 50, 15, True | 11.84
Hospital | Random | XGB | 50, 13, 0.1, 'gbtree' | 16.97
Hospital | Bayesian | RF | 100, 2, False | 14.13
Hospital | Bayesian | XGB | 293, 15, 0.55, 'gblinear' | 16.11
Museum, Gallery, Historical site, Tourist spots | Grid | RF | 150, 2, True | 42.38
Museum, Gallery, Historical site, Tourist spots | Grid | XGB | 1000, 5, 0.3, 'gblinear' | 50.74
Museum, Gallery, Historical site, Tourist spots | Random | RF | 1000, 2, True | 42.68
Museum, Gallery, Historical site, Tourist spots | Random | XGB | 500, 9, 0.25, 'gblinear' | 50.72
Museum, Gallery, Historical site, Tourist spots | Bayesian | RF | 700, 2, True | 42.55
Museum, Gallery, Historical site, Tourist spots | Bayesian | XGB | 76, 10, 0.2, 'gblinear' | 48.83
Table 8. Hyperparameter Optimization Execution Time.
Location Category | Optimization Algorithm | Learning Algorithm | Execution Time (in Seconds)
Foreign Institutions | Grid | RF | 546.29
Foreign Institutions | Grid | XGB | 259.12
Foreign Institutions | Random | RF | 6.68
Foreign Institutions | Random | XGB | 4.50
Foreign Institutions | Bayesian | RF | 190.65
Foreign Institutions | Bayesian | XGB | 27.08
Retail Business | Grid | RF | 542.62
Retail Business | Grid | XGB | 345.77
Retail Business | Random | RF | 8.33
Retail Business | Random | XGB | 5.82
Retail Business | Bayesian | RF | 143.21
Retail Business | Bayesian | XGB | 18.48
Service Industry | Grid | RF | 551.15
Service Industry | Grid | XGB | 198.99
Service Industry | Random | RF | 9.06
Service Industry | Random | XGB | 0.91
Service Industry | Bayesian | RF | 135.79
Service Industry | Bayesian | XGB | 18.23
Restaurant | Grid | RF | 557.60
Restaurant | Grid | XGB | 297.58
Restaurant | Random | RF | 5.39
Restaurant | Random | XGB | 4.92
Restaurant | Bayesian | RF | 142.96
Restaurant | Bayesian | XGB | 23.84
Pub | Grid | RF | 657.58
Pub | Grid | XGB | 284.47
Pub | Random | RF | 6.24
Pub | Random | XGB | 4.85
Pub | Bayesian | RF | 289.76
Pub | Bayesian | XGB | 31.11
Beverage Store | Grid | RF | 542.57
Beverage Store | Grid | XGB | 292.10
Beverage Store | Random | RF | 3.53
Beverage Store | Random | XGB | 1.03
Beverage Store | Bayesian | RF | 148.80
Beverage Store | Bayesian | XGB | 13.14
Theater and Concert Hall | Grid | RF | 550.68
Theater and Concert Hall | Grid | XGB | 211.76
Theater and Concert Hall | Random | RF | 7.57
Theater and Concert Hall | Random | XGB | 5.22
Theater and Concert Hall | Bayesian | RF | 152.43
Theater and Concert Hall | Bayesian | XGB | 30.92
Institutions of Education | Grid | RF | 571.96
Institutions of Education | Grid | XGB | 122.84
Institutions of Education | Random | RF | 7.88
Institutions of Education | Random | XGB | 1.59
Institutions of Education | Bayesian | RF | 176.90
Institutions of Education | Bayesian | XGB | 36.61
Hospital | Grid | RF | 326.95
Hospital | Grid | XGB | 181.21
Hospital | Random | RF | 5.54
Hospital | Random | XGB | 4.79
Hospital | Bayesian | RF | 169.42
Hospital | Bayesian | XGB | 29.93
Museum, Gallery, Historical site, Tourist spots | Grid | RF | 558.27
Museum, Gallery, Historical site, Tourist spots | Grid | XGB | 260.24
Museum, Gallery, Historical site, Tourist spots | Random | RF | 7.54
Museum, Gallery, Historical site, Tourist spots | Random | XGB | 4.88
Museum, Gallery, Historical site, Tourist spots | Bayesian | RF | 161.72
Museum, Gallery, Historical site, Tourist spots | Bayesian | XGB | 22.67
Table 9. Accuracy of Random Forest.
Location Category | RF | +Feature Selection | +Hyperparameter Optimization (Grid) | (Random) | (Bayesian)
Foreign Institutions | 2.96 | 3.07 | 3.98 | 2.90 | 4.03
Retail Business | 48.98 | 50.89 | 51.67 | 52.57 | 55.78
Service Industry | 60.98 | 58.16 | 59.59 | 59.18 | 59.09
Restaurant | 78.09 | 75.60 | 77.50 | 77.45 | 77.95
Pub | 65.32 | 63.62 | 68.83 | 69.09 | 69.82
Beverage Store | 69.66 | 69.59 | 66.02 | 69.66 | 70.40
Theater and Concert Hall | 64.43 | 63.62 | 64.76 | 66.60 | 62.26
Institutions of Education | 76.66 | 75.79 | 77.05 | 76.20 | 76.22
Hospital | 12.63 | 10.44 | 10.62 | 11.84 | 14.13
Museum, Gallery, Historical site, Tourist spots | 44.57 | 41.18 | 42.38 | 42.68 | 42.55
Table 10. Accuracy of XGBoost.
Location Category | XGB | +Feature Selection | +Hyperparameter Optimization (Grid) | (Random) | (Bayesian)
Foreign Institutions | 1.56 | 2.72 | 3.88 | 3.88 | 3.88
Retail Business | 47.16 | 43.76 | 49.20 | 50.78 | 46.25
Service Industry | 52.50 | 61.50 | 61.55 | 61.50 | 61.50
Restaurant | 76.44 | 76.15 | 78.70 | 78.94 | 78.76
Pub | 61.90 | 58.84 | 63.19 | 65.30 | 63.86
Beverage Store | 60.97 | 52.30 | 74.25 | 74.25 | 74.25
Theater and Concert Hall | 74.99 | 69.39 | 68.32 | 70.49 | 69.64
Institutions of Education | 72.49 | 67.14 | 77.45 | 77.45 | 77.45
Hospital | 12.30 | 15.15 | 16.97 | 16.97 | 16.11
Museum, Gallery, Historical site, Tourist spots | 43.06 | 43.57 | 50.74 | 50.72 | 48.83
Table 11. Accuracy of Stacking.
Location Category | Stacking | +Feature Selection
Foreign Institutions | 2.88 | 1.77
Retail Business | 49.65 | 58.50
Service Industry | 55.51 | 57.19
Restaurant | 82.03 | 84.19
Pub | 63.20 | 66.43
Beverage Store | 72.52 | 78.75
Theater and Concert Hall | 55.01 | 57.25
Institutions of Education | 72.40 | 80.90
Hospital | 23.40 | 14.48
Museum, Gallery, Historical site, Tourist spots | 43.76 | 45.35
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
