Toward Improving the Prediction Accuracy of Product Recommendation System Using Extreme Gradient Boosting and Encoding Approaches

With the spread of COVID-19, the “untact” culture in South Korea is expanding and customers are increasingly seeking for online services. A recommendation system serves as a decisionmaking indicator that helps users by suggesting items to be purchased in the future by exploring the symmetry between multiple user activity characteristics. A plethora of approaches are employed by the scientific community to design recommendation systems, including collaborative filtering, stereotyping, and content-based filtering, etc. The current paradigm of recommendation systems favors collaborative filtering due to its significant potential to closely capture the interest of a user as compared to other approaches. The collaborative filtering harnesses features like user-profile details, visited pages, and click information to determine the interest of a user, thereby recommending the items that are related to the user’s interest. The existing collaborative filtering approaches exploit implicit and explicit features and report either good classification or prediction outcome. These systems fail to exhibit good results for both measures at the same time. We believe that avoiding the recommendation of those items that have already been purchased could contribute to overcoming the said issue. In this study, we present a collaborative filtering-based algorithm to tackle big data of user with symmetric purchasing order and repetitive purchased products. The proposed algorithm relies on combining extreme gradient boosting machine learning architecture with word2vec mechanism to explore the purchased products based on the click patterns of users. Our algorithm improves the accuracy of predicting the relevant products to be recommended to the customers that are likely to be bought. The results are evaluated on the dataset that contains click-based features of users from an online shopping mall in Jeju Island, South Korea. We have evaluated Mean Absolute Error, Mean Square Error, and Root Mean Square Error for our proposed methodology and also other machine learning algorithms. Our proposed model generated the least error rate and enhanced the prediction accuracy of the recommendation system compared to other traditional approaches.


Introduction
Over the past few decades, recommender systems (RS) have immensely been utilized by a number of domains including e-commerce, research-paper recommender systems, social websites, etc. The e-commerce or similar websites e.g., Amazon, Yelp, Epinions, etc. provide feasibility to its customers to share their opinion regarding the purchased items. The main purpose of these systems is to recommend items or products that relate to one's interest or are highly likely to be purchased in the future. The recommendations are made based on historical records i.e., opinions of customers who have already purchased those products. The opinions are expressed in the form of reviews or ratings thus playing a key role in attracting future customers and helping them with decision-making [1][2][3][4]. The recommender systems utilize the interest of a user which is captured by exploiting a set of diversified features belonging to the following approaches: collaborative filtering (CF), content-based filtering, metadata-based filtering, etc. The CF follows a theory that users like what like-minded users like. Two or more users are considered like-minded if they do similar ratings of items. Since the theory followed by the CF is deemed strong by the scientific community, therefore, it has been widely been employed by a number of researchers [5][6][7].
The CF poses various steps for a recommendation and is also referred to as item-based collaborative filtering (IBCF). Furthermore, it is scalable and easy to adapt compared to other recommendation approaches. A recommendation in this system is made based on profile similarity (profile features, graphics, etc.), user product rating, and comments. The recommendations are made based on two types of user interest: Implicit and Explicit. User product views and actions relate to implicit (model-based) category. The features such as Likes and follows are associated with an explicit category (memory-based) [8][9][10][11]. The contemporary state-of-the-art on the CF recommendation system has reported classification and prediction of future items to be bought with good accuracy. However, to the best of our knowledge, none of them was able to present remarkable results for both accuracy and classification [12][13][14][15]. Trust is another factor used in various applications that needs to be evaluated. Due to the memory-based scalability issue, most of the industrial settings are managed based on the model-based method for recommendation [16][17][18][19][20]. Matrix factorization is popular in the model-based category and changes [21]. The process is going based on optimizing two low-rank metrics. Metrics are divided into items and users latent features. The combination of user-item features relationship makes the original matrix avoid over-fitting. In recent years, many approaches have been proposed related to item recommendation. In this paper, we present the CF-based recommendation system to overcome the issue of good prediction and classification accuracy at the same time. In the proposed system, user profile information and click history collected from a Jeju online shopping mall are harnessed as features. The key idea of the proposed approach is to avoid the recommendation of those items that have already been purchased. Word2vec applied for data pre-processing followed by an XGBoost machine learning algorithm to evaluate the prediction and classification accuracy. Finally, the system recommends a product based on user click rate and purchased items.
The main contributions of this paper are as follows: • The main objective of this study is the use of data mining and machine learning approaches in the context of online product shopping.

•
The presented XGBoost classifier enhancing the prediction accuracy for data mining techniques to extract hidden knowledge from an online product shopping dataset is the important side of system managing.

•
The presented work follows the analysis techniques, e.g., time series analysis, statistical analysis, product analysis, user interest analysis, access page based analysis, and purchased product analysis.

•
Collaborate filtering process applied to predict the rate of products and evaluate the neighbor items. • Word2vec encoding technique used to generate the dataset vector space and predict the surrounded context randomly.

•
We have various data pre-processing steps to change the data format in consistent format.

•
Various features extracted from dataset e.g., user information, purchased items, click information, etc. • Finally, we illustrate the constructiveness of the XGBoost model which is applied for prediction, classification, and regression of online product shopping. Figure 1 shows the prediction process of the neighbor product in the proposed system. The neighbor recommended items generated based on the user click sequences. The number of directions to a product is also important. If there are various directions provided for one product, it means that the user clicked a few times on the same product. For easier explanation, this product might be the user interest. This procedure repeated for all users. Hence, this model aims to obtain the complex relationship between user and product.  The rest of the paper is divided as follows: Section 2 presents the literature review of the proposed item-based recommendation system. Section 3 presents the data analysis for online shopping product data, prediction, and classification evaluation. Section 4 presented the predictive analysis of online shopping product data using a machine learning algorithm of the presented recommendation system. Section 5 presents the prediction result of the online shopping product, and, finally, we conclude the paper in the Section 7.

Literature Review
This section encompasses details about the contemporary state-of-the-art in the recommendation system followed by collaborative filtering with identification of some of the important issues that have been overlooked by the scientific community. The notion of recommendation systems was actuated in the late 1990s. To date, the field has become a center of attention of various e-commerce organizations such as Amazon, eBay, Shopify, etc. that are involved in recommending certain items/products to its customer.

Recommendation System
The recommendation system is a type of information filtering system that analyzes user behavior on the available dataset and predicts user preferences. User behavior contains the click information, user rating on purchased items, comments, sharing, cart information, etc. [22][23][24][25][26][27]. The main focus of this system is to help the users with a lack of available information to make a decision regarding the purchase of certain items. From the 1990s, different RS in different domains including articles, movies, and products have been presented by the scientific community [28][29][30][31][32][33][34][35]. These systems have broadly been categorized into five main classes including (1) collaborate filtering, (2) content-based recommendation, (3) knowledge-based recommendation, (4) hybrid recommendation, and (5) demographic recommendation. This categorization has been given by Burke [36]. The idea of collaborate filtering (CF) was coined by Goldberg et al. [37]. The CF has a predicting process to show user preferences and the same point between users or products [38].
In other words, CF assumes that, if two users have a similar interest in one product, they will have a similar interest in other products as well. Researchers have drawn comparisons of the neighborhood-based processes with other approaches in terms of ease of adaptability and efficiency, etc. Similarly, most of the studies have improved the performance of existing approaches using different machine learning algorithms and techniques. Besides various potential aspects of CF, it contains a sparsity issue due to a lack of information of user interest in products [39]. It has been observed that the total ratings of products given by customers are comparatively lower than the number of items purchased. This leads to most of the purchased items having few or sometimes zero ratings. During the last few years, online shopping systems have grabbed paramount attention of people across the globe. The online shopping websites like Amazon are accessible by a lot of people. In some CF approaches, the main focus is the purchased items by a user only during a certain time period. Similarly, there are many IoT-based platforms, such as healthcare [40][41][42], indoor localization [43,44], and many other IoT systems [45][46][47] which can be improved in terms of accuracy by integrating the recommendation functionality. Song et al. [48] have analyzed changes in a user behavior by constructing dependent rules from two different attributes, i.e., dataset and user behavioral changes. The study has conducted the comparison of purchasing by using the defined rules and the outcomes suggested no pattern changes. Some of the studies have focused on temporal dynamic reflection [49][50][51][52][53][54][55]. The temporal dynamic method is used to assess the association between time purchasing time or the time spent in clicking various web pages.

Item-Based Collaborate Filtering
Item-based content filtering (IBCF) is one of the traditional methods of a recommendation system. The primary purpose of this system is to collect dataset based on same product weights. To find the similarity between the item-item prediction rate and computing time provided. It is well recognized that higher weight products contain an extensive contribution. In most of the scenarios, IBCF fails to provide adequate prediction and classification accuracy. To overcome issues related to IBCF, the scientific community has presented different approaches. Gao et al. [56] combine the user ranking with products similarity and proposed page-rank-based ranking approach solely to improve the prediction accuracy. The proposed approach yielded a good performance in improving accuracy. Similarly, Koren [15] proposed dynamic preferences-based recommendation system which extracts the changes of user interest within a defined time period and predicted the accuracy of IBCF. Feng et al. [13] adopted a temporal overlapping based strategy to improve the accuracy of classification using dynamic user preferences based on the time-weight rule.

User Preference-Based Recommendation
Based on the user interest to clicked or purchased products, sometimes they are not interested in leaving comments or ratings of products or, in another point of view, they leave a false rating for specific products. This is a popular problem in most of the recommendation systems. To overcome this issue, ChoiRec12 [57] presents a model which extracts the ratings from purchased products based on the user purchase history on specific product. The collected data are only related to such products which were purchased a few times by the same user. For further processing, user-item matrix collaborate filtering was applied in the presented model. The major process in this system is a sequential purchase pattern removed from data history, and item recommendation has no provisions.

Deep Learning in Recommendation Systems
Deep learning is one of the achievements in different fields. In product recommendation systems, the autoencoder approach contributed well [58]. In [59], tag-aware recommendation developed, which shows that the user profile is presenting by tags, and, in this case, the neural network extracts the detail feature from tag layers. In another identical approach to extract the user interest and every detail of user profile based on memory components, the attention-aware recommendations developed [60]. Furthermore, the co-attention network investigates the visual information for improving the recommendation system performance [61]. Convolutional neural network (CNN) and the recurrent neural network (RNN) have the highest rate of usage in recommendation systems [62][63][64]. GAN was used to overcome the CNN and RNN negative sampling issue while during system design [65]. Table 1 presents an overview of various techniques used in the recommendation system. It is divided into four main parts. In the proposed approach, three main techniques are presented, which are specific for that research: collaborative filtering, content-based filtering, and web usage filtering. The techniques which apply in these approaches mainly focus on overcoming the sparsity, scalability, improving the performance, and improving the recommendation results. These techniques have to face some issues like cold start, Gray sheep, customer authentication, and trustworthy. The cold start issue is a computer-based potential information system that requires the data modeling degree. The Gray sheep issue is one of the recommendation system problems which increase the error rate of collaborative filtering technique. The authentication problem is also very famous in the recommendation system. This issue is mostly happening based on users' lack of identification information in their profile.

Data Analysis for Online Shopping Product Data
In this section, data collection characteristics are presented in detail. Data collection is the first and essential step of any recommendation algorithm. RS was designed to find the nearest products to user preferences. To do this, data type and information is the primary step. We perform RS by identifying user profile information and click history from Jeju online shopping mall website. Table 2 shows the overview of the processed dataset. The total number of recorded data are 294,864. Based on the collected data, we extract the purchased product records and the list reduced to 10,000 purchased product records. Data mentioned were collected based on user click history. In addition, 80% of the dataset was used for the training set, and 20% of dataset was used for the testing set.

Collection of Online Shopping Product Data
In this paper, we collected the data from Jeju online shopping mall website, the Republic of Korea, to analyze and inquire online shopping product and extract the hidden parts to improve the product recommendation system. Data mining techniques were applied to clean and manage the missing values and increase the stability and performance of data. Furthermore, the below are executed to extract the hidden parts and information for providing better service to online shopping users and customers.

•
Data collection • Data cleaning • Managing missing values • Extract the related information • Feature discovery Table 3 shows the extracted features from presented data.

Data Pre-Processing
After online shopping product data investment, data pre-processing needs to remove all unnecessary information. To do this, data pre-processing techniques have been used for transforming the raw data in a suitable form to apply for the data analysis and prediction analysis. The following steps show the pre-processing technique. First, we removed all duplicate files to improve the readability of dataset. Second, we removed the user's records which have no purchased products. Third, we kept the information which is needed for the process of recommendation systems. Figure 2 shows the main part of pre-processing in the proposed system.

Generate pattern sessions
Pre-Processing Data Figure 2. Overview of data pre-processing.

Data Analysis and Visualization
In the proposed system, we execute data analysis on the basis of data mining to extract useful and effective information for better recommendation to the user. The following analysis was applied to get the results of the available dataset. Time series analysis was used to generate unique and important information for recommending products. To generate the time series analysis, data duration is for (2019) collected information. For analyzing the data needed, information was divided into (i.e., Monthly and Daily basis) to generate the information for product shopping frequency. Figures 3 and 4 show the product shopping data according to monthly and daily basis.

User Interest Analysis
Each user who visited online shopping websites has some preferences for shopping. In this analysis, we extract the most purchased products based on users' shopping records. The most purchased item shows the users preferences and interest and mostly the quality of the product. Figure 5 presents the user interest and most visited products for shopping.

Access Page Analysis
Access page analysis shows the records of the user clicked web pages. The most visited pages are one of the options for recommending the product provided in that access page. The user clicks and reviewing products increase the weight of that product for recommending in neighborhood products. Figure 6 shows the rating of the mentioned accessed page in the collected dataset. One row is representing the page id and another one representing the number of users visiting that page.

Purchased Product Frequency
In the proposed system, analysis of the collected dataset is based on time series frequency and product types. The time series frequency describes the maximum and minimum visited pages, maximum and minimum purchased item, user preferences, the most visited dates, the most visited month, etc. Similarly, based on time series analysis, it is also easier to recognize the product ratio. Figure 7 presents the flow diagram based on purchased products, frequency, and user interest. The process of product purchasing divided into three layers. The input layer contains the extracted valuable features out of raw data. After encoding, the extracted features are passed on to the analysis layer and analysis techniques are applied on the product dataset purchased. Finally, the system visualizes and classifies online shopping products.

Proposed Extreme Gradient Boosting and Encoding Approach for Online Product Recommendation
This section proposes the predictive recommendation method used for the extracted knowledge and information in the previous section. A detailed view of the proposed architecture illustrated in Figure 8. There are multiple layers in the designed architecture. The first layer was designed as input data. After the following processing techniques, the data passed to the data analysis section. Data analysis in this procedure was divided into three main analyses: time series analysis, user interest analysis, and access page analysis. The derived features from the analysis outputs include the user click information, each item purchased by the user, etc. The process data was used to generate the word2vec encoding technique. The generated information is used as an input of the XGBoost prediction layer. The prediction process was used to generate the neighbor items based on the user click history. We compare the XGBoost algorithm result with other machine learning algorithms like "random forest", "support vector regressor", and "linear regressor", where XGBoost performed the best.

XGBoost
XGBoost is one of the powerful boosting algorithms in the machine learning system. This algorithm can predict, classify, and optimize the defined system with the highest accuracy based on the data structure. Recommended products in this system contain the combination of classification and prediction results by applying the collaborate filtering technique. The item-based collaborate filtering process is predicting the rate of the clicked products, evaluating the rate of neighbor products at the same time. If the neighbor products are very equivalent to user preferences, then the target product has also the same stage for recommending to the user. The differences between products prediction weight are effective in getting better prediction and recommendation results. To know the probability of possible predictions for products, we need to get the true values, which are defined as Equation (1): The basic statement of XGBoost prediction defined in Equation (2). The probability of the observations z i and λ i is distributed ordinary and similarly independent: Generally, the probability of recommendation contains an error rate during the process too. Decreasing the error rate of prediction processes likelihood requires increasing to evaluate the coefficient of θ i . Equation (3) defines prediction error rate decrease:

Random Forest
The Random Forest algorithm (RF) is one of the famous learning methods which is represented as an extremely outstanding and powerful technique in a machine learning system for classification problem. The main idea is to apply RF classification on a recommendation system to extract the top highly related products based on user preference. The classification technique in RF contains a few phases mentioned in the following steps:

•
Bootstrap the random sampling approach to recapture the P training set out of the raw data with the same size.

•
Develop the RF regression model to apply in a bootstrap training set based on P decision tree. • Combine all independent P decision trees to improve the regulation of RF.
Equation (4) presents the RF regression model:

Support Vector Regressor
Support vector machine (SVM) is one of the machine learning algorithms, which is also defined as regression models and is known as support vector regressor (SVR). SVR is also another classification algorithm with some differences from other mentioned algorithms. It is popular as a fast-developing strategy among machine learning algorithms. The main concept of SVR is on a regression model. It operates the support vector machine for prediction purposes of the system. One of the differences between SVR and other machine learning algorithms is that it tries to fit the best line inside the predefined error values.

Linear Regressor
A linear regression model is defined to find the relationship between target variables by fitting data into linear equations. Each variable is regarded as an ex-positive variable, and the rest is defined as dependent variables. The first type of regression model that has been used mainly in practical applications is the linear regression. There are generally two types of models, one which is linearly related to their parameters unknowingly and is easy to fit than the second model, which are nonlinearly dependent. For the first model, the estimators are also easy to determine. Therefore, linear regression has been preferable.

Implementation and Testing Environment
In this section, the implementation, development environment, and the prediction results of the proposed recommendation system and predictive analysis model are presented. The remainder of this section includes the detailed process of experimental set up of the processed environment, total model structure, and comparison of our model result with other baselines.

Experimental Setup
The experimental setup is summarized in Table 4. All experiments and results of the system are carried out using Intel(R) Core(T.M.) i7-8700 CPU @3.20 GHz processor with 32 GB memory. The collaborate filtering algorithm was used for recommendation systems. The XGBoost machine learning algorithm was used for the classification and prediction process. Similarly, the library and framework used in the proposed system was the Jupyter notebook. The programming language used in the designing of this system was WinPython-3.6.2, and the encoding approach was word2vec.

Performance Evaluation
Various performance measurement operates for evaluation of the regression problem. In this paper, we applied a statistical evaluation measurement to specify the constructiveness of our model. e.g., Mean Square Error (MSE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE).
• Mean Square Error (MSE): takes the square of differences between the actual value and evaluated value. The reason for taking the square is to drop the negative values. Equation (6) presents the evaluation of MSE: • Root Mean Square Error (RMSE): detects the error rate from the regression model and checks the error size with the size of the target value. Equation (7) presents the evaluation of RMSE:

Results and Discussion
This section presents the online product recommendation analysis and results based on the user's purchased products and clicks sequence. In Section 6.1, we present the word2vec recognition rate and computation time testing the model ten times. In Section 6.2, we report the accuracy of the proposed method and compare it with other models and, finally, the recommendation results presented in Section 6.3.

Word2vec Recognition Rate and Computation Time
Word2vec is a two-layer neural network which generates data in vector space. The proposal for using the word2vec approach in this system is to generate the nearest neighbor products based on user purchasing record and click sequences. Figures 9 and 10 show the recognition rate and computation time of the proposed system. We test our algorithm ten times to get the best rating of the system.

Accuracy
In this section, we present the experimental results of the extreme gradient boosting algorithm. In this process, we used the top 10 predicted products based on user clicks, preferences, and purchases. The mentioned extreme gradient boosting algorithm implemented in the winpython environment. Figure 11 shows the performances of the operated model such as MAE, MSE, and RMSE.

Recommendation Results
Based on the process on previous sections and generating the output for XGBoost prediction, classification, and regression results, the nearest products based on user purchased items and click information are represented in Table 5. The results are based on the word2vec technique. Each table contains five columns which represent the brief information about product recommendation-p_values mean predicted values and T_values mean true values. The nearest item column represents the neighbor columns. In this table, we mention the five nearest products based on user preference. The comparison between recommended products and predicted values shows the effectiveness of this system if the recommendation is true or false. For further explanation, this process is not limited to recommending products. The recommendation is all based on user behavior in the system. If the recommendation results contain unrelated products, this means, in user click history, that all the click information is not limited to one category and user clicks and visited various product pages. In every recommendation system, to generate the system after normalizing data, it is supposed to give the real product name or id too. Table 6 shows the real product id after the recommendation process. The presented table converts the predicted product numbers into real product numbers using the methods mentioned. Based on the data encoding process, the last two columns represent the encoded product number and the real product number.

Comparative Analysis
Based on the proposed recommendation system, we compare the achieved results with recently proposed systems in product recommendations. Figure 13 shows the other approaches' comparison with the proposed system. As it shows in Figure 13, we compare our results with three related publications [74][75][76]. In the first related study, the proposed approach is using the PMF model for a recommendation system, and the proposed model successfully obtained the accuracy of 0.72 percent for a recommendation. The second approach applied the C-Means technique and got a relatively better result than the previous approach, around 0.78 percent. The final study applied the EHCF approach in recommendation systems and obtained an accuracy of 0.794. All of the mentioned studies are recently proposed models. The proposed system results show that the XGBoost-based recommendation model has a much better result than other approaches.

Conclusions
Collaborative filtering based (CF) recommendation systems have been widely utilized by various e-commerce organizations in order to increase their product sales and satisfy their customers. The CF assumes that two like-minded people are most likely to exhibit a similar likeness pattern in the future. The contemporary state-of-the-art on CF has presented classification and prediction accuracy of different proposed models. However, none of them have presented collectivity. For instance, if classification accuracy is good, then prediction accuracy declines and vice versa. This prevents CF from being a generalized recommender system. In this paper, we have proposed XGBoost-based collaborate filtering product recommendations to evaluate the performance of CF recommendations based on user profile and click information. The proposed approach follows an idea that the recommendation of those items that have already been purchased must be filtered prior to prediction. The proposed system has extracted the items purchased by a user to improve the accuracy of prediction and classification of the model. The data preparation and processing are relying on a word2vec technique which plays an important role in data normalization. Comparison between recent related studies and our proposed XGBoost recommendation model shows the outstanding performance of our model. The system recommends nearest products based on product prediction weight. In the future, we would take into consideration the user session details and how they can impact the accuracy of the recommendation model.

Conflicts of Interest:
The authors declare no conflict of interest.