2.1. Platform Design and Implementation
Figure 1 illustrates the Unified Modeling Language (UML) functional scheme of the implemented system, applied to a company operating in the large-scale retail sector. The main objective of the present research work was the implementation of an innovative BI platform. In detail, Figure 1 shows the Use Case Diagram (UCD) of the project. This diagram outlines the main functionalities of the intelligent platform. Here, we focus on the functionality outlined in red, namely the prediction of sales through XGBoost algorithms.
Figure 2 shows the UCD of the intelligent platform. The operator carries out the data entry. Then, the data are deposited in a data warehouse (DW), which allows for storing data coming from multiple sources and, at the same time, simplifies data processing. A web app allows the customer to receive personalized purchase information, made possible by the customer profiling carried out by K-means algorithms. The various branches of the store send data in real time to a central database (DB). This is a Structured Query Language (SQL) DB, which is not suitable for archiving historical data. Therefore, the massive data are sent to the Vertica DB. The resulting DW architecture is shown in Figure 3.
We used the XGBoost algorithm to predict the future purchases of customers who have made a number of purchases equal to at least 30% of the number of days in the training period. On the basis of these predictions and of the analysis of the association rules of the products purchased by the supermarket's entire clientele, it is possible to formulate personalized offers for each customer. Customers were grouped into clusters, calculated on the basis of recency (the number of days since the last entry into the supermarket) and frequency (the frequency of entry in the last 90 days). The prediction algorithm calculates the product that will be purchased by a customer who belongs to a cluster or to the entire clientele.
Figure 4 shows the flow chart of the implemented XGBoost algorithm. The algorithm takes as input the cluster on which to make the forecast, the training period, the number of days for which the prediction is carried out, and the code of the product in question. Finally, a check box allows one to select which of the following parameters must be used by the algorithm to make the prediction. The multiple parameters used for the prediction estimation were the following:
- day of purchase of the product;
- week number;
- promotion status;
- weather conditions;
- perceived temperature;
- ratio between the number of receipts containing the specific product code and the total number of receipts issued during the day;
- ratio between the number of receipts of customers belonging to the selected cluster and the total number of receipts for the day;
- ratio between the number of receipts issued by the department of the specific product code and the total number of receipts issued during the day;
- unit price;
- ratio between the number of customers who bought that product and the total number of customers of the day;
- total number of customers of the day;
- product in promotion at the competitor;
- holiday week;
- week before a holiday week;
- holiday;
- day after a holiday;
- pre-holiday;
- closing day of the financial year.
After building the dataset, a data pre-processing activity was carried out in order to apply XGBoost. Specifically, the techniques used were the following:
- one-hot encoding, a mechanism that transforms the categorical input data into binary indicator variables (a sketch of this step is given after this list);
- clustering: in order to obtain a relevant statistic, the clientele was divided into groups of customers with similar habits.
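As an illustration of the encoding step, the following minimal sketch (with hypothetical column names such as day_of_week and weather, which are not taken from the platform's code) shows how categorical attributes can be one-hot encoded with pandas before being passed to XGBoost.

```python
import pandas as pd

# Hypothetical feature table: each row describes one product/day combination.
features = pd.DataFrame({
    "day_of_week": ["mon", "tue", "mon", "sat"],
    "weather": ["sunny", "rainy", "cloudy", "sunny"],
    "promotion": [1, 0, 0, 1],
    "unit_price": [2.49, 2.49, 2.39, 2.19],
})

# One-hot encoding: every categorical column is expanded into binary
# indicator columns, so that XGBoost receives only numerical inputs.
encoded = pd.get_dummies(features, columns=["day_of_week", "weather"])
print(encoded.head())
```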
Unless the customer makes a large number of purchases in the training period, it is difficult to obtain an accurate prediction of the products they will buy. To overcome the lack of statistics on customers who are not very loyal, customers were grouped into clusters. Clusters are segments or groups of people with similar habits in terms of recency and frequency. This segmentation increases the statistics available, since the data of the whole group are analyzed to predict what the group will buy. Subsequently, other algorithms could be used to distribute the predicted sales among the customers belonging to that cluster. Clustering is based on the calculation of the quintiles (20%, 40%, ..., 100%) of the cumulative distribution of recency (number of days since the last order) and frequency (total number of orders in the last 90 days) for each customer who made purchases on the analysis day.
The quintiles were calculated over the last 90 days. On the basis of the recency and frequency values with respect to the quintile values, a recency score (R) and a frequency score (F) were assigned to each customer. Once the two scores had been calculated, the client's cluster was determined using the segmentation map (Figure 5). The segment is given by the Cartesian point (R, F) in the graph.
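A minimal sketch of this quintile-based scoring, assuming a hypothetical customer table with recency and frequency columns (names and score orientation chosen here for illustration, not taken from the platform's code), is the following; pandas' qcut assigns each customer to one of the five quintile bins.

```python
import pandas as pd

# Hypothetical RFM-style table: one row per customer active in the last 90 days.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "recency_days": [1, 3, 7, 10, 15, 20, 30, 45, 60, 85],   # days since last entry
    "frequency_90d": [40, 35, 30, 25, 20, 15, 10, 8, 5, 2],  # entries in the last 90 days
})

# Quintile scores (assumed orientation): a recent visit and a high entry
# frequency both receive the highest score.
customers["R"] = pd.qcut(customers["recency_days"], q=5, labels=[5, 4, 3, 2, 1]).astype(int)
customers["F"] = pd.qcut(customers["frequency_90d"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# The (R, F) pair locates the customer on the segmentation map of Figure 5.
customers["segment"] = list(zip(customers["R"], customers["F"]))
print(customers[["customer_id", "R", "F", "segment"]])
```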
Furthermore, various optimization techniques for the predictive algorithm were applied, such as the grid search technique and feature selection.
Grid search, also called parameter sweep, is useful for finding the best combination of hyperparameters for the algorithm, which specifically are max_depth, min_child_weight, learning_rate (eta), n_estimators, subsample, colsample_bytree, reg_lambda, and reg_alpha, in order to minimize the cost function as much as possible. The most commonly used error estimation parameters are the root mean square error (RMSE), the mean square error (MSE), the mean absolute error (MAE), the root mean square error of prediction (RMSEP), the cross-correlation coefficient (R), and the standard deviation (SD) [46,47,48,49]. The reliability of the algorithms depends on the estimates of the errors that are made during data processing. The errors may also depend on the creation of inefficient training models. The cost function chosen to evaluate the accuracy of the implemented algorithm was the RMSE.
The other technique used to increase the accuracy of the algorithm was feature selection, which consists of selecting the features with the highest relevance for the provided training set (an illustrative sketch is given below). In fact, there is no unique set of features suitable for predicting the sold quantities of all products. The paper concludes with a detailed paragraph on the testing phase of the algorithm and the results obtained.
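As an illustrative sketch of the feature selection step (not the platform's actual code), feature relevance can be estimated from a fitted XGBoost model and used to keep only the most informative columns, for instance through scikit-learn's SelectFromModel:

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data standing in for the sales feature matrix (columns = candidate features).
rng = np.random.default_rng(42)
X = rng.random((500, 18))
y = 3.0 * X[:, 0] + 2.0 * X[:, 5] + rng.normal(scale=0.1, size=500)

# Fit a first model and rank the features by their importance.
model = XGBRegressor(objective="reg:squarederror", n_estimators=100, n_jobs=-1)
model.fit(X, y)

# Keep only the features whose importance is above the mean importance.
selector = SelectFromModel(model, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print("features kept:", X_selected.shape[1], "out of", X.shape[1])
```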
We used the Python programming language to implement the XGBoost algorithm. In particular, Python provides the XGBoost and scikit-learn open-source libraries. Using the scikit-learn wrapper interface [50,51,52] of XGBoost provides the opportunity to run the algorithm with the maximum number of parallel threads via the optional n_jobs parameter. The n_jobs = −1 option, which permits the use of all CPU cores, saves time, especially for trees that use a large number of nodes. Parallel computing is widely used in various sectors [12,53,54,55] because it optimizes computational costs.
We used the algorithm to predict the future purchases of customers who had made at least 15 purchases in the last 90 days, in order to create personalized offers for each customer. The offers are suggested by the analysis of the products purchased by the supermarket's entire clientele. This analysis was performed using association rules.
Appendix A lists the Python code of the adopted model, namely the piece of code where the XGBoost model is declared in terms of the hyperparameters: eta (alias learning_rate), max_depth, gamma, min_child_weight, subsample, colsample_bytree, reg_alpha (alias alpha), reg_lambda, and n_estimators. For this purpose, the XGBRegressor class of the XGBoost library and the GridSearchCV class of the scikit-learn library were used [56,57,58].
GridSearchCV allows the model to automatically optimize the values of the hyperparameters. In fact, initially, the values of the parameters were not defined, with the exception of the objective 'reg:squarederror' and of n_jobs, which was set to −1 to parallelize the code. When the GridSearchCV class is constructed, it is necessary to provide a dictionary of hyperparameters to evaluate. This class finds the best combination of the hyperparameters used, as it builds and evaluates a model for each combination of parameters. In particular, the correct calibration of the values of the eta and subsample hyperparameters avoids overfitting problems. This is achieved by choosing eta so that the update step is not too large. Furthermore, only a fraction of randomly extracted data is sampled for each tree; this fraction is set by the subsample value. Finally, cross-validation was used to evaluate each individual model.
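The following minimal sketch, with an illustrative parameter grid (the values are not those of Appendix A), shows how such a search can be set up: GridSearchCV evaluates every combination in the dictionary by cross-validation and exposes the best one.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.random((400, 10))
y = X[:, 0] * 5.0 + rng.normal(scale=0.2, size=400)

# Base model: only the objective and the parallelization option are fixed.
base_model = XGBRegressor(objective="reg:squarederror", n_jobs=-1)

# Illustrative grid of hyperparameters to sweep (values chosen for the example).
param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],   # eta: small update steps help against overfitting
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 1.0],             # fraction of rows sampled for each tree
    "n_estimators": [100, 300],
}

# 5-fold cross-validation with RMSE as the (negated) score to maximize.
search = GridSearchCV(
    base_model,
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
```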
2.2. XGBoost Algorithm
The XGBoost algorithm is a machine-learning algorithm composed of a sequence of weak predictors. It was introduced by Chen [59] and improves upon the gradient boosting algorithm. Gradient boosting is based on the iterative estimation of trees on the residuals obtained at each step and on the adaptive updating of the estimates. Gradient boosting uses the gradient descent technique, whereby the split that moves the objective function closest to its minimum point is chosen.
The optimizations of XGBoost, compared with gradient boosting, are due to its scalability, the use of parallel and distributed computing, tree pruning operations, management of missing values, and regularization to avoid overfitting and bias.
The input data are the set of values of the variables $x_i$ used to make a prediction on the variable $y_i$:
$$\mathcal{D} = \left\{ (x_i, y_i) \right\}, \qquad x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R},\; i = 1, \ldots, n.$$
This constitutes the training dataset. The model predicts the value of the variable $y_i$ starting from the variables $x_i$, which are characterized by multiple features. In a linear regression problem, the predicted value is $\hat{y}_i = \sum_j \theta_j x_{ij}$, where $\theta_j$ is the weight of the feature $x_{ij}$. In a generic problem, $\theta$ denotes the parameters of the model.
The objective function, which measures the model's ability to fit the training data, consists of two terms:
$$\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta),$$
where $L(\theta)$ is the training loss function and $\Omega(\theta)$ is the regularization term. The loss function is a differentiable function that evaluates the prediction. The regularization term allows one to control the model's complexity and avoid overfitting.
XGBoost uses the Taylor expansion of the loss function to write the objective function at the $t$-th iteration in the following form:
$$\mathrm{obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),$$
where $g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$, while $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)$. By defining the following quantities:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i,$$
we can write the optimal weight of the $j$-th leaf as $w_j^{*} = -\frac{G_j}{H_j + \lambda}$, where $I_j = \{ i \mid q(x_i) = j \}$ is the instance set of the $j$-th leaf and $q$ is a function that maps the data instance into a leaf of the tree, returning the index of the leaf itself. The objective function that optimizes the model is
$$\mathrm{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T.$$
Here, $T$ is the number of leaves that characterize the tree.
The computational cost of the algorithm is optimized due to the simultaneous training of all trees. Finally, the gain function, which allows one to evaluate the split candidates, is
$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{G_P^{2}}{H_P + \lambda} \right] - \gamma,$$
where the first term is the contribution relative to the left nodes (subscript $L$), the second term is relative to the right nodes (subscript $R$), and the last term is the contribution of the parent leaf node (subscript $P$), with $G_P = G_L + G_R$ and $H_P = H_L + H_R$. The split condition that generates the greatest gain is chosen. This is a pruning strategy that operates at the tree level to avoid overfitting.
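As a small numerical illustration of the gain formula (the gradient statistics below are invented for the example), the gain of a candidate split can be computed directly from the gradient and Hessian sums of the two child nodes:

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of a candidate split, following the gain formula above.

    G_* and H_* are the sums of first- and second-order gradients of the
    left and right child nodes; lam and gamma are the regularization terms.
    """
    G_P, H_P = G_L + G_R, H_L + H_R  # parent node statistics
    return 0.5 * (
        G_L**2 / (H_L + lam)
        + G_R**2 / (H_R + lam)
        - G_P**2 / (H_P + lam)
    ) - gamma


# Example with invented statistics: a positive gain means the split is worth keeping.
print(split_gain(G_L=-4.0, H_L=3.0, G_R=6.0, H_R=5.0, lam=1.0, gamma=0.5))
```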
For the implementation of the code in Python, the following open-source libraries were used:
- NumPy, which facilitates the use of large matrices and multidimensional arrays to operate effectively on data structures by means of high-level mathematical functions;
- Pandas, which allows the manipulation and analysis of data;
- XGBoost, which provides a gradient boosting framework;
- Sklearn (scikit-learn), which provides various supervised and unsupervised learning algorithms.
A seed was set to initialize the pseudorandom number generator. The random module uses the seed value as the basis for generating random numbers. This initialization allows for the replication of the same experiment, thus always obtaining the same results even if the process involves random functions.
The training dataset was created using a shuffle function that writes the records in a random order.
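A minimal sketch of this reproducibility setup (the seed value and column names are illustrative, not those of the platform) is the following; scikit-learn's shuffle utility is used here as one possible shuffle function.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

SEED = 7  # fixed seed: rerunning the script reproduces exactly the same shuffling
np.random.seed(SEED)

# Hypothetical dataset of daily product records.
dataset = pd.DataFrame({
    "plu_code": [111, 222, 333, 444, 555],
    "quantity_sold": [12, 3, 7, 25, 9],
})

# Shuffle the rows so the training set is written in random order.
training_set = shuffle(dataset, random_state=SEED).reset_index(drop=True)
print(training_set)
```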
2.3. AD Approach Applied to the Large-Scale Retail Sector
In order to further improve the performance of the model, the augmented data (AD) technique was used. This approach was already validated in the case of LSTM neural networks [43,44,45], where data were artificially created to feed a data-poor training dataset. In [44], the training dataset was increased from 768 records to 10,000 records. In the present case, the dataset consisted of 7,212,348 records (instances or rows) relating to 897 sampling days. Each of these records refers to the quantity, expressed in pieces or weight according to the type of article, sold of a single item in a store branch. For each article, the data are not sufficient to reach the desired accuracy, and for this reason augmented data were used. The following 14 parameters were the dataset attributes (variables or columns):
- receipt identification code;
- receipt date;
- customer cluster;
- PLU code;
- day of the week;
- week;
- quantity sold (number of items);
- promotion (if any);
- store branch identifier;
- unit amount;
- measured quantity (kg);
- product description;
- number of receipts;
- shop department identifier.
Each of these observations refers to transactions relating to a specific product, sold on a specific day in a specific branch, by customers belonging to a specific cluster.
Table 1 summarizes the statistics of the data.
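For illustration only, assuming hypothetical column names (the actual field names of the platform's DW export are not reported here), the record structure can be sketched as a pandas schema:

```python
import pandas as pd

# Hypothetical column names mirroring the 14 attributes listed above.
columns = {
    "receipt_id": "int64",
    "receipt_date": "datetime64[ns]",
    "customer_cluster": "int64",
    "plu_code": "int64",
    "day_of_week": "int64",
    "week": "int64",
    "quantity_sold": "int64",
    "promotion": "int64",
    "store_branch_id": "int64",
    "unit_amount": "float64",
    "measured_quantity_kg": "float64",
    "product_description": "object",
    "n_receipts": "int64",
    "department_id": "int64",
}

# An empty frame with the expected dtypes; real data would be loaded from the DW export.
records = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in columns.items()})
print(records.dtypes)
```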
Figure 6 shows the structure of the records that make up the original dataset and the methodology used to create the artificial data. The technique consists of generating a certain number of receipts for each product, in a sequential manner, for each day. The identification number of the first artificial record is equal to the number of the last record of the original dataset increased by 1. For this reason, the data were reordered on the basis of the receipt date and not of the identification number.
The number of receipts generated for a certain product in a given day is chosen randomly by setting an upper and a lower threshold. In a similar way to the technique used in [23], we used the random method random(a,b) to generate real numbers in the range (a,b). Furthermore, in generating the quantity sold and the number of receipts relating to a product, the average quantity sold per item of that product was taken into account.
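A minimal sketch of this augmentation step is given below; the thresholds, column names, and the use of Python's random.uniform as the random(a, b) generator are illustrative assumptions, not taken from the original implementation.

```python
import random
import pandas as pd

random.seed(7)

def augment_product(original: pd.DataFrame, plu_code: int, day: str,
                    min_receipts: int = 5, max_receipts: int = 20) -> pd.DataFrame:
    """Append artificial receipts for one product on one day.

    The number of new receipts is drawn at random between the two thresholds,
    and the quantity on each receipt is drawn around the product's average
    quantity sold per item in the original data.
    """
    product_rows = original[original["plu_code"] == plu_code]
    avg_quantity = product_rows["quantity_sold"].mean()

    n_new = int(random.uniform(min_receipts, max_receipts))
    next_id = original["receipt_id"].max() + 1  # ids continue after the last original record

    new_rows = pd.DataFrame({
        "receipt_id": range(next_id, next_id + n_new),
        "receipt_date": [day] * n_new,
        "plu_code": [plu_code] * n_new,
        "quantity_sold": [max(1, round(random.uniform(0.5, 1.5) * avg_quantity))
                          for _ in range(n_new)],
    })
    augmented = pd.concat([original, new_rows], ignore_index=True)
    # Reorder by receipt date, not by identification number, as described above.
    return augmented.sort_values("receipt_date", ignore_index=True)


# Toy original dataset.
original = pd.DataFrame({
    "receipt_id": [1, 2, 3],
    "receipt_date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "plu_code": [111, 111, 111],
    "quantity_sold": [2, 4, 3],
})
print(augment_product(original, plu_code=111, day="2021-03-03"))
```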
Figure 7 shows the statistical distribution of the number of receipts. A logarithmic scale was used for the histogram of this dataset attribute.
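For completeness, a histogram of this kind can be produced, for example, with matplotlib; the data below are synthetic and the logarithmic y-axis is an assumption about which axis Figure 7 rescales.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the "number of receipts" attribute.
rng = np.random.default_rng(0)
n_receipts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000).astype(int)

plt.hist(n_receipts, bins=50)
plt.yscale("log")  # logarithmic scale, as in Figure 7
plt.xlabel("number of receipts")
plt.ylabel("count (log scale)")
plt.title("Distribution of the number of receipts")
plt.show()
```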