An Ensemble Framework to Improve the Accuracy of Prediction Using Clustered Random-Forest and Shrinkage Methods

Abstract: In prediction problems, reducing computational time is a challenging issue that has attracted much attention, in addition to increasing the accuracy of existing algorithms. Since existing methods may lack sufficient efficiency and accuracy, we use a combination of machine-learning algorithms and statistical methods to address this problem. Furthermore, we reduce the computational time in the testing phase by automatically reducing the number of trees using penalized methods and ensembling the remaining trees. We call this efficient combinatorial method the “ensemble of clustered and penalized random forest (ECAPRAF)”. This method consists of four fundamental parts. In the first part, k-means clustering is used to identify homogeneous subsets of data and assign them to similar groups. In the second part, a tree-based algorithm is used within each cluster as a predictor model; in this work, random forest is selected. In the third part, penalized methods are used to reduce the number of random-forest trees and remove high-variance trees from the proposed model. This increases model accuracy and decreases the computational time in the test phase. In the last part, the remaining trees within each cluster are combined. Based on the WRMSE criterion, the results of the simulation and of two real datasets show that the proposed method performs better than the traditional random forest: the ECAPRAF-EN algorithm reduces the error by approximately 12.75%, 11.82%, 12.93%, and 11.68% while selecting 99, 106, 113, and 118 trees, respectively.


Introduction
Ensemble learning is a powerful tool for classification and prediction that has been extensively studied in statistics and machine learning. These methods use several algorithms to enhance the performance of the model and increase the prediction accuracy: several weak algorithms are combined in a specific pattern to create a strong learner that performs better than a single learner. The most well-known ensemble-learning algorithms are bagging, boosting, and random forest (RF). Bagging [1] produces regression trees by random selection with replacement. In addition, the generated trees do not depend on previous trees, and each one is independent of its peers. These deep trees grow without pruning, so each tree has high variance and low bias, and averaging them yields low variance. Random forest is a developed form of bagging that is used for both regression and classification. Some methods similar to RF discussed in the literature do not work as well as the RF proposed by Breiman [2]. Dietterich [3] proposed a method in which each node is randomly selected from the k best splits. Amit et al. [4] suggested the first random-selection algorithm based on the best separators, in which k trees are generated by the random vector. In the RF, a subset of predictors is selected as a separator in each node. Each node is grown and split by randomly selecting attributes.
The number of trees in the RF has always been a concern to researchers. A large number of trees does not always mean good prediction accuracy; in some cases, it increases the error, causes overfitting, and decreases the accuracy. Using RF, Khan et al. [5] proposed a method in which the optimal RF trees are determined by the lowest out-of-bag (OOB) error. To overcome the problem of a large number of trees, we propose a method in which shrinkage methods, which perform variable selection, are used to reduce the number of RF trees. This method not only increases accuracy but also prevents overfitting. In our method, a subset of all the RF trees used in the prediction is selected using shrinkage methods. In [6], the post-selection boosting random-forest algorithm was suggested, which utilizes lasso regression to reduce the number of RF trees and improve the performance of the RF algorithm. On the other hand, clustering the correlated data, identifying the homogeneous subsets of predictors, and assigning them to similar clusters can be effective at increasing the prediction accuracy. For this reason, k-means clustering is used in this paper.
Major Contribution: In the present paper, we propose a hybrid approach called ensemble of clustered and penalized random forest (ECAPRAF), which is shown in Figure 1. This method has four fundamental parts: clustering, predicting, shrinking, and ensembling. The purpose of the clustering part is to identify homogeneous subsets of data and assign them to similar groups. Placing similar data in the same cluster reduces the variance and increases the correlation within clusters, and it increases the prediction accuracy in order to create an optimal model. For this purpose, k-means clustering is used. The second part, which is related to the prediction algorithm, is responsible for producing the initial trees with low error. The main idea is to use a set of weak predictors with high accuracy to make initial decisions that can be summed together to obtain the final prediction. To do this, RF is used. RF is chosen as the prediction algorithm because it produces low-variance trees that can be used as inputs for the shrinkage methods. The next part uses the shrinkage methods to reduce the number of RF trees within each cluster, so that the final prediction in each cluster is performed without an initial selection of trees. As a result, the accuracy of the model can be increased compared to traditional RF. In the last part, the remaining trees are ensembled to improve the power of the learners. This is performed using the weighted mean of the trees in the clusters. Averaging reduces the variance of the individual trees, and if one learner is weak, then the other learners correct it or reduce the error. Finally, this method provides an efficient algorithm to increase the prediction accuracy, improve the traditional RF algorithm, and automatically reduce the computational time and the number of RF trees within clusters.
Motivation: The purpose of this study is to provide a combination of statistical methods and machine-learning algorithms in order to increase accuracy and decrease computational time. This is done by applying shrinkage methods such as lasso, elastic net, and group lasso to random-forest trees and ensembling the remaining trees. In addition, the data are clustered and homogenized in order to reduce the variance and increase the correlation within the clusters, which helps to create an optimal model and increase the prediction accuracy.
Paper Organization: The rest of the paper is organized as follows: Section 2 examines previous works in ensemble learning, shrinkage methods, machine-learning algorithms, and their combination. In Section 3, a brief theory of random forest, the k-means algorithm, and the shrinkage methods is introduced. Section 4 describes the main framework and a summary of the proposed method. The simulation study to evaluate the performance of the proposed model is presented in Section 5. In Section 6, the experimental results are analyzed and discussed based on two real datasets. Finally, Sections 7 and 8 include the discussion and the conclusions and future work, respectively.

A dataset undergoes clustering in order to identify homogeneous subsets of data and assign them to similar groups. Then, trees are extracted as predictors by random forest and reduced by shrinkage methods. Elimination through the shrinkage methods leads to a decrease in error and an increase in the model accuracy. In the last part, the remaining trees are ensembled to improve the performance of the model.

Literature Review
In many articles, ensemble-learning-based methods are used for prediction and defect diagnosis. For example, in [7] and [8] they were used to find bridge defects and software defects, respectively. In [7], ensemble methods were used to predict bridge defect conditions to help bridge managers make more rational and informed steel-bridge-maintenance decisions. For this purpose, six ensemble-learning models, namely, random forest, ExtraTree, AdaBoost, GBDT, XGBoost, and LightGBM, were used. In [8], random forest, ExtraTree, AdaBoost, gradient boosting, histogram-based gradient boosting, XGBoost, and CatBoost were used for the prediction of software defects and the automatic identification of defective parts of the software.
Ensemble-learning algorithms are not only used in machine learning but are also widely used in deep learning. For instance, in [9], a combination model of CNN and SVM was presented, which used the SVM as an ensemble method to aggregate the CNNs. In [10], six algorithms, namely, ANNS, ANNN, ridge regression, lasso regression, MLR, and elastic-net regression, were used to build a model that can be used to predict rock tensile strength. In [11], the least absolute shrinkage and selection operator (lasso) and ridge regression in conjunction with the logistic-regression (LR) method were employed for feature selection. Then, classification algorithms such as KNN, random forest (RF), and logistic regression were used for predicting the results. This study presented a novel hybrid model for the diagnosis and prediction of liver cancer.
Researchers in various fields have recently focused on the combination of shrinkage methods and machine learning to improve the performance of traditional algorithms. In [12], clustering and coefficient estimation were simultaneously performed using a cluster correlation-network support vector machine, in which clusters were penalized by lasso, SCAD, and MCP. Previously, the combinations of elastic net and SVM [13], elastic net and RF [14], and SVM with lasso, ridge, and SCAD [15,16] were used. In [17,18], a combination of variable clustering and feature selection was also used to reduce the dimension. In addition, hierarchical clustering and variable selection of the main variables were used to evaluate the performance of RF. Tutz and Koch [19] enhanced nearest-neighbor classifiers by using selection methods such as lasso or boosting, so that the relevant nearest neighbors were automatically selected. Bouveyron and Brunet [20] proposed a method that adapted the traditional mixture model for modeling and classifying data in a latent discriminative subspace. To generate the proposed discriminative latent mixture (DLM) model, the model-based clustering goals and the discriminative criterion introduced by Fisher were combined. Farhadi et al. [21], by applying shrinkage methods such as lasso, ridge, and elastic net to simple linear regression, showed that elastic net had better performance even though it was not combined with machine-learning algorithms. In the present paper, we show that elastic net in combination with machine-learning algorithms performs better than the other shrinkage methods.

Materials and Methods
The proposed methodology consists of four main stages:
1. Clustering a dataset by employing the k-means clustering algorithm to identify homogeneous subsets of data and assign them to similar groups;
2. Using the RF algorithm within each cluster as a predictor;
3. Reducing the number of trees to increase model accuracy and decrease computational time in the test phase;
4. Ensembling the remaining trees within each cluster.
Figure 1 shows a flowchart of ECAPRAF. More details of the proposed model are explained in the next sections.

K-Means Clustering
For many years, cluster analysis has been a main topic of data-mining studies and a useful tool in data science. In addition, it is a well-known method for identifying homogeneous subsets of predictors in a dataset. K-medoids, k-means, and other clustering algorithms are widely used in many statistical analyses. K-means clustering is one of the most common and simple unsupervised learning methods in machine learning and is used to solve various problems in statistics, computer science, genetics, and engineering. This algorithm partitions the dataset into K distinct and nonoverlapping clusters. Therefore, the data within a cluster have similar characteristics, while data in different clusters have different characteristics. The data within clusters are grouped based on similarity and the squared Euclidean distance criterion. The k-means algorithm takes the number of clusters k as an input parameter; its optimal value is obtained using the Gap statistic [22] in the NbClust package [23]. After specifying the number of clusters, the k-means algorithm assigns each observation to exactly one of the k clusters. The minimization of the residual sum of squares (RSS) can be used to determine the centers of the clusters. In principle, k-means clustering is an optimization problem aimed at minimizing the objective function among groups. Although both clustering and classification divide the dataset into different classes, clustering, unlike classification, does not predict the variables; it just divides the dataset into homogeneous groups.
Suppose the dataset $D = (x_1, x_2, \ldots, x_p)$ contains $N$ points, and suppose the clusters obtained after applying k-means are $C = \{C_1, C_2, \ldots, C_K\}$. The RSS objective function used to determine the clusters is defined as follows:
$$\mathrm{RSS} = \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2, \qquad (1)$$
where $|C_k|$, $C_k$, $x_{ij}$, $i \in C_k$, and $K$ denote the number of observations in the k-th cluster, the clusters, the value of the j-th variable for the i-th observation, the i-th observation in the k-th cluster, and the number of clusters, respectively.
By minimizing Equation (1), the minimization problem of k-means clustering is solved as follows:
$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}. \qquad (2)$$
The k-means clustering steps are as follows:
1. The number of clusters is selected;
2. The cluster centers are initialized;
3. The data are assigned to the closest cluster based on the distance criterion. The data proximity is determined by the distance from the cluster centroids, measured by the squared Euclidean distance;
4. The centers of the clusters are the averages of the clusters and are updated with the clusters obtained from the previous step. Other methods, such as k-medoids and k-medians, determine the centers of the clusters according to their own criteria;
5. The two previous steps are repeated until the centers of the clusters do not change and the convergence criterion is satisfied [24].
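The five steps above can be illustrated with a minimal sketch of the standard k-means iteration in plain Python. The function names `squared_distance` and `kmeans`, the random initialization from the observations, and the stopping rule based on unchanged assignments are assumptions made for illustration; the paper itself relies on R packages rather than on this code.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means iteration; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 2: initialise the centers with k randomly chosen observations.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each center to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

The number of clusters (step 1) is passed in as `k`; in the paper it is chosen with the Gap statistic.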

Random-Forest Algorithm
Although the decision-tree approach suggested by Quinlan [25] is useful, it builds a very deep and complex tree that suffers from high variance and overfitting due to its depth. Therefore, it requires pruning techniques. Breiman [2] introduced the random-forest algorithm to reduce overfitting in the decision tree. This reduction was achieved by building an ensemble of M trees [26]. To obtain M decision trees in the RF algorithm, suppose $X_1, X_2, \ldots, X_p$ and $y_1, y_2, \ldots, y_N$ are the explanatory variables and the response values, respectively. A subset of features of size mtry = p/3 is randomly selected. The predictor space is divided into J distinct and non-overlapping regions, $R_1, R_2, \ldots, R_J$. Then, for every observation that falls into the region $R_j$, the same prediction is made, which is the mean of the response values of the observations in $R_j$. The predictor and the cut point are selected so that the resulting tree has the minimum RSS. The aim of building the decision trees in the RF is to minimize
$$\sum_{i:\, x_i \le s} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i > s} (y_i - \hat{y}_{R_2})^2, \qquad (3)$$
where $\hat{y}_{R_1} = \frac{1}{N_1} \sum_{x_i \le s} y_i$ and $\hat{y}_{R_2} = \frac{1}{N_2} \sum_{x_i > s} y_i$ are the mean responses of the training observations in $R_1$ and $R_2$, respectively, $N_1$ and $N_2$ are the numbers of samples in $R_1$ and $R_2$, and $s$ is the cut point.
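As a small illustration of the split criterion described above, the following sketch searches the cut point s on a single predictor that minimizes the sum of the residual sums of squares of the two regions $x \le s$ and $x > s$. The function name `best_split` and the exhaustive search over observed values are assumptions for illustration, not the implementation used in the paper.

```python
def best_split(x, y):
    """Return (s, rss): the cut point on one predictor that minimises
    the residual sum of squares of the two resulting regions."""
    def rss(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = (None, float("inf"))
    for s in sorted(set(x))[:-1]:  # the largest value would leave R2 empty
        left = [yi for xi, yi in zip(x, y) if xi <= s]   # region R1: x <= s
        right = [yi for xi, yi in zip(x, y) if xi > s]   # region R2: x > s
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (s, total)
    return best
```

In a random forest, this search would be repeated at every node over the mtry randomly chosen predictors, and the predictor with the smallest criterion would be used for the split.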
Figure 2 shows the random-forest algorithm, the steps of which can be expressed as follows:
1. Construct trees according to the bootstrap dataset. (In this step, the trees are grown without pruning, and mtry predictors are selected at each node as split candidates.)

5. Final prediction over all M regression trees, obtained as the average of the tree predictions $T_m(x)$:
$$\hat{y}(x) = \frac{1}{M} \sum_{m=1}^{M} T_m(x). \qquad (4)$$

Shrinkage Methods
Variable selection is a basic task of statistical learning that can be performed using shrinkage methods. The aim of these methods, such as lasso, elastic net, and group lasso, is to select some of the variables to build a new model. In this paper, these methods are used to reduce the number of RF trees in order to enhance the RF performance. In cases where the number of RF trees exceeds the number of observations (Ntree > N), shrinkage methods can be used to reduce the number of trees.

Lasso Regression
Lasso is one of the most common variable-selection methods, introduced by Tibshirani [27]. Lasso regression minimizes the residual sum of squares subject to the sum of the absolute values of the regression coefficients being less than a constant t. This method is used for high-dimensional data as well as in the presence of multi-collinearity [28]. In the variable-selection process, some coefficients may be estimated at exactly zero, and the variables with non-zero coefficients stay in the model. Lasso tries to find the regression model that leads to the minimum residual sum of squares, while the coefficients are limited by applying a constraint to the regression model. This penalty depends on the tuning parameter λ. Suppose (X, Y) is a dataset so that $X = (x_1, \ldots, x_p)$ contains the predictor variables and Y is the response variable. The objective function is as follows:
$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad (5)$$
where β is the vector of linear-regression coefficients and λ ≥ 0 is the tuning parameter that controls the amount of shrinkage. The value of λ is selected by cross-validation. If λ is set to zero, then the lasso estimator is the same as OLS. Generally, increasing λ causes more coefficients to become zero [29].
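To make the shrinkage mechanism concrete, the sketch below fits a lasso model by cyclic coordinate descent with soft-thresholding. It is a minimal illustration under stated assumptions: the predictors are centred so that no intercept is needed, the objective is the residual sum of squares plus λ times the sum of absolute coefficients, and the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical. The paper itself uses the glmnet package in R.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j|.
    X is a list of rows; columns are assumed centred (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of predictor j with the partial residual
            z = 0.0    # squared norm of predictor j
            for i in range(n):
                # Partial residual that leaves out predictor j.
                r_ij = y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * r_ij
                z += X[i][j] ** 2
            # The factor 1/2 comes from differentiating the squared error.
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta
```

With `lam = 0` the update reduces to ordinary least squares, and as `lam` grows more coefficients are set exactly to zero, which is the property used to discard trees.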

Elastic Net Regression
Although the lasso regression carries out variable selection, it has some defects. The first defect concerns the number of features that the lasso regression selects. When the dataset is high-dimensional (p > N), lasso selects at most N features, because the number of selected features is limited by the sample size. The second defect concerns the selection of correlated features. The lasso regression selects only some of the correlated features, while it is expected to select all correlated features or remove them all. In [30,31], the elastic-net regression was introduced to overcome these defects. It is a combination of two methods: lasso regression ($\ell_1$-norm) [27] and ridge regression ($\ell_2$-norm) [32]. The $\ell_1$-norm part of the penalty produces a sparse model by shrinking some regression coefficients exactly to zero. On the other hand, the $\ell_2$-norm part of the penalty removes the limit on the number of selected variables and encourages correlated features to be selected together. In addition, the weight between the two penalties is determined by α, so that the elastic net becomes ridge regression at α = 1 and lasso regression at α = 0. Intermediate values blend the $\ell_2$ and $\ell_1$ penalties (i.e., 0 ≤ α ≤ 1).
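Suppose (X, Y) is a dataset so that $X = (x_1, \ldots, x_p)$ contains the predictor variables and Y is the response variable. Elastic net uses the combination of the $\ell_2$ and $\ell_1$ penalties, which can be defined as follows:
$$\hat{\beta}^{\text{EN}} = \underset{\beta}{\arg\min} \left\{ \| Y - X\beta \|_2^2 + \lambda \left[ (1 - \alpha) \|\beta\|_1 + \alpha \|\beta\|_2^2 \right] \right\}, \qquad (6)$$
where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ and $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ are the $\ell_1$-norm and the squared $\ell_2$-norm of β, respectively, and λ ≥ 0 is a tuning parameter that is selected by cross-validation.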

Group-Lasso Regression
In machine-learning problems, methods such as group lasso can be used to find groups of variables. Group lasso [33,34] is another popular method for variable selection, which uses the $\ell_1$-norm for the selection and shrinkage of variables. This method is a generalization of lasso that performs group-wise variable selection: it applies group constraints to the variables and estimates the coefficients as a group according to the group structure. Consequently, all coefficients in a group become either zero or non-zero together. In contrast, lasso and elastic net perform shrinkage for each variable separately. The parameter vector β is divided into the groups $G_1, G_2, \ldots, G_q$, where $\cup_{j=1}^{q} G_j = \{1, 2, \ldots, p\}$, so that the vector β is $\beta = (\beta_{G_1}^\top, \ldots, \beta_{G_q}^\top)^\top$. Suppose (X, Y) is a dataset so that $X = (x_1, \ldots, x_p)$ contains the predictor variables and Y is the response variable. The group-lasso estimator is defined as follows:
$$\hat{\beta}^{\text{GL}} = \underset{\beta}{\arg\min} \left\{ \Big\| Y - \sum_{j=1}^{q} X_{G_j} \beta_{G_j} \Big\|_2^2 + \lambda \sum_{j=1}^{q} m_j \|\beta_{G_j}\|_2 \right\}, \qquad (7)$$
where $m_j$ is a coefficient that balances groups of different sizes. The value of $m_j$ is selected as $\sqrt{T_j}$, where $T_j$ is the cardinality of $G_j$, and λ is the tuning parameter that controls the amount of regularization [35].
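The all-or-nothing behaviour of group lasso can be seen in the special case of an orthonormal design, where the solution is obtained by shrinking the least-squares estimate block by block. The sketch below assumes this special case, a squared-error term scaled by 1/2 (which only rescales λ), and the common weight $m_j = \sqrt{T_j}$; the function names are hypothetical, and the paper itself uses the gglasso package in R for the general problem.

```python
import math

def group_soft_threshold(z, threshold):
    """Shrink the whole coefficient block z toward zero; the block is set
    exactly to zero when its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= threshold:
        return [0.0] * len(z)
    scale = 1.0 - threshold / norm
    return [scale * v for v in z]

def group_lasso_orthonormal(z, groups, lam):
    """Group-lasso solution for an orthonormal design, where z is the
    least-squares estimate and `groups` lists the indices of each G_j."""
    beta = [0.0] * len(z)
    for idx in groups:
        m_j = math.sqrt(len(idx))  # weight sqrt(T_j) for a group of size T_j
        block = group_soft_threshold([z[i] for i in idx], lam * m_j)
        for i, b in zip(idx, block):
            beta[i] = b
    return beta
```

A group with a small joint norm is removed as a whole, while a retained group keeps all of its members with a common shrinkage factor.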

Structure of Ensemble of Clustered and Penalized Random-Forest Model
In this section, we present a combination of machine-learning algorithms and statistical methods that reduces the number of RF trees and increases prediction accuracy. We call this method ECAPRAF; it is formulated as an optimization problem that yields the optimal values of the regression coefficients defined on the RF trees and reduces the number of trees within clusters. The coefficients are obtained by solving the following equation: where $p_\lambda(\beta_t)$ can be the lasso, elastic-net, or group-lasso penalty, with the lasso penalty given in Equation (11) of the theorem. $\beta_k$ is the regression coefficient defined for the RF trees in the k-th cluster, $T_k(z_u)$ denotes the RF trees in the k-th cluster, and $y_k(z_u)$ is the mean of the predicted values in RF. The major problem in solving Equation (8) is that it does not have a closed-form solution. To overcome this, numerical methods can be used. For this purpose, the clusters are kept fixed and Equation (8) is solved with respect to $\beta_k$. According to the proposed model, the theorem below is presented to estimate the optimal $\beta_k$.
Theorem: Suppose PRSS($\beta_k$) is the penalized residual sum of squares: and $p_\lambda(\beta_t)$ is the lasso penalty. The optimal value of $\beta_k$ that minimizes PRSS($\beta_k$) is obtained by solving the following equation: Proof: does not have a closed form and can be solved using numerical methods.
Figure 3 shows the hybrid structure of the proposed model (ECAPRAF). As shown in the figure, the dataset $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ is clustered into k clusters $C_1, \ldots, C_k$ to identify the homogeneous subsets of data. Within each cluster, the data are divided into the training set $D = \{(x_1, y_1), \ldots, (x_k, y_k)\}$ and the test set $Z = \{z_1, z_2, \ldots, z_d\}$, which are used for training and evaluation, respectively. The $M_k$ trees $T_{j_k}(z_u)$, $j_k = 1, \ldots, M_k$, $u = 1, \ldots, d_k$, are obtained from RF by training on the training set and are then evaluated on the test set. Finally, the matrix $T_k(z_u) = (1, T_{1_k}(z_u), \ldots, T_{M_k}(z_u))$, which is generated from the output of the RF trees, represents the explanatory variables. The predicted results obtained from the RF trees are denoted $y_k(z_u)$, $k = 1, \ldots, K$. This step produces the predicted trees for the next stage. The constructed trees are used as the inputs of the lasso, elastic-net, and group-lasso regressions with the response variables $y_k$ to automatically incorporate these trees into the model. Suppose that in the k-th cluster, these trees form a linear regression with the independent variable $T_k$ and the dependent variable $Y_k$. The linear-regression model corresponding to the cluster is:
$$Y_k = T_k \beta_k + \varepsilon_k, \qquad k = 1, \ldots, K, \qquad (12)$$
where $\beta_k$ is the coefficient vector of the trees and $\varepsilon_k$ is the error term. The purpose is to estimate the coefficients of the trees so that some of them are shrunk toward zero by the shrinkage methods. The trees whose coefficients become exactly zero are removed from the model. Reducing the number of trees creates a more appropriate and efficient model than the traditional RF. In principle, it reduces the error and enhances the model performance. Finally, $m_k$ trees are selected in each cluster. The weak learners are corrected by other strong learners and the error is reduced by averaging, so that the total number of trees over all clusters becomes m.
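The selection and ensembling step described above can be sketched as follows: once a penalized regression has returned one coefficient per tree, the trees with zero coefficients are dropped and the cluster prediction is formed from the remaining ones. The function names and the representation of tree outputs as a list of rows are assumptions made for illustration; the coefficients would come from a penalized fit such as lasso or elastic net.

```python
def select_trees(beta):
    """Indices of the trees whose penalised coefficient is non-zero."""
    return [j for j, b in enumerate(beta) if b != 0.0]

def cluster_predict(tree_preds, intercept, beta):
    """Prediction for one cluster: the intercept plus the coefficient-weighted
    sum of the retained trees. tree_preds[u][j] is the prediction of tree j
    for test point u."""
    kept = select_trees(beta)
    return [intercept + sum(beta[j] * row[j] for j in kept) for row in tree_preds]
```

For example, with three trees and coefficients `[0.5, 0.0, 0.5]`, only the first and third trees are evaluated at test time, which is the source of the reduction in computational time.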
To determine the value of the tuning parameter $\lambda_k$ in Equation (11), 10-fold cross-validation is implemented to find the $\lambda_k$ within each cluster that gives the minimum error. In the case where no prior information is available for the value of k, the number of groups can be specified using formal statistical tests. One of them is the Gap statistic [22], which is as follows:
$$\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W_{kb}) - \log(W_k), \qquad (13)$$
where B is the number of reference datasets, $W_{kb}$ is the within-cluster dispersion of the b-th reference dataset, and $W_k$ is the within-cluster dispersion of the observed data. The optimal number of clusters is the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is the simulation standard error. Several unsupervised-clustering test statistics related to this issue include the Gap index, Gamma index, and Friedman index. They are available in two R packages, the NbClust package [23] and the cluster package.
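The Gap computation can be written in a few lines once the within-cluster dispersions are available. The sketch below assumes that the dispersion of the observed data and of the B reference datasets have already been computed for each k, and that the number of clusters is chosen with the standard rule of Tibshirani et al. (the smallest k whose Gap is within one standard error of the next one); the function names are hypothetical.

```python
import math

def gap_statistic(w_k, w_kb):
    """Gap(k) = (1/B) * sum_b log(W_kb) - log(W_k)."""
    b = len(w_kb)
    return sum(math.log(w) for w in w_kb) / b - math.log(w_k)

def choose_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; gaps[0] is k = 1."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1
    return len(gaps)
```

In practice, these quantities are produced by the R packages cited in the text rather than computed by hand.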

Simulation Study
In this section, the performance of the proposed ECAPRAF is investigated through a simulation study and two real datasets. The prediction performance of ECAPRAF is compared with that of RF. In addition, the three variants of the proposed method, ECAPRAF-Lasso, ECAPRAF-EN, and ECAPRAF-GL, are compared with one another.

Simulation Study Design
In this subsection, a Monte Carlo simulation study is conducted to assess the performance of RF and the proposed model. All computational procedures are conducted in R. The procedure has four parts aimed at improving the performance of RF. The steps include k-means clustering to identify homogeneous groups of data, the RF algorithm to produce the prediction trees, and shrinkage methods to reduce the number of trees. Finally, the remaining trees are ensembled to aggregate them. As a result, the error rate is reduced and the performance of the traditional RF algorithm is improved. In the end, the proposed hybrid algorithm and the RF algorithm are evaluated in terms of the number of selected trees and the error criteria.
In this study, we assume that the simulation dataset includes N = 500 random samples and p = 4 predictor variables for the linear model. The simulation dataset is partitioned into different clusters using k-means clustering for homogenization. The within-cluster data are grouped based on their similarity so that there is less dispersion and high correlation within clusters and less correlation between clusters. Within clusters, 80% of the data are selected for the training set, and the rest are used for the testing set. In the RF algorithm, Ntree = 300, 500, 800, and 1000 are considered for the total number of trees. Of these, 100, 240, 200, and 300 trees are assigned to the first cluster and 200, 260, 600, and 700 trees to the second cluster. The details of the hyper-parameters used in this framework are given in Table 1.
The linear-regression model is defined as Equation (14), mentioned in Wang's paper. The variables $x_1$, $x_2$, $x_3$, and $x_4$ are generated from the standard normal distribution N(0, 1), and the response follows the regression model below:
$$y = x_1 - 9x_2 - 3x_3 - 7x_4 - 4 + \varepsilon_1, \qquad (14)$$
where $\varepsilon_1$ obeys the normal distribution $N(0, \sigma_1^2)$. The value of $\sigma_1^2$ is equal to 1/3 of the standard deviation of $x_1 - 9x_2 - 3x_3 - 7x_4 - 4$. In the first step, the dataset is partitioned into two clusters based on k-means clustering. The 3D scatter plot is drawn in Figure 4. It shows the number of clusters and the scattering of points in three dimensions. Each cluster is shown in a different color. As can be seen, two clusters are obtained for the simulation dataset, highlighted in blue and red. The first cluster, in red, contains 256 samples and the second one, in blue, contains 244 samples. The dataset consists of four features, of which three ($x_1$, $x_2$, and $x_3$) are used to draw the plot. In the second step, in each cluster, the RF algorithm is used to produce the initial decision trees with high accuracy and low variance, which are the inputs of lasso, elastic net, and group lasso. The linear-regression model defined for the prediction trees of each cluster k = 1, ..., K involves the mean of the trees obtained from the RF algorithm and the linear-regression coefficient $\beta_k$. The length of each $\beta_k$ changes with the number of trees in the cluster. For 500 trees, $\beta_1$ and $\beta_2$ have 240 and 260 components, respectively:
$$\beta_1 = (\underbrace{0.01, 0.01, 0.02, 0.02, \ldots, 0.08, 0.08, 0.09, 0.09}_{s=90}, 0, \ldots, 0),$$
$$\beta_2 = (\underbrace{0.01, 0.01, 0.02, 0.02, \ldots, 0.08, 0.08, 0.09, 0.09}_{s=90}, 0, \ldots, 0).$$
In the next step of the proposed algorithm, the lasso, elastic-net, and group-lasso regressions are applied to shrink the tree output of each cluster. These methods are described in Section 3.
Finally, the number of trees is automatically selected in each cluster without prior selection. Some trees are removed, and the model is estimated based on the remaining trees. In the last step, the remaining trees are ensembled to reduce the error rate. The model obtained from the combination of machine-learning algorithms and statistical methods improves on the performance of both the traditional RF algorithm and the algorithm proposed in [6].
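The data-generating part of the simulation design can be sketched as follows, assuming the linear signal $x_1 - 9x_2 - 3x_3 - 7x_4 - 4$ quoted in the text, a noise variance equal to one third of the standard deviation of that signal, and an 80/20 split. The function names, the seeds, and the use of the sample standard deviation are assumptions made for illustration; the study itself was run in R.

```python
import math
import random
import statistics

def simulate(n=500, seed=1):
    """Draw n observations from the linear model of the simulation design."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n)]
    signal = [x1 - 9 * x2 - 3 * x3 - 7 * x4 - 4 for x1, x2, x3, x4 in X]
    # Noise variance: one third of the standard deviation of the signal.
    sigma2 = statistics.stdev(signal) / 3.0
    y = [m + rng.gauss(0.0, math.sqrt(sigma2)) for m in signal]
    return X, y

def train_test_split(rows, train_fraction=0.8, seed=1):
    """Shuffle the row indices and split them 80/20 within one cluster."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(train_fraction * len(idx))
    return idx[:cut], idx[cut:]
```

In the full procedure, the split would be applied separately inside each cluster returned by k-means.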

Simulation Results
In this subsection, the performance of our approaches, described in Section 5.1, is compared with RF. The results of the methods are calculated with 500 repetitions, and the prediction accuracy is evaluated using the mean-squared error (MSE), root-mean-squared error (RMSE), mean absolute error (MAE), weighted mean-squared error (WMSE), weighted root-mean-squared error (WRMSE), and weighted mean absolute error (WMAE). In these criteria, $\hat{y}_i$ and $y_i$ are the predicted and true values of the i-th sample, and $w_i$ is the number of samples in the k-th cluster. In Figure 5a,c, which show the relationship between MSE and log(λ), the red dots indicate the cross-validation error, and the upper and lower bars display the standard deviation for ECAPRAF-EN. The left and right vertical lines show $\lambda_{min}$ and $\lambda_{1se}$, respectively. $\lambda_{min}$ is the value of λ with the lowest cross-validation error, and $\lambda_{1se}$ is the largest value of λ whose error is within one standard error of the minimum. The numbers along the top of each plot show the number of non-zero coefficients in each cluster. Figure 5b,d show the ECAPRAF-EN regression coefficients in each cluster. Different variable coefficients have different effects on the response variable. The first variable entered into the model is the most effective one, and the subsequently entered variables have different effects. Finally, the coefficients of the variables that have no effect on the model are eliminated. This is true for lasso and group lasso, too.
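A possible way to compute the weighted criteria is sketched below. It assumes that each squared or absolute error is weighted by $w_i$ and normalized by the sum of the weights, which is one common definition of a weighted error and may differ in detail from the formulas used in the paper; the function name `weighted_errors` is hypothetical.

```python
import math

def weighted_errors(y_true, y_pred, weights):
    """Return (WMSE, WRMSE, WMAE) with per-observation weights w_i."""
    total = sum(weights)
    wmse = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)) / total
    wmae = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights)) / total
    return wmse, math.sqrt(wmse), wmae
```

With equal weights, the three values reduce to the ordinary MSE, RMSE, and MAE.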
According to Figure 5a,c, 58 trees out of 240 and 48 trees out of 260 are selected from the first and second clusters, respectively. In total, 106 trees out of 500 are selected over the two clusters in ECAPRAF-EN. The left vertical dashed line refers to $\lambda_{min}$ = 0.6891 with the minimum MSE = 8.4355 for the first cluster and to $\lambda_{min}$ = 6.6574 with the minimum MSE = 8.6256 for the second cluster. Moreover, the bar plot of the prediction error for 500 trees is presented in Figure 6, which shows the lowest value of WRMSE for ECAPRAF-EN.
According to Table 2, the results of WMSE, WRMSE, and WMAE are obtained from the simulation of a linear-regression model for 300, 500, 800, and 1000 trees, from which 99, 106, 113, and 118 trees, respectively, are selected by ECAPRAF-EN. The simulation results of the accuracy measurements, i.e., MSE, RMSE, and MAE, are reported in Table 3 for each cluster. As can be seen, similar to the results of WRMSE in Table 2 and Figure 6, the RMSE value of ECAPRAF-EN in each cluster is the lowest among the three algorithms. Elastic net also achieves the highest reduction compared to the other shrinkage methods. Therefore, it can be concluded that both within clusters and over the two clusters together, the ECAPRAF-EN algorithm has the lowest error and the highest accuracy among the proposed methods.
As shown in Table 3, the number of RF trees used within some clusters may be the same, but the number of trees reduced by shrinkage methods can be different. This leads to different prediction accuracies. The reason for this can be the amount of correlation and the homogeneity of the data within the clusters.
Generally, the trees produced by the RF have high variance and low bias, which can be associated with high error and low accuracy [36]. As a result, the use of shrinkage methods can help to improve the performance of RF. This improvement varies across methods. For example, in lasso regression, if the number of trees is larger than the number of observations, then the number of selected trees is less than the number of observations. There is no such defect in the elastic-net regression. Although it selects a larger number of trees, it has less error and greater accuracy than lasso. Concerning group lasso, although group selection removes whole groups of trees and can create a better model than other methods, it may also remove trees that would improve the model performance.

The Real Data Analysis
The performance of the random-forest, ECAPRAF-Lasso, ECAPRAF-EN, and ECAPRAF-GL algorithms is examined on two real datasets: Boston house prices and real-estate valuations. The basic information of the two datasets is shown in Table 4. The Boston house-price dataset contains 506 observations and 13 variables, and its response variable is the median price of owner-occupied houses. The data were first published by Harrison and Rubinfeld [37] and are publicly available through the MASS package in R (https://cran.r-project.org/package=MASS (accessed on 7 May 2021)). The second dataset contains real-estate valuations from the UCI machine-learning repository (accessed on 7 May 2021). The original owners of this dataset are Yeh and Hsu [38]. It consists of seven variables: the transaction date, house age, distance to the nearest MRT station, number of convenience stores in the living circle on foot, latitude, longitude, and the house price of unit area. Five variables are selected as the independent variables, and the house price of unit area is the dependent variable.
In the simulation section, the datasets are first clustered; then, in each cluster, 80% of the data are selected for the training set, and the remaining data are used as the testing set. In this study, the glmnet, gglasso, NbClust, and randomForest packages in R are used. The real-estate-valuation and Boston house-price datasets are divided into three and two clusters, respectively, with a different number of trees used in each cluster. In the RF algorithm, Ntree = 300, 500, 800, and 1000 are considered as the total number of trees across clusters, with a different number of trees in each cluster.
The 3D scatter plot in Figure 7 shows the number of clusters and the scattering of points in three dimensions. Each cluster is shown in a different color. As can be seen in Figure 7a, two clusters are obtained for the Boston house-price dataset, highlighted in blue and red. The first, red-colored cluster contains 369 samples, and the other, blue-colored cluster contains 137 samples. The dataset consists of 13 features, of which 3, namely the average number of rooms (rm), the pupil-teacher ratio by town (ptratio), and the weighted distances to employment centers (dis), are used to draw the diagram. For the real-estate-valuation dataset, three clusters colored blue, red, and green are identified, with 41 samples in the first cluster, 93 in the second cluster, and 280 in the third cluster. The diagram is shown in Figure 7b.
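The per-cluster 80/20 split described above can be sketched in a few lines. This is a minimal illustration rather than the authors' R code: the function name `split_within_clusters`, the random seed, and the rounding of the training-set size are assumptions, and the cluster labels are simply constructed to match the reported Boston cluster sizes (369 and 137).

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_within_clusters(
    labels: Sequence[int],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Dict[int, Tuple[List[int], List[int]]]:
    """Split row indices into train/test sets separately inside each cluster."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    splits: Dict[int, Tuple[List[int], List[int]]] = {}
    for label, indices in by_cluster.items():
        rng.shuffle(indices)  # random assignment within the cluster
        cut = int(round(train_fraction * len(indices)))  # assumed rounding rule
        splits[label] = (sorted(indices[:cut]), sorted(indices[cut:]))
    return splits


# Cluster labels built to match the reported Boston cluster sizes.
labels = [0] * 369 + [1] * 137
splits = split_within_clusters(labels)
for label, (train, test) in splits.items():
    print(f"cluster {label}: {len(train)} train, {len(test)} test")
```

Splitting inside each cluster, rather than over the whole dataset, keeps every cluster represented in both the training and the testing sets.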
The predicted results of the proposed technique (ECAPRAF) are shown for the real-estate-valuation dataset in Table 5 and for the Boston house-price dataset in Table 6. The WMSE, WRMSE, and WMAE are calculated for different numbers of trees. For the Boston house-price dataset, the WRMSE, WMAE, and WMSE obtained for RF are 3.4435, 2.1903, and 11.8579, respectively. In contrast, the values obtained with ECAPRAF-EN are 3.2218, 2.1515, and 10.3801, respectively. As can be seen, the proposed method yields a reduction of approximately 12% in WMSE compared to RF. There is also a significant reduction for the real-estate-valuation dataset, as shown in Table 5. On these datasets, ECAPRAF gives better results than the traditional RF, and ECAPRAF-EN yields better results than the other shrinkage methods. Although ECAPRAF-EN selects more trees than ECAPRAF-Lasso, it has a lower WMSE, WRMSE, and WMAE. The other ECAPRAF algorithms also perform better than RF. Therefore, it is concluded that the ECAPRAF-EN algorithm improves the performance of RF by reducing the number of trees and also outperforms the other algorithms. The performance of the proposed technique and RF in terms of WRMSE is shown in Figure 8.
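The size of the improvement on the Boston house-price dataset can be checked directly from the values quoted above. The helper `relative_reduction` is a hypothetical name introduced only for this calculation; the numbers are the ones reported in the text.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Values reported in the text for the Boston house-price dataset.
rf = {"WRMSE": 3.4435, "WMAE": 2.1903, "WMSE": 11.8579}
ecapraf_en = {"WRMSE": 3.2218, "WMAE": 2.1515, "WMSE": 10.3801}

reductions = {
    name: relative_reduction(rf[name], ecapraf_en[name]) for name in rf
}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}% lower than RF")
```

On these figures the reduction is about 12.5% for WMSE, about 6.4% for WRMSE, and about 1.8% for WMAE, so the reduction of roughly 12% corresponds to the squared-error criterion.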

Discussions
In this paper, the ECAPRAF model is proposed to increase the accuracy of the RF algorithm and reduce its computational time. In the proposed model, the k-means algorithm clusters the data, and RF extracts the prediction trees within each cluster. The generated trees are then passed to the shrinkage methods, namely lasso, elastic net (EN), and group lasso, which reduce the number of random-forest trees. Finally, the remaining trees are ensembled.
Our combined model achieved a higher accuracy than the published study [6], which used only lasso to reduce the number of RF trees. In the simulation of the linear model with 500 trees, the MSE of our proposed model after clustering reached 8.56, compared to 21.64 in [6]. In the real datasets, i.e., Boston house price and real-estate valuation, the MSE reached 10.93 and 32.94 in our model, respectively, compared to 12.84 and 61.97 in [6]. Moreover, using the elastic-net method instead of lasso resulted in better performance.
Using the RF trees as the input of the EN method to reduce the number of trees, ECAPRAF-EN achieved a 12% error reduction, which was the highest reduction among the compared methods. This means that the proposed model has higher accuracy and smaller error than the traditional methods, which indicates that the proposed model is effective and performs better.
Although ECAPRAF-EN improved the accuracy of prediction compared with the other methods, in future work we will study deep-learning methods and different ensemble methods in order to further improve the model performance and prediction accuracy.

Conclusions and Future Work
In this research, we proposed a combination of machine-learning algorithms and statistical methods. This method consisted of four parts. In the first part, k-means clustering was used to partition the data into k homogeneous clusters. In the second part, RF was trained on the data within each cluster to produce the prediction trees, which act as weak learners. In the third part, shrinkage methods, namely lasso, elastic net, and group lasso, were used to decrease the number of RF trees. They increased efficiency, reduced error, and improved the model performance. Finally, the remaining trees were ensembled to reduce variance and error. To evaluate the efficiency of the designed method, the ECAPRAF model was applied to both real and simulated data. The results were analyzed using the MSE, RMSE, and MAE criteria for each cluster and the WMSE, WRMSE, and WMAE criteria for the weighted sum of clusters. The simulation results showed that, in all four settings, ECAPRAF-EN, ECAPRAF-Lasso, and ECAPRAF-GL reduced the error by about 12%, 11.5%, and 11% compared to RF, respectively. The real datasets showed reductions of about 12%, 7%, and 11%, respectively. Among the three methods, ECAPRAF-EN provided better results while retaining more trees within clusters. Concerning the number of trees, ECAPRAF-EN with 128 trees performed better than ECAPRAF-Lasso with 100 and ECAPRAF-GL with 105. Overall, ECAPRAF-EN has the best performance among the shrinkage methods. Therefore, it can be concluded from this study that our proposed model, which sought to increase the accuracy and improve the performance of traditional RF with clustering and shrinkage methods, is efficient.
In future work, we will use the CNN algorithm instead of random forest. This will be done by replacing the random-forest trees with multiple parallel CNNs. In other words, we will enhance the CNN network by using shrinkage methods.

Figure 1. Flowchart of ECAPRAF. A dataset undergoes clustering in order to identify homogeneous subsets of data and assign them to similar groups. Then, trees are extracted as predictors by random forest and reduced by shrinkage methods. Eliminating trees through the shrinkage methods leads to a decrease in error and an increase in model accuracy. In the last part, the remaining trees are ensembled to improve the performance of the model.

Figure 2. The basic structure of random forest.

Figure 3. Detailed structure of the ECAPRAF model. First, the dataset is clustered into K clusters to identify homogeneous subsets of data and assign them to similar groups. Inside each cluster, the data are used to train RF trees for a more accurate prediction. The trees are reduced by shrinkage methods in order to reduce computational time. Finally, all clusters are ensembled.

Figure 4. 3D scatter plot showing the number of clusters and the relationship between variables, with different colors, on three axes. The dataset consists of four variables, of which three (x_1, x_2, x_3) are plotted.

Figure 5. (a,b) The simulation results of the algorithm in the first cluster. (c,d) The simulation results of the algorithm in the second cluster. (a,c) The left and right dashed lines show λ_min and λ_1se, respectively. λ_min is the lambda value with the lowest error, and λ_1se is the lambda value based on the one-standard-error rule. The upper part of the plot shows the number of non-zero coefficients in each cluster for a given log(λ). (b,d) The number of effective variables in the model.

Figure 7. 3D scatter plots showing the number of clusters and the relationship between variables, with different colors, on three axes. (a) The dataset consists of 13 features, of which 3 are plotted: the average number of rooms (rm), the pupil-teacher ratio by town (ptratio), and the weighted distances to employment centers (dis). (b) The dataset consists of 5 features, of which 3 are plotted: longitude, latitude, and house age.

Figure 8. Bar plot of the difference in proposed methods based on 500 trees. The RMSE value of ECAPRAF-EN is the lowest value among the three algorithms. Elastic net also has the highest reduction compared to other shrinkage methods.

Table 1. Parameter setting for simulation study.

Table 3. The results of the proposed method for the linear model in the clusters.

Figure 6. Bar plot showing the difference among the proposed methods based on 500 trees. The RMSE value of ECAPRAF-EN is the lowest value among the three algorithms. Elastic net also has the highest reduction compared to other shrinkage methods.

Table 5. The results of the proposed method for the real-estate-valuation dataset.

Table 6. The results of the proposed method for the Boston house-price dataset.