Mitigating the Multicollinearity Problem and Its Machine Learning Approach: A Review

Technologies have driven big data collection across many fields, such as genomics and business intelligence. This results in a significant increase in variables and data points (observations) collected and stored. Although this presents opportunities to better model the relationship between predictors and the response variables, this also causes serious problems during data analysis, one of which is the multicollinearity problem. The two main approaches used to mitigate multicollinearity are variable selection methods and modified estimator methods. However, variable selection methods may negate efforts to collect more data as new data may eventually be dropped from modeling, while recent studies suggest that optimization approaches via machine learning handle data with multicollinearity better than statistical estimators. Therefore, this study details the chronological developments to mitigate the effects of multicollinearity and up-to-date recommendations to better mitigate multicollinearity.


Introduction
Multicollinearity is a phenomenon that can occur when running a multiple regression model. In this age of big data, multicollinearity can also be present in the field of artificial intelligence and machine learning. There is a lack of understanding of the different methods for mitigating the effects of multicollinearity among people in domains outside of statistics [1]. This paper will discuss the development of the methods chronologically and compile the latest methods.
Forecasting in finance deals with a large number of variables, such as macroeconomic data, microeconomic data, earnings reports, and technical indicators. Multicollinearity is a common problem in finance because the dependencies between variables can vary over time and change due to economic events. Past literature tried to remove collinear data to reduce the effects of multicollinearity. This was done through stepwise regression, which eventually arrives at a model with a low root mean square error (RMSE). The computational difficulty of this approach led to the development of many selection criteria for choosing models. A breakthrough

What Is Multicollinearity?
Multicollinearity is a condition where there is an approximately linear relationship between two or more independent variables. Consider the multiple linear regression model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e,$$

where $y$ is the dependent variable, $x_1, \ldots, x_p$ are the explanatory variables, $\beta_0$ is the constant term, $\beta_1, \ldots, \beta_p$ are the coefficients of the explanatory variables, and $e$ is the error term. The error term is the difference between the observed and the estimated values. It is assumed to be normally distributed with a mean of 0 and variance $\sigma^2$. In the presence of multicollinearity, $x_1$ may be linearly dependent on another explanatory variable such as $x_2$. The resulting model would be unreliable. The effects and problems are discussed in the following section.
For example, when technical indicators are used in stock analysis, there will be a multicollinearity issue if the indicators measure the same type of information, such as momentum [3]. In such a case, the different indicators are all derived from the same series of closing prices. In the context of the stock market, data are handled differently from time-series data in other fields. According to the authors of [4], this is due to a few key reasons. First, the goal of compiling stock market data is to maximize profit rather than to reduce prediction error. Second, stock market data are highly time-variant, which means the output depends on the moment of the input. Third, they depend on indeterminate events, which means that the event that causes the response is not fixed.

Effects of Multicollinearity
According to [5], there are four main symptoms of multicollinearity. The first one is a large standard error of the coefficients. Next, the sign of a variable coefficient can be different from the theory. The explanation of the variable's effect on the output will be wrong or misleading. In addition, there will be a high correlation between the predictor variable and outcome, but the corresponding parameter is not statistically significant. The last symptom is that some correlation coefficients among predictor variables are large in relation to the explanatory power or R-Squared of the overall equation.
These are merely symptoms and do not guarantee the presence of multicollinearity. Multicollinearity causes two major problems. First, estimates are unstable due to the interdependence of the variables, and the standard errors of the regression coefficients are large. This makes the estimates unreliable and therefore decreases their precision [6]. Second, because two or more variables have linear relationships, the marginal impact of a single variable is hard to measure. The model will have poor generalization ability and overfit the data, which means that it will perform poorly on data it has never seen.

Ways to Measure Multicollinearity
Previous literature describes four measures of multicollinearity. The first detector of multicollinearity is pairwise correlation, examined using a correlation matrix. According to [7], a bivariate correlation of 0.8 or 0.9 is commonly used as a cut-off to indicate a high correlation between two regressors. However, the problem with this method is that pairwise correlation and multicollinearity are not the same thing, so a high or low pairwise correlation does not by itself establish the presence or absence of multicollinearity. The most widely used indicator of multicollinearity is the Variance Inflation Factor (VIF) or Tolerance (TOL) [8]. The VIF is defined as

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination for the regression of $x_j$ on the remaining variables. The VIF is the reciprocal of TOL. There is no formal value of VIF that determines the presence of multicollinearity, but a value of 10 and above often indicates multicollinearity [9].
Another method of measuring multicollinearity uses eigenvalues, which come from principal component analysis (PCA). A smaller eigenvalue indicates a larger probability of the presence of multicollinearity. The fourth method is the Condition Index (CI), which is based on the eigenvalues: the CI is the square root of the ratio between the maximum eigenvalue and each eigenvalue. According to [10], a CI between 10 and 30 indicates moderate multicollinearity, while a CI above 30 indicates severe multicollinearity.
VIF and CI are two commonly used diagnostics for determining the severity of multicollinearity in a dataset before applying methods to address it. It is important to note that the effectiveness of a treatment in reducing multicollinearity is usually determined by comparing the root mean square error or out-of-sample forecast before and after the treatment [11].
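The VIF and CI diagnostics described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `vif` and `condition_indices` and the simulated data are assumptions made for the example, and the condition indices are computed here from the eigenvalues of the predictor correlation matrix, which is only one of several scaling conventions in use.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_indices(X):
    """sqrt(max eigenvalue / each eigenvalue) of the correlation matrix (assumed scaling)."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return np.sqrt(eigvals.max() / eigvals)

# Simulated example (hypothetical data): x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
ci = condition_indices(X)
print("VIF:", np.round(vifs, 1))
print("CI :", np.round(ci, 1))
```

In this setup the two near-duplicate predictors should show VIF values far above the rule-of-thumb threshold of 10, while the unrelated predictor stays close to 1.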

Reducing the Effects of Multicollinearity
Collecting more data is one of the simplest ways to reduce the effects of multicollinearity, because collinearity is more of a data problem than a model specification problem. However, this is not always feasible, especially when research is undertaken using convenience sampling [1]. There is a cost associated with collecting more data. Furthermore, the quality of the data collected might be compromised. Methods that mitigate multicollinearity by reducing the variances of the regression coefficient estimates can be categorized into two groups: variable selection and modified estimators. Both can be applied at the same time. The details of the variable selection and modified estimator methods are explained in the following sub-topics, followed by the machine learning approaches.

Variable Selection Methods
Researchers are mainly concerned about multicollinearity problems when forecasting with a linear regression model. Generally, researchers try to mitigate the effects of multicollinearity by using variable selection techniques so that a more reliable estimate of the parameter can be obtained [12]. These are commonly heuristic algorithms and rely on using indicators. The method can be completed by combining or eliminating variables. However, caution must be taken not to compromise the theoretical model to reduce multicollinearity. One of the earliest methods was stepwise regression. There are two basic processes, namely forward selection and backward elimination [13]. The forward selection method starts with an empty model and adds variables one at a time, while the backward elimination method starts with the full model with all available variables and drops them one by one. In each stage, they select the variable with the highest decrease in the residual sum of squares for forward selection or the lowest increase in the residual sum of squares for backward elimination.
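The forward selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `rss` and `forward_selection`, the fixed stopping rule `max_vars`, and the simulated data are all hypothetical, and a practical implementation would usually stop based on a selection criterion rather than a fixed count.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_selection(X, y, max_vars):
    """Start from the empty model; at each step add the variable that lowers RSS the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: only columns 0 and 3 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

chosen = forward_selection(X, y, max_vars=2)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from all columns and repeatedly drop the one whose removal increases the RSS the least.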
However, there are some drawbacks to stepwise regression. According to the author of [14], it does not necessarily yield the best model in terms of the residual sum of squares because of the order in which the variables are added. This is especially true in the presence of multicollinearity. It is also not clear which of the two methods of stepwise regression is better. Furthermore, it assumes there is a single best equation when there can be equations with different variables that are equally good. Another problem of the selection criterion is the computational effort required [15]. There are $2^k$ possible combinations for $k$ independent variables, so the amount of computation needed increases exponentially with the total number of independent variables.
To reduce computation time, the authors of [16] developed a more comprehensive method to fit the equation to the data. It uses a fractional factorial design with the statistical criterion Cp to avoid computing all the possible equations. It also works better on data with multicollinearity, as the effectiveness of a variable is evaluated by its presence or absence in an equation. The Cp criterion was developed by the author of [17]. It provides a way to graphically compare different equations. The selection criterion Cp is as follows:

$$C_p = \frac{RSS_p}{\hat{\sigma}^2} - (n - 2p),$$

where $p$ is the number of variables, $RSS_p$ is the residual sum of squares for the regression being considered, $n$ is the number of observations, and $\hat{\sigma}^2$ is an estimate of $\sigma^2$; it is frequently the residual mean square from the complete regression. The model with a lower Cp is better. Later, the authors of [18] proposed a more general selection criterion, Sp, which has been shown to outperform the Cp criterion. Methods that are based on the least squares estimator, such as the Cp criterion, suffer in the presence of outliers and when the error variable deviates from normality. The Sp criterion solves this problem and can be used with any estimator of β without a need for modification. The Sp criterion is defined in terms of $k$ and $p$, the numbers of parameters of the full and subset models, respectively.
Information criteria provide an attractive way to select models. Other criteria that are often used include the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [19]. According to [20], the difference between AIC and BIC is that BIC is consistent in selecting the model when the true model is under consideration, whereas AIC aims to minimize risk functions when the true model is not one of the candidates. The choice of criterion depends on the researcher, and it is suggested that AIC and BIC be used together. Table 1 provides a summary of each stepwise feature selection and quality criterion.
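To make the criteria above concrete, the sketch below scores every non-empty subset of a small predictor set with Mallows' Cp, AIC, and BIC. It rests on several assumptions that are not taken from the reviewed papers: `p` is counted as the number of fitted parameters including the intercept, the variance estimate is the residual mean square of the full model, and AIC and BIC use the common Gaussian forms $n \ln(RSS/n) + 2p$ and $n \ln(RSS/n) + p \ln n$ up to an additive constant. The function names and data are hypothetical. The Sp criterion is not implemented here.

```python
import itertools
import numpy as np

def fit_rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def score_subsets(X, y):
    """Return {subset: {"cp", "aic", "bic"}} for all 2^k - 1 non-empty subsets."""
    n, k = X.shape
    # Assumption: sigma^2 is estimated by the residual mean square of the full model.
    sigma2 = fit_rss(X, y, range(k)) / (n - k - 1)
    scores = {}
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            r = fit_rss(X, y, cols)
            p = size + 1  # parameters including the intercept (assumed convention)
            scores[cols] = {
                "cp": r / sigma2 - (n - 2 * p),
                "aic": n * np.log(r / n) + 2 * p,
                "bic": n * np.log(r / n) + p * np.log(n),
            }
    return scores

# Hypothetical data: only columns 0 and 3 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

scores = score_subsets(X, y)
best_by_bic = min(scores, key=lambda c: scores[c]["bic"])
print(len(scores), best_by_bic)
```

Even with five predictors the loop already fits 31 models, which illustrates why exhaustive search becomes impractical as the number of candidate variables grows.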
The authors of [5] proposed the use of principal component analysis (PCA) as a solution to multicollinearity among predictor variables in a regression model. It is a statistical method that transforms the variables into new uncorrelated variables called principal components and reduces the number of predictive variables. Regression analysis is then done using the principal components. Because the principal components are uncorrelated with one another, a regression on them does not suffer from multicollinearity. The principal components are ranked according to the magnitude of their variance, so principal component 1 is responsible for a larger amount of variation than principal component 2. Therefore, PCA is useful for dimensionality reduction. Principal components with an eigenvalue near zero can be eliminated. This way the model is sparse while not dropping variables that might contain useful information.
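A minimal sketch of regression on principal components is given below. The function name `pcr_fit`, the choice to decompose the covariance matrix of the centered predictors, and the simulated data are assumptions made for illustration; they are not taken from [5].

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the leading principal components and map back to the original predictors."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :n_components]              # loadings of the retained components
    scores = Xc @ V                            # uncorrelated component scores
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    beta = V @ gamma                           # coefficients on the original variables
    intercept = y_mean - x_mean @ beta
    return beta, intercept, eigvals

# Hypothetical data: x2 is almost a copy of x1, so one eigenvalue is close to zero.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + 0.1 * rng.normal(size=n)

beta_2, intercept_2, eigvals = pcr_fit(X, y, n_components=2)   # drop the near-zero component
beta_full, _, _ = pcr_fit(X, y, n_components=3)                # keeps everything
ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0][1:]
print(np.round(eigvals, 4), np.round(beta_2, 2), np.round(ols, 2))
```

Keeping all components reproduces ordinary least squares, so any stabilizing effect comes entirely from discarding the component whose eigenvalue is near zero.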
The Partial Least Squares (PLS) method was developed by the author of [22]. It is described as a better alternative to multiple linear regression and PCA because it is more robust: the model parameters do not change by much when new samples are used. PLS is like PCA in that it is also a dimension reduction technique. The difference is that it captures the characteristics of both X and Y, instead of just X as PCA does. The PLS method works by iteratively extracting factors from X and Y while maximizing the covariance between the extracted factors. PLS derives its usefulness from its ability to analyze noisy data with multicollinearity, because its underlying assumptions are much more realistic than those of traditional multiple linear regression. The authors of [23,24] compared the PLS method with the lasso and stepwise methods and found that it performed better.
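The iterative factor extraction described above can be sketched for a single response with the classical NIPALS-style PLS1 recursion. This is a minimal sketch, not the formulation of [22]: the function name `pls1_fit`, the deflation of both X and y, and the simulated data are assumptions made for the example.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS regression for one response: extract factors that maximize covariance with y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                         # factor scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        qk = yk @ t / tt                   # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - qk * t                   # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q) # coefficients on the original predictors
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Hypothetical data with two nearly collinear predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + 0.1 * rng.normal(size=n)

beta_2, _ = pls1_fit(X, y, n_components=2)
beta_full, _ = pls1_fit(X, y, n_components=3)
ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0][1:]
print(np.round(beta_2, 2), np.round(ols, 2))
```

Unlike the principal component construction, the weight vector `w` here depends on `y`, which is what makes the extracted factors supervised.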
A few studies have made comparisons among the techniques. The authors of [25] discussed and compared PCA and PLS, as both are dimension reduction methodologies. Both methods are used to convert highly correlated variables into independent variables and to reduce the number of variables. The methodology of PCA does not consider the relationship between the predictor variables and the response variable, while PLS does. Therefore, PCA is an unsupervised dimension reduction technique and PLS is a supervised one. They also found that the predictive power of the principal components does not line up with their order. For example, principal component 1 may explain the change in the response variable less than principal component 2. PLS is more efficient than PCA in this regard, as it is a supervised technique: PLS components are extracted based on significance and predictive power. The author of [26] compared partial least squares regression (PLSR), ridge regression (RR), and principal component regression (PCR) in a simulation study that used MSE to compare the methods. They found that when the number of independent variables increases, PLSR is the best. If the number of observations and the degree of multicollinearity are large enough while the number of independent variables is small, RR has the smallest MSE.
A recent application of PLS is seen in the chaos phase modulation technique for underwater acoustic communication. The authors of [27] adopted PLS regression in chaos phase modulation communication to overcome the multicollinearity effect. They described PLS as a machine learning method that uses the training and testing processes simultaneously. The study found that this method effectively improves the communication signals. The authors compared it with two algorithms, the Time Reversal Demodulator and a 3-layer Back Propagation Neural Network, which do not perform feature analysis and relationship analysis. The comparison shows that PLS regression has the best performance.
Multigene genetic programming was first developed by the authors of [28,29], who used this method to automate predictor selection to alleviate multicollinearity problems. The authors of [30] described a genetic algorithm-based machine learning approach to perform variable selection. The genetic algorithm is a general optimization algorithm based on concepts such as evolution and survival of the fittest. The procedure is initialized by creating a population of several individuals. Each individual is a different model, and the genes of a model are its features. An objective function is used to determine the fitness of the models. In the next generation (iteration), the best models are selected and their genes "cross over": some features of the parent models are combined. Mutation may also occur with some predetermined probability, in which case the inclusion of a feature is reversed. According to [30], this machine learning concept should be combined with a derivative-based search algorithm to form a hybrid model. This is because genetic algorithms are very good at finding generally good solutions but, unlike derivative-based search algorithms, are not good at finding local minima. The derivative-based search can be performed after a certain number of iterations of the genetic algorithm. Iterations are continued until no improvement in the fitness of the model is seen.
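The population, crossover, and mutation steps described above can be illustrated with a small genetic algorithm for feature selection. This sketch is not the algorithm of [30]: the BIC-based fitness, single-point crossover, elitism, fixed number of generations, and all parameter values are assumptions chosen to keep the example short, and the derivative-based refinement of the hybrid model is omitted.

```python
import numpy as np

def bic(X, y, mask):
    """Fitness (lower is better): Gaussian BIC of an OLS fit on the masked columns."""
    n = len(y)
    cols = [X[:, j] for j in range(X.shape[1]) if mask[j]]
    A = np.column_stack([np.ones(n)] + cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return n * np.log((r @ r) / n) + A.shape[1] * np.log(n)

def genetic_select(X, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    pop = rng.random((pop_size, k)) < 0.5            # each row: one model's feature mask
    history = []
    for _ in range(generations):
        scores = np.array([bic(X, y, ind) for ind in pop])
        order = np.argsort(scores)
        pop = pop[order]
        history.append(scores[order[0]])
        parents = pop[: pop_size // 2]               # survival of the fittest half
        children = [pop[0].copy()]                   # elitism: keep the current best
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, k)                 # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(k) < p_mut             # mutation reverses a feature's inclusion
            children.append(np.where(flip, ~child, child))
        pop = np.array(children)
    scores = np.array([bic(X, y, ind) for ind in pop])
    history.append(scores.min())
    return pop[np.argmin(scores)], history

# Hypothetical data: only columns 0 and 3 drive the response.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

best, history = genetic_select(X, y)
print(best.astype(int), round(history[-1], 1))
```

Because the best individual is always carried over, the recorded best fitness cannot get worse from one generation to the next; the text's stopping rule (no further improvement) could replace the fixed generation count.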
The authors of [31] proposed a quadratic programming approach to feature selection because previous methods do not consider the configuration of the dataset and are therefore not problem dependent. The aim of using quadratic programming is to maximize the number of relevant variables and reduce the number of similar variables. The criterion $Q$ that represents the quality of a subset of features $a$ is presented in quadratic form:

$$Q(a) = a^T Q a - b^T a,$$

where $Q \in \mathbb{R}^{n \times n}$ is a matrix of pairwise predictor similarities and $b \in \mathbb{R}^n$ is a vector of the relevance of each predictor to the target vector. The authors suggested that the similarity between the features $x_i$ and $x_j$, and between $x_i$ and $y$, can be measured using Pearson's correlation coefficient [32] or the concept of mutual information [33]. However, these two measures do not directly capture feature relevance. To account for this, the authors used a standard t-test to estimate the normalized significance of the features. The proposed method was reported to outperform other feature selection methods, such as Stepwise, Ridge, Lasso, Elastic Net, LARS, and the genetic algorithm.
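The quadratic criterion above can be evaluated directly for small problems. The sketch below is a simplification and not the solver used in [31]: it fixes the subset size `m` and enumerates binary indicator vectors instead of solving a relaxed quadratic program, and it uses absolute Pearson correlations for both the similarity matrix and the relevance vector. The function names and data are hypothetical.

```python
import itertools
import numpy as np

def qp_criterion(a, Q, b):
    """Q(a) = a^T Q a - b^T a: similarity among chosen features minus their relevance."""
    return float(a @ Q @ a - b @ a)

def best_subset_of_size(X, y, m):
    k = X.shape[1]
    Q = np.abs(np.corrcoef(X, rowvar=False))                         # pairwise similarity
    b = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(k)])    # relevance to y
    best_cols, best_val = None, np.inf
    for cols in itertools.combinations(range(k), m):
        a = np.zeros(k)
        a[list(cols)] = 1.0
        val = qp_criterion(a, Q, b)
        if val < best_val:
            best_cols, best_val = cols, val
    return best_cols, best_val, Q, b

# Hypothetical data: x2 nearly duplicates x1; x1 and x3 drive the response.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + 0.3 * rng.normal(size=n)

best_cols, best_val, Q, b = best_subset_of_size(X, y, m=2)
print(best_cols, round(best_val, 3))
```

With the subset size fixed, the criterion penalizes picking both near-duplicate predictors, so the selected pair should combine the unrelated predictor with one of the two collinear ones.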
The authors of [34] presented the maximum relevance-minimum multicollinearity (MRmMC) method to perform variable selection and ranking. Its approach also focuses on relevancy and redundancy. Relevancy refers to the relationship between features and the target variable, while redundancy is the multicollinearity between features. The main advantage of this method over others is that it does not require any parameter tuning and is relatively easy to implement. Relevancy is measured with a correlation coefficient and redundancy with a squared multiple correlation coefficient. A measure J that combines relevancy and multicollinearity is developed from $r^2$, the squared correlation coefficient between feature f and target c, and sc, the squared multiple correlation coefficient between feature f and its orthogonally transformed variable q. The first feature is selected using the optimization criterion V, and the following features are selected based on criterion J using a forward stepwise method. Although non-exhaustive, it is a very competent method for feature selection and dimension reduction.
The authors of [35] suggested that the mixed integer optimization (MIO) based approach to selecting variables has received increased attention with developments in algorithms and hardware. They developed a mixed integer quadratic optimization (MIQO) formulation to eliminate multicollinearity in linear regression models. It adopts VIF as an indicator for detecting multicollinearity. Subset selection is performed subject to an upper bound constraint on the VIF of each variable. It achieved a higher R-Squared than heuristic-based methods such as stepwise selection. The solution is also computationally tractable and simpler to implement than the cutting plane algorithm.
The authors of [36] proposed a profiled independence screening (PIS) method for screening variables in high-dimensional data with highly correlated predictors. It is built upon sure independence screening (SIS) [37]. Many variable selection methods developed before SIS do not work well on extremely high-dimensional data where the predictors vastly outnumber the sample size. However, SIS may break down when the predictors are highly correlated, which motivated PIS. A factor profiling operator

$$Q(Z_I) = I_n - Z_I \left( Z_I^T Z_I \right)^{-1} Z_I^T$$

is introduced to eliminate the correlation between predictors, where $Z_I \in \mathbb{R}^{n \times d}$ is the latent factor matrix of $X$ and $d$ is the number of latent factors. Factor profiling is as follows:

$$Q(Z_I)\, y = Q(Z_I)\, X \beta + Q(Z_I)\, \varepsilon,$$

where $Q(Z_I)y$ is the profiled response variable and the columns of $Q(Z_I)X$ are the profiled predictors. SIS is then applied to the profiled data. However, PIS may be misleading in a spiked population model. Preconditioned profiled independence screening (PPIS) solves this by using preconditioning and factor profiling. Two real data analyses show that PPIS has good performance.
Outlier detection is also a viable method for variable selection. Recently, projection pursuit was used to perform an outlier detection-based feature selection [37]. Projection pursuit aims to look for the most interesting linear projections, and the author optimized it to find outliers. The method was found to be effective in improving classification tasks. However, it performs poorly when most features are highly correlated or when features are binary.
Table 2 provides a summary of the findings for variable selection approaches. The variable selection methods aim to reduce the number of variables to the few that are the most relevant. This may reduce the information gain from having more data to work with. Furthermore, the modern optimization methods depend on subjectively determined indicators of relevance and similarity. This can be seen from [11], where the authors suggested other measures of multicollinearity for future research. It is therefore difficult to suggest which method is better without directly comparing performance on the same dataset. Better performance can also be due to the specific problem tested.
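The factor profiling operator used by PIS can be illustrated numerically. The sketch below is built on assumptions that are not taken from [36]: the latent factor matrix is estimated from the leading left singular vectors of X, a single latent factor is simulated, and the screening step simply ranks profiled predictors by the absolute inner product with the profiled response.

```python
import numpy as np

def profiling_operator(Z):
    """Q(Z) = I_n - Z (Z^T Z)^{-1} Z^T: projects onto the complement of the factor space."""
    n = Z.shape[0]
    return np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

def mean_abs_offdiag_corr(X):
    C = np.corrcoef(X, rowvar=False)
    mask = ~np.eye(C.shape[0], dtype=bool)
    return float(np.abs(C[mask]).mean())

# Hypothetical data: one shared latent factor makes all predictors correlated.
rng = np.random.default_rng(0)
n, p, d = 300, 8, 1
factor = rng.normal(size=(n, d))
X = factor @ np.ones((d, p)) + rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# Assumption: estimate the latent factors by the top-d singular directions of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z_hat = U[:, :d] * s[:d]

Q = profiling_operator(Z_hat)
X_prof, y_prof = Q @ X, Q @ y

raw_corr = mean_abs_offdiag_corr(X)
prof_corr = mean_abs_offdiag_corr(X_prof)
utilities = np.abs(X_prof.T @ y_prof)      # marginal screening on the profiled data
top = int(np.argmax(utilities))
print(round(raw_corr, 2), round(prof_corr, 2), top)
```

In this simulation the average absolute correlation among predictors should drop after profiling, and the marginal screening step should rank the truly active predictor first.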

Mathematics 2022, 10, x FOR PEER REVIEW

Modified Estimators Methods
Modified estimators are another approach: they use biased, shrunken estimators in exchange for lower variance and thus reduce overfitting [12]. The advantage is that the theoretical model is not compromised by the dropping of variables. The disadvantage is that the estimators are now biased. The best-known method is ridge regression, developed by the author of [41]. This method adds a penalty term, the squared magnitude of the coefficients β, to the loss function. The general equation of ridge regression is as follows:

$$\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

The main issue with ridge regression is how to find the ridge parameter λ. If λ is equal to zero, the estimate equals the ordinary least squares estimate. However, if λ is too big, the model will underfit. λ is selected by looking for the least increase in the root mean square error (RMSE) together with an appropriate decrease in the ridge variance inflation factors for each variable. A ridge trace, which is a plot of the coefficients β versus λ, is used to assist in this: it is used to pick the smallest λ at which the coefficients start to level off. Alternatively, a validation dataset can be used to find the λ that minimizes the validation SSE. Another option is to identify λ such that the reduction in the variance term of the slope parameter is larger than the increase in its squared bias. The authors of [42] reviewed estimation methods for λ and suggested new methods. A more recent paper proposed a Bayesian approach to solving the problem of finding the ridge parameter [43]. Simulation results showed that the approach is more robust and provides more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of solving the problem of finding the ridge parameter. Their generalized cross-validation approach is able to find the global minimum.
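The ridge estimate and two of the selection aids mentioned above (the ridge trace and a validation set) can be sketched as follows. This is a minimal sketch under stated assumptions: predictors are centered but not standardized, the closed form $(X^T X + \lambda I)^{-1} X^T y$ is applied to the centered data so that the intercept is not penalized, and the grid of λ values, the train/validation split, and the function names are hypothetical.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge slopes on centered data; lam = 0 gives ordinary least squares."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Hypothetical data with two nearly collinear predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + 0.5 * rng.normal(size=n)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

# Ridge trace: coefficients as a function of lambda (one row per lambda).
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
trace = np.array([ridge_coefficients(X_tr, y_tr, lam) for lam in lambdas])
norms = np.linalg.norm(trace, axis=1)

def validation_sse(lam):
    """Sum of squared errors on the held-out validation rows."""
    beta = ridge_coefficients(X_tr, y_tr, lam)
    intercept = y_tr.mean() - X_tr.mean(axis=0) @ beta
    resid = y_va - (intercept + X_va @ beta)
    return float(resid @ resid)

best_lam = min(lambdas, key=validation_sse)
ols = np.linalg.lstsq(np.column_stack([np.ones(150), X_tr]), y_tr, rcond=None)[0][1:]
print(np.round(trace, 2))
print("lambda chosen on validation data:", best_lam)
```

Each row of `trace` corresponds to one value of λ, so plotting its columns against `lambdas` would give the ridge trace; the coefficient norm shrinks as λ grows.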
More estimators have been developed from the ridge estimator. The authors of [45] used a jackknife procedure to reduce the significant amount of bias of estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure to find the parameter λ. This is because the estimate is a linear function of λ. The author of [47] proposed the Liu-type estimator. They found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-type estimator has a lower MSE when compared to ridge regression and fully addresses severe multicollinearity.
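The linearity of the Liu estimator in its parameter can be checked numerically. The sketch below assumes the commonly cited form $\hat{\beta}_d = (X^T X + I)^{-1}(X^T y + d\,\hat{\beta}_{OLS})$ with $0 \le d \le 1$, applied to centered data; the symbol `d` for the shrinkage parameter, the function name, and the simulated data are choices made for this example and are not taken from the text above.

```python
import numpy as np

def liu_estimator(X, y, d):
    """Liu-style estimate (assumed form): (X'X + I)^{-1} (X'y + d * beta_OLS) on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    XtX = Xc.T @ Xc
    beta_ols = np.linalg.solve(XtX, Xc.T @ yc)
    return np.linalg.solve(XtX + np.eye(p), Xc.T @ yc + d * beta_ols)

# Hypothetical data with two nearly collinear predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + 0.5 * rng.normal(size=n)

beta_0 = liu_estimator(X, y, 0.0)
beta_half = liu_estimator(X, y, 0.5)
beta_1 = liu_estimator(X, y, 1.0)    # d = 1 recovers ordinary least squares
ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0][1:]
print(np.round(beta_0, 3), np.round(beta_half, 3), np.round(beta_1, 3))
```

Because the estimate is linear in `d`, the value at `d = 0.5` is exactly the midpoint of the estimates at `d = 0` and `d = 1`, which is what makes the parameter easy to tune.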
Since then, variations on ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that is a generalization of the Liu-type estimator for a linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for generalized linear regression models, such as the Poisson regression model, logistic regression model, and negative binomial regression model. Therefore, they introduced a two-parameter shrinkage estimator for negative binomial models. It is a combination of the ridge estimator and Liu estimator. The authors of [50] modified the Jackknifed ridge regression estimator to form a Modified Jackknifed Poisson ridge regression estimator. The author of [2] reviewed the biased estimators in the Poisson regression model in the presence of multicollinearity. The regular maximum likelihood method in estimating regression coefficient is not reliable in the presence of multicollinearity. They compared the performance of four estimators in addition to the widely used ridge estimator and found that Liu-type estimators have superior performance over other methods in the Poisson regression model.

mators Methods
stimators is another approach that use biased and shrunken estimators in er variance and thus reduce overfitting [12]. The advantage is that the el is not compromised because of the dropping of variables. Its that the estimators are now biased. The most known method is the ridge oped by the author of [41]. This method adds a penalty term: the squared he coefficient β to the loss function. The general equation of ridge ollows: sue with ridge regression is how to find ridge parameter ƛ. If ƛ is equal to timate will equal to the ordinary least square estimate. However, if the ƛ lead to an underfitting of the model. ƛ is selected by looking for the least oot mean square error (RMSE) within an appropriate decrease in ridge factors for each variable. A ridge trace is used to assist in this. It is a plot versus ƛ. It is used to pick the smallest ƛ at which the coefficients start to tively, a validation dataset is used, find ƛ that minimizes validation SSE. at the reduction in the variance term of the slope parameter is larger than its squared bias. The authors of [42] reviewed estimation methods for ƛ s were suggested. A more recent paper proposed a Bayesian approach to lem of finding the ridge parameter [43]. Simulation result showed that more robust and provide more flexibility in handling multicollinearity. rs of [44] proposed another way of solving the problem of finding the . Their generalized cross-validation approach to is able to find the global ators have been developed from the ridge estimator. The authors of [45] e procedure to reduce the significant amount of bias of estimators from . The author of [46] proposed a new class of estimator, the Liu estimator, ge estimator. It has the added advantage of a simple procedure to find the is is because the estimate is a linear function of ƛ. The author of [47] -Type estimator. They found that the shrinkage of ridge regression is not aced with severe multicollinearity. 
The Liu-Type estimator has a lower ared to ridge regression and fully addresses severe multicollinearity. variations on ridge and Liu-type estimators have been created for use in f regression. The authors of [48] proposed a Liu-type estimator for binary n that is a generalization of the Liu-type estimator for a linear model. The tated that not much attention has been given to shrinkage estimators for ar regression models, such as the Poisson regression model, logistic l, and negative binomial regression model. Therefore, they introduced a hrinkage estimator for negative binomial models. It is a combination of tor and Liu estimator. The authors of [50] modified the Jackknifed ridge ator to form a Modified Jackknifed Poisson ridge regression estimator. ] reviewed the biased estimators in the Poisson regression model in the lticollinearity. The regular maximum likelihood method in estimating icient is not reliable in the presence of multicollinearity. They compared of four estimators in addition to the widely used ridge estimator and type estimators have superior performance over other methods in the on model.
work with. Furthermore, the modern optimization methods depend on subjectively determined indicators of relevance and similarity. This can be seen in [11], where the authors suggested other measures of multicollinearity for future research. It is therefore difficult to say which method is better without directly comparing performance on the same dataset. Better performance may also be due to the specific problem tested.

Modified Estimators Methods
Modified estimators are another approach: they use biased, shrunken estimators in exchange for lower variance and thus reduce overfitting [12]. The advantage is that the theoretical model is not compromised, because no variables are dropped. The disadvantage is that the estimators are now biased. The best-known method is ridge regression, developed by the author of [41]. This method adds a penalty term, the squared magnitude of the coefficient β, to the loss function. The general equation of ridge regression is as follows:

$$L_{\text{ridge}}(\beta) = \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

where y_i is the response and x_i the vector of p predictors for observation i, and λ ≥ 0 is the ridge parameter. The main issue with ridge regression is how to find the ridge parameter λ. If λ is equal to zero, the estimate equals the ordinary least squares estimate. However, if λ is too large, the model will underfit. λ is selected by looking for the least increase in the root mean square error (RMSE) within an appropriate decrease in the ridge variance inflation factors for each variable. A ridge trace is used to assist in this. It is a plot of the coefficients β versus λ, and it is used to pick the smallest λ at which the coefficients start to level off. Alternatively, a validation dataset can be used to find the λ that minimizes the validation sum of squared errors (SSE). The aim is to identify a λ such that the reduction in the variance term of the slope parameter is larger than the increase in its squared bias. The authors of [42] reviewed estimation methods for λ and suggested new methods. A more recent paper proposed a Bayesian approach to the problem of finding the ridge parameter [43]. Simulation results showed that the approach is more robust and provides more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of finding the ridge parameter. Their generalized cross-validation approach is able to find the global minimum.
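As an illustration of choosing the ridge parameter with a validation dataset, the following is a minimal sketch rather than the procedure of any cited work. It assumes NumPy, centred data without an intercept, the closed-form ridge solution (X'X + λI)^(-1) X'y, and a simulated dataset with three nearly collinear predictors. The names `ridge_fit` and `make_data` and the grid of candidate values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def make_data(n, rng):
    """Simulated data (assumption): three predictors sharing one latent factor."""
    z = rng.normal(size=n)
    X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
    y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_data(60, rng)
X_val, y_val = make_data(60, rng)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate grid
coefs, val_sse = {}, {}
for lam in lambdas:
    beta = ridge_fit(X_train, y_train, lam)
    coefs[lam] = beta
    # Validation SSE: squared prediction error on data not used for fitting
    val_sse[lam] = float(np.sum((y_val - X_val @ beta) ** 2))

best_lam = min(val_sse, key=val_sse.get)
for lam in lambdas:
    # Printing the coefficients against lam gives a tabular ridge trace
    print(lam, np.round(coefs[lam], 3), round(val_sse[lam], 2))
print("lambda with the lowest validation SSE:", best_lam)
```

With λ = 0 the function reproduces ordinary least squares, and the norm of the coefficient vector shrinks as λ grows, which is the levelling off that a ridge trace makes visible.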
More estimators have been developed from the ridge estimator. The authors of [45] used a jackknife procedure to reduce the significant bias of the estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure for finding the parameter λ, because the estimate is a linear function of λ. The author of [47] proposed the Liu-type estimator, having found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-type estimator has a lower mean squared error (MSE) than ridge regression and fully addresses severe multicollinearity.
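The linearity of the Liu estimator in its parameter can be checked numerically. The sketch below assumes NumPy and the form in which the Liu estimator is commonly written, (X'X + I)^(-1) (X'y + d β_OLS), where the shrinkage parameter is denoted d (the text above writes it as λ). The simulated data and the function names `ols_fit` and `liu_fit` are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares estimate."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def liu_fit(X, y, d):
    """Liu estimator (assumed form): (X'X + I)^{-1} (X'y + d * beta_ols)."""
    p = X.shape[1]
    beta_ols = ols_fit(X, y)
    return np.linalg.solve(X.T @ X + np.eye(p), X.T @ y + d * beta_ols)

# Simulated data (assumption): three nearly collinear predictors
rng = np.random.default_rng(1)
n = 80
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

beta_ols = ols_fit(X, y)
for d in (0.0, 0.5, 1.0):
    # d = 1 recovers OLS; smaller d shrinks the estimate
    print(d, np.round(liu_fit(X, y, d), 3))
```

Because d enters only through the term d β_OLS, the estimate at d = 0.5 is the midpoint of the estimates at d = 0 and d = 1, which is what makes choosing the parameter simpler than in ridge regression.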
Since then, variations on the ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that generalizes the Liu-type estimator for a linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for generalized linear regression models, such as the Poisson, logistic, and negative binomial regression models. They therefore introduced a two-parameter shrinkage estimator for negative binomial models, which combines the ridge estimator and the Liu estimator. The authors of [50] modified the jackknifed ridge regression estimator to form a modified jackknifed Poisson ridge regression estimator. The author of [2] reviewed the biased estimators in the Poisson regression model in the presence of multicollinearity. The regular maximum likelihood method of estimating regression coefficients is not reliable in the presence of multicollinearity. They compared the performance of four estimators in addition to the widely used ridge estimator and found that Liu-type estimators perform better than the other methods in the Poisson regression model.
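To make the idea of a shrinkage estimator for a Poisson model concrete, the sketch below shows one commonly written form of a Poisson ridge estimator, (X'WX + kI)^(-1) X'WX β_ML with W = diag(μ), applied to a maximum likelihood fit. This is an assumption for illustration and not necessarily the form used in [2], [49], or [50]. It assumes NumPy, a log link, and an intercept in the first column, and for simplicity it shrinks the intercept as well. The data and the names `poisson_ml` and `poisson_ridge` are hypothetical.

```python
import numpy as np

def poisson_ml(X, y, n_iter=50, tol=1e-10):
    """Poisson maximum likelihood by Newton's method (log link).

    Assumes the first column of X is an intercept; it is initialised at
    log(mean(y)) to keep the first Newton steps well behaved.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(np.mean(y))
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        # Newton step: (X'WX)^{-1} X'(y - mu), with W = diag(mu)
        step = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def poisson_ridge(X, y, k):
    """Ridge-type shrinkage of the ML estimate (assumed form, see lead-in)."""
    beta_ml = poisson_ml(X, y)
    mu = np.exp(X @ beta_ml)
    S = X.T @ (mu[:, None] * X)  # weighted cross-product matrix X'WX
    return np.linalg.solve(S + k * np.eye(X.shape[1]), S @ beta_ml)

# Simulated counts (assumption): two strongly correlated predictors
rng = np.random.default_rng(2)
n = 200
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)
x2 = z + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = rng.poisson(np.exp(0.5 + 0.3 * x1 + 0.3 * x2))

beta_ml = poisson_ml(X, y)
print("ML:      ", np.round(beta_ml, 3))
print("ridge k=1:", np.round(poisson_ridge(X, y, 1.0), 3))
```

Setting k = 0 returns the maximum likelihood estimate unchanged, while any k > 0 pulls the coefficient vector toward zero.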

Modified Estimators Methods
Modified Estimators is another approach that use biased and shrunken estimators in exchange for lower variance and thus reduce overfitting [12]. The advantage is that the theoretical model is not compromised because of the dropping of variables. Its disadvantage is that the estimators are now biased. The most known method is the ridge regression developed by the author of [41]. This method adds a penalty term: the squared magnitude of the coefficient β to the loss function. The general equation of ridge regression is as follows: The main issue with ridge regression is how to find ridge parameter ƛ. If ƛ is equal to zero, then the estimate will equal to the ordinary least square estimate. However, if the ƛ is too big, it will lead to an underfitting of the model. ƛ is selected by looking for the least increase in the root mean square error (RMSE) within an appropriate decrease in ridge variable inflation factors for each variable. A ridge trace is used to assist in this. It is a plot of coefficient ꞵ versus ƛ. It is used to pick the smallest ƛ at which the coefficients start to level off. Alternatively, a validation dataset is used, find ƛ that minimizes validation SSE. Identify ƛ such that the reduction in the variance term of the slope parameter is larger than the increment in its squared bias. The authors of [42] reviewed estimation methods for ƛ and new methods were suggested. A more recent paper proposed a Bayesian approach to solving the problem of finding the ridge parameter [43]. Simulation result showed that the approach is more robust and provide more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of solving the problem of finding the ridge parameter. Their generalized cross-validation approach to is able to find the global minimum.
More estimators have been developed from the ridge estimator. The authors of [45] used a jack-knife procedure to reduce the significant amount of bias of estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure to find the parameter ƛ. This is because the estimate is a linear function of ƛ. The author of [47] proposed the Liu-Type estimator. They found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-Type estimator has a lower MSE when compared to ridge regression and fully addresses severe multicollinearity.
Since then, variations on ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that is a generalization of the Liu-type estimator for a linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for generalized linear regression models, such as the Poisson regression model, logistic regression model, and negative binomial regression model. Therefore, they introduced a two-parameter shrinkage estimator for negative binomial models. It is a combination of the ridge estimator and Liu estimator. The authors of [50] modified the Jackknifed ridge regression estimator to form a Modified Jackknifed Poisson ridge regression estimator. The author of [2] reviewed the biased estimators in the Poisson regression model in the presence of multicollinearity. The regular maximum likelihood method in estimating regression coefficient is not reliable in the presence of multicollinearity. They compared the performance of four estimators in addition to the widely used ridge estimator and found that Liu-type estimators have superior performance over other methods in the Poisson regression model. work with. Furthermore, the modern optimization methods depend on subjectively determined indicators of relevance and similarity. This can be seen from [11] where the authors suggested other measures of multicollinearity for future research. It is therefore difficult to suggest which method is better without directly comparing performance on the same dataset. Better performance can also be due to the specific problem tested.

Modified Estimators Methods
Modified Estimators is another approach that use biased and shrunken estimators in exchange for lower variance and thus reduce overfitting [12]. The advantage is that the theoretical model is not compromised because of the dropping of variables. Its disadvantage is that the estimators are now biased. The most known method is the ridge regression developed by the author of [41]. This method adds a penalty term: the squared magnitude of the coefficient β to the loss function. The general equation of ridge regression is as follows: The main issue with ridge regression is how to find ridge parameter ƛ. If ƛ is equal to zero, then the estimate will equal to the ordinary least square estimate. However, if the ƛ is too big, it will lead to an underfitting of the model. ƛ is selected by looking for the least increase in the root mean square error (RMSE) within an appropriate decrease in ridge variable inflation factors for each variable. A ridge trace is used to assist in this. It is a plot of coefficient ꞵ versus ƛ. It is used to pick the smallest ƛ at which the coefficients start to level off. Alternatively, a validation dataset is used, find ƛ that minimizes validation SSE. Identify ƛ such that the reduction in the variance term of the slope parameter is larger than the increment in its squared bias. The authors of [42] reviewed estimation methods for ƛ and new methods were suggested. A more recent paper proposed a Bayesian approach to solving the problem of finding the ridge parameter [43]. Simulation result showed that the approach is more robust and provide more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of solving the problem of finding the ridge parameter. Their generalized cross-validation approach to is able to find the global minimum.
More estimators have been developed from the ridge estimator. The authors of [45] used a jack-knife procedure to reduce the significant amount of bias of estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure to find the parameter ƛ. This is because the estimate is a linear function of ƛ. The author of [47] proposed the Liu-Type estimator. They found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-Type estimator has a lower MSE when compared to ridge regression and fully addresses severe multicollinearity.
Since then, variations on ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that is a generalization of the Liu-type estimator for a linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for generalized linear regression models, such as the Poisson regression model, logistic regression model, and negative binomial regression model. Therefore, they introduced a two-parameter shrinkage estimator for negative binomial models. It is a combination of the ridge estimator and Liu estimator. The authors of [50] modified the Jackknifed ridge regression estimator to form a Modified Jackknifed Poisson ridge regression estimator. The author of [2] reviewed the biased estimators in the Poisson regression model in the presence of multicollinearity. The regular maximum likelihood method in estimating regression coefficient is not reliable in the presence of multicollinearity. They compared the performance of four estimators in addition to the widely used ridge estimator and found that Liu-type estimators have superior performance over other methods in the Poisson regression model. at which the coefficients start to level off. Alternatively, a validation dataset is used, find Mathematics 2022, 10, x FOR PEER REVIEW 9 of 18 work with. Furthermore, the modern optimization methods depend on subjectively determined indicators of relevance and similarity. This can be seen from [11] where the authors suggested other measures of multicollinearity for future research. It is therefore difficult to suggest which method is better without directly comparing performance on the same dataset. Better performance can also be due to the specific problem tested.

Modified Estimators Methods
Modified Estimators is another approach that use biased and shrunken estimators in exchange for lower variance and thus reduce overfitting [12]. The advantage is that the theoretical model is not compromised because of the dropping of variables. Its disadvantage is that the estimators are now biased. The most known method is the ridge regression developed by the author of [41]. This method adds a penalty term: the squared magnitude of the coefficient β to the loss function. The general equation of ridge regression is as follows: The main issue with ridge regression is how to find ridge parameter ƛ. If ƛ is equal to zero, then the estimate will equal to the ordinary least square estimate. However, if the ƛ is too big, it will lead to an underfitting of the model. ƛ is selected by looking for the least increase in the root mean square error (RMSE) within an appropriate decrease in ridge variable inflation factors for each variable. A ridge trace is used to assist in this. It is a plot of coefficient ꞵ versus ƛ. It is used to pick the smallest ƛ at which the coefficients start to level off. Alternatively, a validation dataset is used, find ƛ that minimizes validation SSE. Identify ƛ such that the reduction in the variance term of the slope parameter is larger than the increment in its squared bias. The authors of [42] reviewed estimation methods for ƛ and new methods were suggested. A more recent paper proposed a Bayesian approach to solving the problem of finding the ridge parameter [43]. Simulation result showed that the approach is more robust and provide more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of solving the problem of finding the ridge parameter. Their generalized cross-validation approach to is able to find the global minimum.
More estimators have been developed from the ridge estimator. The authors of [45] used a jack-knife procedure to reduce the significant amount of bias of estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure to find the parameter ƛ. This is because the estimate is a linear function of ƛ. The author of [47] proposed the Liu-Type estimator. They found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-Type estimator has a lower MSE when compared to ridge regression and fully addresses severe multicollinearity.
Since then, variations on ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that is a generalization of the Liu-type estimator for a linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for generalized linear regression models, such as the Poisson regression model, logistic regression model, and negative binomial regression model. Therefore, they introduced a two-parameter shrinkage estimator for negative binomial models. It is a combination of the ridge estimator and Liu estimator. The authors of [50] modified the Jackknifed ridge regression estimator to form a Modified Jackknifed Poisson ridge regression estimator. The author of [2] reviewed the biased estimators in the Poisson regression model in the presence of multicollinearity. The regular maximum likelihood method in estimating regression coefficient is not reliable in the presence of multicollinearity. They compared the performance of four estimators in addition to the widely used ridge estimator and found that Liu-type estimators have superior performance over other methods in the Poisson regression model. ith. Furthermore, the modern optimization methods depend on subjectively ned indicators of relevance and similarity. This can be seen from [11] where the suggested other measures of multicollinearity for future research. It is therefore to suggest which method is better without directly comparing performance on e dataset. Better performance can also be due to the specific problem tested.

ified Estimators Methods
dified Estimators is another approach that use biased and shrunken estimators in e for lower variance and thus reduce overfitting [12]. The advantage is that the cal model is not compromised because of the dropping of variables. Its ntage is that the estimators are now biased. The most known method is the ridge on developed by the author of [41]. This method adds a penalty term: the squared de of the coefficient β to the loss function. The general equation of ridge on is as follows: e main issue with ridge regression is how to find ridge parameter ƛ. If ƛ is equal to en the estimate will equal to the ordinary least square estimate. However, if the ƛ g, it will lead to an underfitting of the model. ƛ is selected by looking for the least in the root mean square error (RMSE) within an appropriate decrease in ridge inflation factors for each variable. A ridge trace is used to assist in this. It is a plot cient ꞵ versus ƛ. It is used to pick the smallest ƛ at which the coefficients start to . Alternatively, a validation dataset is used, find ƛ that minimizes validation SSE. ƛ such that the reduction in the variance term of the slope parameter is larger than ement in its squared bias. The authors of [42] reviewed estimation methods for ƛ methods were suggested. A more recent paper proposed a Bayesian approach to the problem of finding the ridge parameter [43]. Simulation result showed that roach is more robust and provide more flexibility in handling multicollinearity. e authors of [44] proposed another way of solving the problem of finding the rameter. Their generalized cross-validation approach to is able to find the global m. re estimators have been developed from the ridge estimator. The authors of [45] ack-knife procedure to reduce the significant amount of bias of estimators from gression. The author of [46] proposed a new class of estimator, the Liu estimator, n the ridge estimator. It has the added advantage of a simple procedure to find the ter ƛ. 
This is because the estimate is a linear function of ƛ. The author of [47] d the Liu-Type estimator. They found that the shrinkage of ridge regression is not when faced with severe multicollinearity. The Liu-Type estimator has a lower en compared to ridge regression and fully addresses severe multicollinearity. ce then, variations on ridge and Liu-type estimators have been created for use in t types of regression. The authors of [48] proposed a Liu-type estimator for binary regression that is a generalization of the Liu-type estimator for a linear model. The of [49] stated that not much attention has been given to shrinkage estimators for zed linear regression models, such as the Poisson regression model, logistic on model, and negative binomial regression model. Therefore, they introduced a ameter shrinkage estimator for negative binomial models. It is a combination of e estimator and Liu estimator. The authors of [50] modified the Jackknifed ridge on estimator to form a Modified Jackknifed Poisson ridge regression estimator. hor of [2] reviewed the biased estimators in the Poisson regression model in the e of multicollinearity. The regular maximum likelihood method in estimating on coefficient is not reliable in the presence of multicollinearity. They compared ormance of four estimators in addition to the widely used ridge estimator and hat Liu-type estimators have superior performance over other methods in the such that the reduction in the variance term of the slope parameter is larger than the increment in its squared bias. The authors of [42]

reviewed estimation methods for
Mathematics 2022, 10, x FOR PEER REVIEW work with. Furthermore, the modern optimization methods determined indicators of relevance and similarity. This can be authors suggested other measures of multicollinearity for futu difficult to suggest which method is better without directly co the same dataset. Better performance can also be due to the spe

Modified Estimators Methods
Modified Estimators is another approach that use biased an exchange for lower variance and thus reduce overfitting [12]. theoretical model is not compromised because of the dr disadvantage is that the estimators are now biased. The most k regression developed by the author of [41]. This method adds a magnitude of the coefficient β to the loss function. The g regression is as follows: The main issue with ridge regression is how to find ridge p zero, then the estimate will equal to the ordinary least square e is too big, it will lead to an underfitting of the model. ƛ is select increase in the root mean square error (RMSE) within an appr variable inflation factors for each variable. A ridge trace is used of coefficient ꞵ versus ƛ. It is used to pick the smallest ƛ at whi level off. Alternatively, a validation dataset is used, find ƛ that m Identify ƛ such that the reduction in the variance term of the slop the increment in its squared bias. The authors of [42] reviewed and new methods were suggested. A more recent paper propos solving the problem of finding the ridge parameter [43]. Simu the approach is more robust and provide more flexibility in h Later, the authors of [44] proposed another way of solving th ridge parameter. Their generalized cross-validation approach t minimum.
More estimators have been developed from the ridge estim used a jack-knife procedure to reduce the significant amount o ridge regression. The author of [46] proposed a new class of est based on the ridge estimator. It has the added advantage of a sim parameter ƛ. This is because the estimate is a linear function proposed the Liu-Type estimator. They found that the shrinkage effective when faced with severe multicollinearity. The Liu-Ty MSE when compared to ridge regression and fully addresses se Since then, variations on ridge and Liu-type estimators ha different types of regression. The authors of [48] proposed a Liu logistic regression that is a generalization of the Liu-type estima authors of [49] stated that not much attention has been given to generalized linear regression models, such as the Poisson r regression model, and negative binomial regression model. Th two-parameter shrinkage estimator for negative binomial mod the ridge estimator and Liu estimator. The authors of [50] mod regression estimator to form a Modified Jackknifed Poisson ri The author of [2] reviewed the biased estimators in the Poisso presence of multicollinearity. The regular maximum likeliho regression coefficient is not reliable in the presence of multicol the performance of four estimators in addition to the widely and new methods were suggested. A more recent paper proposed a Bayesian approach to solving the problem of finding the ridge parameter [43]. Simulation result showed that the approach is more robust and provide more flexibility in handling multicollinearity. Later, the authors of [44] proposed another way of solving the problem of finding the ridge parameter. Their generalized cross-validation approach to is able to find the global minimum.
More estimators have been developed from the ridge estimator. The authors of [45] used a jack-knife procedure to reduce the significant amount of bias of estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure to find the parameter IEW 9 of 18 ork with. Furthermore, the modern optimization methods depend on subjectively etermined indicators of relevance and similarity. This can be seen from [11] where the uthors suggested other measures of multicollinearity for future research. It is therefore ifficult to suggest which method is better without directly comparing performance on e same dataset. Better performance can also be due to the specific problem tested.

.2. Modified Estimators Methods
Modified Estimators is another approach that use biased and shrunken estimators in xchange for lower variance and thus reduce overfitting [12]. The advantage is that the eoretical model is not compromised because of the dropping of variables. Its isadvantage is that the estimators are now biased. The most known method is the ridge gression developed by the author of [41]. This method adds a penalty term: the squared agnitude of the coefficient β to the loss function. The general equation of ridge gression is as follows: The main issue with ridge regression is how to find ridge parameter ƛ. If ƛ is equal to ero, then the estimate will equal to the ordinary least square estimate. However, if the ƛ too big, it will lead to an underfitting of the model. ƛ is selected by looking for the least crease in the root mean square error (RMSE) within an appropriate decrease in ridge ariable inflation factors for each variable. A ridge trace is used to assist in this. It is a plot f coefficient ꞵ versus ƛ. It is used to pick the smallest ƛ at which the coefficients start to vel off. Alternatively, a validation dataset is used, find ƛ that minimizes validation SSE. entify ƛ such that the reduction in the variance term of the slope parameter is larger than e increment in its squared bias. The authors of [42] reviewed estimation methods for ƛ nd new methods were suggested. A more recent paper proposed a Bayesian approach to lving the problem of finding the ridge parameter [43]. Simulation result showed that e approach is more robust and provide more flexibility in handling multicollinearity. ater, the authors of [44] proposed another way of solving the problem of finding the dge parameter. Their generalized cross-validation approach to is able to find the global inimum.
More estimators have been developed from the ridge estimator. The authors of [45] used a jackknife procedure to reduce the significant bias of the estimators from ridge regression. The author of [46] proposed a new class of estimator, the Liu estimator, based on the ridge estimator. It has the added advantage of a simple procedure for finding the parameter λ, because the estimate is a linear function of λ. The author of [47] proposed the Liu-type estimator, having found that the shrinkage of ridge regression is not effective when faced with severe multicollinearity. The Liu-type estimator has a lower MSE than ridge regression and fully addresses severe multicollinearity.
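To make the linearity argument concrete, the Liu estimator is commonly written in the literature in the form below. The symbol d for the biasing parameter (the parameter the text above calls λ) and the shorthand for the OLS estimate are notational assumptions of this sketch, not taken from the reviewed paper. For comparison, the ridge estimator places its parameter inside the matrix inverse, so it is not linear in that parameter.

```latex
% Liu estimator: linear in the biasing parameter d
\hat{\beta}_{d} = (X^{\top}X + I)^{-1}\left(X^{\top}Y + d\,\hat{\beta}_{\mathrm{OLS}}\right),
\qquad 0 < d < 1,
\qquad \hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}Y

% Ridge estimator: the parameter sits inside the inverse
\hat{\beta}_{\lambda} = (X^{\top}X + \lambda I)^{-1}X^{\top}Y
```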
Since then, variations on ridge and Liu-type estimators have been created for use in different types of regression. The authors of [48] proposed a Liu-type estimator for binary logistic regression that generalizes the Liu-type estimator for the linear model. The authors of [49] stated that not much attention has been given to shrinkage estimators for generalized linear regression models, such as the Poisson, logistic, and negative binomial regression models. Therefore, they introduced a two-parameter shrinkage estimator for negative binomial models that combines the ridge estimator and the Liu estimator. The authors of [50] modified the jackknifed ridge regression estimator to form a modified jackknifed Poisson ridge regression estimator. The author of [2] reviewed the biased estimators in the Poisson regression model in the presence of multicollinearity, where the regular maximum likelihood method of estimating the regression coefficients is not reliable. They compared the performance of four estimators in addition to the widely used ridge estimator and found that Liu-type estimators have superior performance over other methods in the Poisson regression model. The authors of [51] proposed a partial ridge regression to solve three problems of regular ridge regression: bias is applied to all variables regardless of their degree of multicollinearity, stability is achieved at the cost of MSE, and the selection method of the ridge parameter is arbitrary. The proposed method applies the ridge parameter only to variables with a high degree of collinearity. In this way, the precision of the parameter estimator improves while the MSE is kept close to that of OLS. Estimates are closer to a true OLS estimate, β, and overall variance is reduced significantly. It outperforms existing methods in terms of bias, MSE, and relative efficiency.
The Lasso regression is a method developed by the author of [52] in response to a problem shared by stepwise regression and ridge regression, namely interpretability. Stepwise regression is interpretable, but the process is very discrete, as it is not known why variables are included in or dropped from the model. Ridge regression handles multicollinearity very well because of the stability of the shrunken coefficients. However, it does not reduce any coefficient to zero, which results in models that are hard to interpret. The Lasso is known as L1 regularization, while ridge regression is known as L2 regularization. The main difference between the two is that Lasso reduces certain parameter estimates to zero, which also serves to select variables. The equation is shown below:
$$\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \lVert Y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$$

The penalty term, the absolute value of the coefficient β, is added to the loss function. In this equation, Y is an (n × 1) vector of responses, X is an (n × p) matrix of predictor variables, and β is a (p × 1) vector of unknown constants. As with ridge regression, when the λ value is very large, the coefficients approach zero. Ridge regression shrinks the estimator but does no variable selection, while Lasso achieves both. For this reason, Lasso is more desirable: it is more parsimonious and therefore better at explaining the relationship between independent and dependent variables.
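Why the L1 penalty produces exact zeros while the L2 penalty does not is easiest to see in a simplified setting. The sketch below assumes an orthonormal design (X'X = I), which is not the multicollinear case but gives closed-form solutions for the two penalties described above: ridge rescales every OLS coefficient by 1/(1 + λ), whereas Lasso soft-thresholds each coefficient at λ/2. The function names and the example coefficients are hypothetical.

```python
def ridge_shrink(b_ols, lam):
    """Ridge solution under an orthonormal design: proportional shrinkage."""
    return [b / (1.0 + lam) for b in b_ols]


def soft_threshold(value, threshold):
    """Move value toward zero by threshold, and set it to zero if it is within threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_shrink(b_ols, lam):
    """Lasso solution under an orthonormal design for ||Y - X b||^2 + lam * sum|b_j|."""
    return [soft_threshold(b, lam / 2.0) for b in b_ols]


b_ols = [3.0, -1.5, 0.4]  # illustrative OLS coefficients
for lam in (0.5, 1.0, 4.0):
    print(lam, ridge_shrink(b_ols, lam), lasso_shrink(b_ols, lam))
```

Ridge coefficients become small but never exactly zero for a finite λ, while the Lasso coefficients whose magnitude falls below λ/2 are removed from the model altogether.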
When faced with multicollinearity, Ridge and Lasso perform differently. Ridge tends to spread the effect evenly and shrink the estimators of all the variables. Lasso, by contrast, is unstable and tends to retain one of the variables and eliminate all the others. Lasso also performs poorly when the number of variables, p, is greater than the number of observations, n, because it selects at most n variables. When n > p and the predictors are highly correlated, the performance of Lasso is not as good as that of Ridge regression. The authors of [53] proposed the elastic net, which combines Ridge and Lasso regression. It has the advantages of both regularization methods and also shows a grouping effect: the elastic net groups variables that are highly correlated and either drops or retains all of them together. Typically, cross-validation is used to choose the tuning parameter. It was originally used by [54]. In cross-validation, a subset of the sample is held out in order to validate the performance.
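The contrast between Lasso's instability and the grouping effect of the elastic net can be sketched with a small coordinate descent routine. The penalty parameterization (`lam1` for the L1 term, `lam2` for the L2 term), the function names, and the toy data with two identical predictors are assumptions made for illustration; the reviewed papers may parameterize the elastic net differently.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold, and set it to zero if it is within threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, sweeps=200):
    """Coordinate descent for ||y - X b||^2 + lam1 * sum|b_j| + lam2 * sum b_j^2.

    No intercept is fitted. With lam2 = 0 this reduces to the Lasso.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    z = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # rho: inner product of column j with the residual that ignores column j
            rho = 0.0
            for i in range(n):
                partial = y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * partial
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (z[j] + lam2)
    return beta


# Two perfectly collinear predictors (identical columns); y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
X = [[v, v] for v in x]
y = [2.1, 3.9, 6.2, 7.8]

lasso_beta = elastic_net_cd(X, y, lam1=1.0, lam2=0.0)
enet_beta = elastic_net_cd(X, y, lam1=1.0, lam2=30.0)
print("lasso:      ", lasso_beta)
print("elastic net:", enet_beta)
```

In this toy case the Lasso run, started from zero, puts essentially all of the weight on whichever duplicate it updates first, whereas the added L2 term makes the objective strictly convex, so the elastic net assigns the two duplicates equal coefficients.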
The authors of [55] developed the least angle regression (LARS) algorithm. It takes inspiration from Lasso and stagewise regression and aims to be a computationally simpler method. LARS begins similarly to forward regression: it starts with all coefficients equal to zero and then adds the predictor most correlated with the response. It moves in the direction of that predictor until another variable has as much correlation with the current residual. LARS then proceeds equiangularly between the predictors, along the "least angle direction", until the next most correlated variable enters. The authors of [56] also improved upon the Lasso regression by using a mixed-integer programming approach. It eliminates structured noise and thus performs better in a high-dimensional environment where p > n. The authors of [57] further expanded on the idea and developed several penalized mixed-integer nonlinear programming models. The models are also solvable by a metaheuristic algorithm.
The authors of [58] introduced a strictly concave penalty function called the modified log penalty, in contrast to the strictly convex penalty of the elastic net. It is aimed at achieving a parsimonious model even under the effects of multicollinearity, whereas methods such as the elastic net tend to focus on the grouping effect, which means that collinear variables are included together. Table 3 provides a summary of the findings for modified estimator approaches. Modified estimators aim to improve the efficiency of parameter estimation in the presence of multicollinearity. This comes with a bias-variance trade-off. Researchers can select a method based on their purpose, such as the grouping effect or parsimony. However, it can require extensive knowledge to know which one works better on a given problem. For example, some methods have been shown to work better under high or low dimensionality or under a particular degree of multicollinearity. Moreover, some methods are designed for linear regression, and modifications need to be made for predictions with other functional forms or for classification problems.

Machine Learning Methods
In this section, we present the overall state of the multicollinearity problem in machine learning and introduce algorithms that deal with it implicitly. Neural networks have been reported to outperform traditional statistical models on such data. The authors of [62] used a feedforward artificial neural network to model data with multicollinearity and found that it has much better performance in terms of RMSE than the traditional ordinary least squares (OLS). This suggests that machine learning methods with more complex architectures have the potential to produce much better estimates of the parameters than statistical methods. The authors of [51] provided reasons why a machine learning algorithm might be better: such algorithms require no assumptions about the function, can uncover complex patterns, and can dynamically learn changing relationships.
Next, variable selection methods have also been applied in neural networks. The authors of [63] proposed a hybrid method that combines factor analysis and an artificial neural network (ANN) to combat multicollinearity. Because an ANN is not able to perform variable selection, PCA is used to extract components, and the ANN is then applied to the components. This method is named FA-ANN (factor analysis-artificial neural network). It was compared with regression analysis and genetic programming, and FA-ANN had the best accuracy among them. The advantage of FA-ANN and genetic programming is that they are not based on any statistical assumptions, so they are more reliable and trustworthy. In addition, they can generalize over new sample data, unlike regression analysis. However, they are considered black-box models and are hard to interpret. A more recent version of this approach has been used in quality control research. The authors of [64] proposed a residual (r) control chart for data with multicollinearity. They suggested a neural network because a generalized linear model (GLM) may not work best on asymmetrically distributed data. They concluded that the neural network model and functional PCA (FPCA) can deal with high-dimensional and correlated data.
Furthermore, regularization and penalty mechanisms can also be used to address multicollinearity in machine learning models. Examples include the Regularized OS-ELM algorithm [65], OS-ELM Time-varying (OS-ELM-TV) [66], the Timeliness Online Sequential ELM algorithm [67], the Least Squares Incremental ELM algorithm [68], and Regularized Recursive Least-Squares [69]. However, these mechanisms increase the computational complexity. For this reason, the authors of [70] proposed a method called the Kalman Learning Machine (KLM). It is an Extreme Learning Machine (ELM) that uses a Kalman filter to update the output weights of a Single Layer Feedforward Network (SLFN). A Kalman filter is a set of equations that efficiently estimates the state of a process in a way that minimizes the mean squared error. The state is not updated in the learning stage, in keeping with the concept of ELM. The resulting model has been shown to outperform basic machine learning models in prediction error (RMSE) and computing time. However, it requires manual optimization by humans. A constructive approach to building the model is suggested.
Although deep learning (DL) has emerged as an efficient method for automatically learning data representations without feature engineering, its discussion in terms of multicollinearity is very limited. With this motivation, this paper discusses the properties of neural networks, such as the convolutional neural network (CNN), recurrent neural network (RNN), attention mechanism, and graph neural network, before illustrating examples of how they mitigate the multicollinearity issue.
CNN is a neural network that was first introduced by the authors of [71] in the field of computer vision. It developed the concept of local receptive fields and shared weights to reduce the number of network parameters. Its way of addressing relationships between features is of particular interest. Traditional deep neural networks suffer from an explosion in the number of parameters. CNN adopts multiple convolutional and pooling (subsampling) layers to detect the most representative features before connecting to a fully connected network for prediction. Specifically, the convolutional layer applies multiple feature extractors (filters) to detect local features and produces a corresponding feature map to represent each local feature. The composition of multiple feature maps may represent the entire series. The pooling layer is a dimension-reducing method that extracts the most representative features and lowers the noise. The generated feature maps are likely to be independent of each other and can potentially mitigate the multicollinearity problem. For example, the authors of [72] proposed the CNNpred framework to model the correlations among different variables to predict stock market movement. Two variants of CNNpred, namely 2D-CNNpred and 3D-CNNpred, were introduced in their paper to extract combined features from a diverse set of input data. The data comprise five major US stock market indices, currency exchange rates, futures contracts, commodity prices, treasury bill rates, etc. Their results showed a significant predictive improvement compared to the state-of-the-art baseline. Another interesting study by the authors of [73] proposed integrating the features learned from different representations of the same data to predict stock market movement. They employed chart images (e.g., Candle bar, Line bar, and F-line bar) derived from stock prices as additional input to predict the SPDR S&P 500 ETF movement. The proposed model ensembled Long Short-Term Memory (LSTM) and CNN models to exploit their advantages in extracting temporal and image features, respectively. The results showed that the prediction error can be efficiently reduced by integrating the temporal and image features from the same data.
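The convolution and pooling steps described above can be sketched for a one-dimensional series. In a real CNN the filters are learned during training; here two hand-picked filters (a "difference" and an "average" filter) and a made-up series are used purely for illustration, and all names are hypothetical.

```python
def conv1d(series, kernel):
    """Valid (no padding) 1D cross-correlation, the operation used in CNN layers."""
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


series = [1.0, 2.0, 4.0, 3.0, 1.0, 0.0, 2.0, 5.0]
filters = {
    "difference": [-1.0, 1.0],  # responds to local increases
    "average": [0.5, 0.5],      # responds to the local level
}

# One pooled feature map per filter
feature_maps = {name: max_pool1d(relu(conv1d(series, kernel)), 2)
                for name, kernel in filters.items()}
for name, fmap in feature_maps.items():
    print(name, fmap)
```

Each filter yields its own shorter feature map, so the eight raw inputs are summarized by a few pooled values per filter rather than passed on as eight correlated inputs.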
Other than feature maps, there is another influential development, namely the attention mechanism in the Recurrent Neural Network (RNN). RNN was first proposed by the author of [74] to process sequential information. According to [75], the term "recurrent" describes the general architecture, in which the same function is applied to each element of the sequence and the computed output of the previous element is retained in the internal memory of the RNN until the end of the sequence. In this way, an RNN compresses the information and produces a fixed-size vector to represent a sequence. The recurrence operation of RNN is advantageous for series data since the inherent information of a sequence can be effectively captured. Unlike CNN, RNN is more flexible in modeling sequences of variable length and can capture unbounded contextual information. However, the authors of [76] criticized the recurrent-based model as problematic in handling long-range dependencies in data due to the memory compression issue, in which the neural network struggles to compress all the necessary information from a long input sequence into a fixed-length vector. In other words, it is difficult to represent the entire input sequence with a fixed-length vector without any information loss. Despite the help of the gated activation function, the forgetting issue of RNN-based models becomes serious as the length of the input sequence grows. For this reason, the attention mechanism was proposed to deal with the long-range dependency issue by enabling the model to focus on the relevant part of the input sequence when predicting a certain part of the output sequence.
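The fixed-size compression described above can be seen in a minimal Elman-style recurrence. The weights `W_IN` and `W_REC` are arbitrary hand-picked values (in practice they are learned), the inputs are scalars, and the function name is hypothetical; the sketch only shows that sequences of any length end up in a state vector of the same size.

```python
import math


def rnn_encode(sequence, w_in, w_rec, hidden_size):
    """Apply h_t = tanh(w_in * x_t + w_rec @ h_{t-1}) over a sequence of scalars.

    Returns the final hidden state, a fixed-size summary of the sequence.
    """
    h = [0.0] * hidden_size
    for x in sequence:
        h = [math.tanh(w_in[i] * x
                       + sum(w_rec[i][j] * h[j] for j in range(hidden_size)))
             for i in range(hidden_size)]
    return h


W_IN = [0.5, -0.3, 0.8]        # illustrative input weights
W_REC = [[0.1, 0.0, -0.2],     # illustrative recurrent weights
         [0.3, 0.1, 0.0],
         [0.0, -0.4, 0.2]]

short_state = rnn_encode([0.5, -1.0], W_IN, W_REC, 3)
long_state = rnn_encode([0.1 * t for t in range(50)], W_IN, W_REC, 3)
print(len(short_state), short_state)
print(len(long_state), long_state)
```

A 2-step and a 50-step sequence are both reduced to three numbers, which is the source of the information loss on long inputs that motivates the attention mechanism.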
According to [77], the attention mechanism was designed to simulate visual attention, where humans adjust their focal point over time and perceive a particular region of an image in "high resolution" while perceiving the surrounding image in "low resolution". Similarly, the attention mechanism enables the model to learn to assign different weights to the input features according to their contribution, and it may capture asymmetric influence between the features to mitigate the multicollinearity problem. For example, the authors of [78] proposed a CNN based on a deep factorization machine and the attention mechanism (FA-CNN) to enhance feature learning. In addition to capturing temporal influence, the attention mechanism enables modeling of the intraday interaction between the input features. The result showed a 7.38% improvement over LSTM in predicting stock movement.
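How attention assigns different weights according to contribution can be sketched in a few lines of plain Python. This sketch assumes dot-product scoring followed by a softmax, which is one common simplification and not necessarily the scoring function used in [76], [77], or [78]. The query and the state vectors are hypothetical.

```python
import math

# Illustrative sketch only: dot-product scoring is an assumption made here.

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Weight each state by its similarity to the query and average them."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, states))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical hidden states for three time steps, and a hypothetical query.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, context = attend(query, states)
print(weights)
print(context)
```

The second state has no component along the query direction, so it receives the smallest weight, while the first and third states receive equal, larger weights. The context vector is built from the whole sequence rather than from the final hidden state alone.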
Recently, another promising line of research applies Graph Convolutional Networks (GCN) or graph embeddings to series data. Graph neural networks convert series data into graph-structured data, which enables the model to capture the interconnectivity between the nodes. Modeling this interconnectivity or correlation is useful in reducing the multicollinearity effect. For example, the authors of [79] proposed the hierarchical graph attention network (HATS) to process relational data for stock market prediction. Their study defined the stock market graph as a spatial-temporal graph where each individual datasets. The authors of [82] compared various statistical and machine learning methods. It is important to note that comparisons between methods have their drawbacks. For example, different tuning parameters can affect the performance of the methods. The author of [14] considered domain knowledge in the studied field to be important in selecting variables, as statistics alone is not enough in practice. All the methods perform to different degrees when dealing with different types of data.
Variable selection and modified estimators can be used together. The number of features can first be reduced rapidly to below the number of samples, and modified estimators can then be applied. This combination can be seen in machine learning papers. The findings of this review are that variable selection drops variables and therefore reduces information gain, while the choice of multicollinearity measure to optimize is subjective. In addition, modified estimators perform inconsistently depending on the data and cannot be applied to every problem. The literature review also showed that machine learning algorithms are better than the simple OLS estimator at fitting data with multicollinearity, and that they do not need information on the relationships among the data or on its distribution. This paper suggests that the relevancy and redundancy concept from feature selection can be adopted when training a machine learning model.
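One way to read the relevancy and redundancy concept is as a greedy selection rule: prefer features that are strongly related to the response but weakly related to the features already chosen. The sketch below is a minimal illustration under stated assumptions and is not a method taken from the reviewed papers. It uses the absolute Pearson correlation for both relevance and redundancy, and the feature names (x1, x2, x3) and data are hypothetical.

```python
import math

# Illustrative sketch only: absolute Pearson correlation is an assumed measure.

def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def select_features(features, target, k):
    """Greedily pick k features by relevance minus mean redundancy."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = abs(pearson(features[name], target))
        if not selected:
            return relevance
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: x2 is almost exactly 2 * x1, so x1 and x2 are collinear.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "x3": [1.0, -1.0, 2.0, -2.0, 1.0, -1.0],
}
target = [a + b for a, b in zip(features["x1"], features["x3"])]

selected_names = select_features(features, target, 2)
print(selected_names)
```

In this example the rule keeps only one of the two collinear features (x1 or x2) and then prefers x3, because the other collinear feature adds almost no information beyond what was already selected.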