The Higher-Order of Adaptive Lasso and Elastic Net Methods for Classification on High Dimensional Data

Abstract: The lasso and elastic net methods are popular techniques for parameter estimation and variable selection. The adaptive lasso and adaptive elastic net methods place adaptive weights on the penalty function, built from the lasso and elastic net estimates; the adaptive weight depends on the power order of the initial estimator. These methods usually address parameter estimation in linear regression models, where both the dependent and independent variables are measured on a continuous scale. In this paper, we compare the lasso and elastic net methods with the higher-order adaptive lasso and adaptive elastic net methods for classification on high dimensional data. Classification assigns the categorical dependent variable to a class given the independent variables, which leads to the logistic regression model. The dependent variable is a binary variable, and the independent variables are continuous. The data are high dimensional when the number of independent variables exceeds the sample size. In this study, we simulate logistic regression data with a binary dependent variable and 20, 30, 40, and 50 independent variables, with sample sizes smaller than the number of independent variables. The independent variables are generated from normal distributions with several variances, and the dependent variable is obtained by transforming the probability from the logit function into predicted binary values. As an application to real data, we classify the type of leukemia as the dependent variable using a subset of gene expressions as the independent variables. The methods are compared by the average percentage of predicted accuracy.
The results show that the higher-order adaptive lasso method performs well under large dispersion, whereas the higher-order adaptive elastic net method outperforms under small dispersion.


Introduction
Regression analysis is a statistical method for estimating the relationship between a dependent variable and one or more independent variables. The fitted regression model is used to predict a continuous dependent variable from several independent variables. If the dependent variable is measured on a dichotomous scale, then logistic regression should be used to explain the relationship between the binary dependent variable and one or more independent variables.
Logistic regression analysis focuses on predicting whether or not an event occurs, such as failure or success, diseased or healthy, or yes or no. In particular, patient decision making is widely studied in the health sciences by classifying such events. One application of the logistic regression model examined a cohort of pregnant women and the factors that influence the decision to opt for caesarean delivery or vaginal birth [1]. Logistic regression analysis has also been used to evaluate the effect of the number of events per variable in a study of patients in which deaths occurred [2].
However, parameter estimation is a key step in predicting the probability from the logit function, which is then used to classify the categorical classes of the dependent variable. There

The Logistic Regression Model
The general class of logistic regression models is written as

y_i = π(x_i) + ε_i, (1)

where y_i denotes the value of a dichotomous outcome variable, π(x_i) denotes the probability of the Bernoulli distribution given the independent variable x_i, and ε_i is the error term. The probability π(x_i) is

π(x_i) = exp(x_i^T β) / (1 + exp(x_i^T β)). (2)

The transformation of π(x_i) is central to the logistic regression model and is called the logit function. This transformation is defined as

g(x_i) = ln[ π(x_i) / (1 − π(x_i)) ] = x_i^T β, (3)

where β is the (k + 1) × 1 vector of coefficients of the logit transformation, x_i is a row of the n × (k + 1) matrix of independent variables, k is the number of independent variables, and i = 1, 2, . . . , n indexes the observations. If y_i is coded as 0 or 1, then the probability π(x_i) gives y_i = 1, and it follows that the probability 1 − π(x_i) gives y_i = 0. The probability distribution function contributing to the likelihood function is expressed as

f(y_i) = π(x_i)^{y_i} [1 − π(x_i)]^{1 − y_i}. (4)

The likelihood function is obtained from (4) as

L(β) = ∏_{i=1}^{n} π(x_i)^{y_i} [1 − π(x_i)]^{1 − y_i}. (5)

Taking the logarithm of (5) gives

ℓ(β) = ∑_{i=1}^{n} { y_i ln π(x_i) + (1 − y_i) ln[1 − π(x_i)] }. (6)

The log-likelihood function in Equation (6) can be written in penalized form as follows [10]:

ℓ_p(β) = ℓ(β) − λ J(β), (7)

where λ is the tuning parameter and J(β) is the penalty function.
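The paper's computations use R (glmnet); purely to illustrate the logit probability and the log-likelihood above, here is a minimal NumPy sketch with invented toy data (all values are hypothetical):

```python
import numpy as np

def logit_prob(X, beta):
    """pi(x_i) = exp(x_i' beta) / (1 + exp(x_i' beta))."""
    eta = X @ beta
    return 1.0 / (1.0 + np.exp(-eta))

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i log pi_i + (1 - y_i) log(1 - pi_i) ]."""
    p = logit_prob(X, beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: n = 4 observations, intercept plus one covariate.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
beta = np.array([0.0, 1.0])   # hypothetical coefficient vector

probs = logit_prob(X, beta)   # probabilities that y_i = 1
ll = log_likelihood(beta, X, y)
```

A penalized method maximizes `ll` minus a penalty term, as in (7).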

LASSO Method
The least absolute shrinkage and selection operator (lasso) [6] is a method for estimating parameters in the linear model by minimizing the residual sum of squares subject to a bound on the sum of the absolute values of the coefficients. The lasso shrinks by setting some coefficients to zero and hence tends to retain only the good features for variable selection. The lasso estimate of β is defined by

β̂_L = argmin_β { ∑_{i=1}^{n} (y_i − x_i^T β)² + λ ∑_{j=1}^{k} |β_j| }, (8)

where λ ∑_{j=1}^{k} |β_j| is the penalty function.
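Although the paper fits the lasso in R with glmnet, the effect of the penalty is easy to see in the special case of an orthonormal design, where the lasso solution is a soft-thresholding of the ordinary least squares coefficients. A small sketch with illustrative values (not from the paper):

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution under an orthonormal design: shrink each OLS
    coefficient toward zero by lam; coefficients with |b| <= lam
    become exactly zero, which is how the lasso selects variables."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

beta_ols = np.array([3.0, -0.4, 1.5, 0.05])   # hypothetical OLS fit
beta_lasso = soft_threshold(beta_ols, lam=0.5)
# The two small coefficients are removed from the model entirely.
```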
For a binary dependent variable, the lasso estimate of β is regularized from (6) and (7) as

β̂_L = argmax_β { ℓ(β) − λ ∑_{j=1}^{k} |β_j| }. (9)

The tuning parameter λ is chosen by trying out different values with the cross-validation method. Lasso estimators for all values of λ can be computed through a modification of the least-angle regression (LARS) algorithm [12], which is an algorithm for fitting linear regression models to high dimensional data.
Zou [8] proposed the adaptive lasso, in which adaptive weights are used to penalize different coefficients in the penalty function. The adaptive lasso estimator is defined as

β̂_AL = argmin_β { ∑_{i=1}^{n} (y_i − x_i^T β)² + λ ∑_{j=1}^{k} ŵ_j |β_j| }. (10)

The adaptive lasso can also be applied to classify a dichotomous outcome variable; thus, the adaptive lasso estimate of β is regularized from (6) and (7) as

β̂_AL = argmax_β { ℓ(β) − λ ∑_{j=1}^{k} ŵ_j |β_j| }. (11)

The adaptive weight is ŵ_j = 1 / |β̂_{j(L)}|^γ, j = 1, 2, . . . , k, γ > 0, where β̂_{j(L)} is obtained from (8). The positive value γ is the power order of the adaptive weight, which is the quantity of interest in the higher-order methods. The tuning parameter λ and the order of the adaptive weight γ are tuned together by two-dimensional cross-validation for the adaptive lasso.
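The adaptive weights ŵ_j = 1/|β̂_{j(L)}|^γ can be sketched directly; the initial coefficients below are invented for illustration (a small epsilon guards against division by an exactly zero estimate, a detail not discussed in the paper):

```python
import numpy as np

def adaptive_weights(beta_init, gamma, eps=1e-8):
    """w_j = 1 / |beta_j|^gamma: a larger gamma penalizes coefficients
    that the initial lasso fit already shrank toward zero much harder."""
    return 1.0 / (np.abs(beta_init) + eps) ** gamma

beta_init = np.array([2.0, 0.5, 0.1])       # hypothetical lasso estimates
w1 = adaptive_weights(beta_init, gamma=1)   # first-order weights
w2 = adaptive_weights(beta_init, gamma=2)   # higher-order weights
```

Raising γ widens the gap between weights on strong and weak coefficients, which is the mechanism behind the "higher-order" methods compared in this paper.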

Elastic Net Method
Zou and Hastie [7] proposed the elastic net as a new regularization and variable selection method. The elastic net penalty has the characteristics of both the lasso and ridge regression [4]: in estimating the parameters, the elastic net starts from the ridge regression coefficients and applies the lasso shrinkage along the lasso solution path. The elastic net estimate of β is defined by

β̂_E = argmin_β { ∑_{i=1}^{n} (y_i − x_i^T β)² + λ₁ ∑_{j=1}^{k} |β_j| + λ₂ ∑_{j=1}^{k} β_j² }. (12)

To consider the classification of y_i, the elastic net estimate of β is regularized from (6) and (7) as

β̂_E = argmax_β { ℓ(β) − λ₁ ∑_{j=1}^{k} |β_j| − λ₂ ∑_{j=1}^{k} β_j² }. (13)

From (12), the elastic net estimator reduces to ridge regression when λ₁ is zero,

β̂_R = argmin_β { ∑_{i=1}^{n} (y_i − x_i^T β)² + λ₂ ∑_{j=1}^{k} β_j² }, (14)

and to the lasso estimator in (8) when λ₂ is zero. The tuning parameters λ₁ and λ₂ control the shrinkage of β̂_E and are chosen using cross-validation [13].
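The two limiting cases of the elastic net penalty can be checked numerically; this is a small sketch of the penalty term alone, with illustrative values:

```python
import numpy as np

def elastic_net_penalty(beta, lam1, lam2):
    """lam1 * sum |beta_j|  +  lam2 * sum beta_j^2."""
    return lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta ** 2)

beta = np.array([1.0, -2.0, 0.0])
p_en = elastic_net_penalty(beta, lam1=0.5, lam2=0.5)     # mixed penalty
p_lasso = elastic_net_penalty(beta, lam1=0.5, lam2=0.0)  # lasso limit
p_ridge = elastic_net_penalty(beta, lam1=0.0, lam2=0.5)  # ridge limit
```

Setting λ₂ = 0 leaves only the absolute-value (lasso) term, and λ₁ = 0 leaves only the quadratic (ridge) term, matching the limits described above.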
The adaptive elastic net was developed from the elastic net to solve the problem in which the number of parameters diverges with the sample size and the dimension is high. Zou and Zhang [6] proposed the adaptive elastic net, which combines the strengths of quadratic regularization and adaptively weighted lasso shrinkage. The adaptive elastic net is defined as follows:

β̂_AE = (1 + λ₂) argmin_β { ∑_{i=1}^{n} (y_i − x_i^T β)² + λ₁ ∑_{j=1}^{k} ŵ_j |β_j| + λ₂ ∑_{j=1}^{k} β_j² }. (15)

The adaptive elastic net reduces to the adaptive lasso when λ₂ is close to zero. The penalty function combines the elastic net and the adaptive lasso method, and the tuning parameters are then checked using Bayesian information criterion (BIC) cross-validation [14], a method for selecting an optimal value of the regularization parameter.
If we focus on classifying a dichotomous outcome variable, then the adaptive elastic net estimate of β can be constructed from (6) and (7) as

β̂_AE = argmax_β { ℓ(β) − λ₁ ∑_{j=1}^{k} ŵ_j |β_j| − λ₂ ∑_{j=1}^{k} β_j² }. (16)

The adaptive weight ŵ_j is obtained from the elastic net estimate in (13). The γ is the power order of the adaptive weight, treated in the same way as for the adaptive lasso.

Simulation Data and Results
In the following examples, we compare classification methods consisting of the lasso, adaptive lasso, elastic net, and adaptive elastic net, where the adaptive lasso and adaptive elastic net use higher orders of the adaptive weights. For the simulation, we generate data in the form of high dimensional data, in which the number of independent variables is higher than the sample size (n). The sets of independent variables are 20 (n = 15), 30 (n = 15, 20, 25), 40 (n = 20, 25, 30, 35), and 50 (n = 20, 25, 30, 35, 40) variables, generated from the normal distribution with mean µ and variance σ², denoted by N(µ, σ²). The variance is studied at the levels 1, 5, and 10, as presented in Figure 1.
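The simulation design above (k > n, normal covariates, binary response from the logit probability) can be sketched as follows; the paper's simulations are in R, and the true coefficient vector here is purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

k, n = 20, 15          # more variables than observations: high dimensional
mu, sigma2 = 0.0, 1.0  # the study varies the variance over 1, 5, and 10

X = rng.normal(mu, np.sqrt(sigma2), size=(n, k))
beta = rng.normal(0.0, 1.0, size=k)      # hypothetical true coefficients

eta = X @ beta
pi = 1.0 / (1.0 + np.exp(-eta))          # probabilities from the logit function
y = (rng.random(n) < pi).astype(int)     # binary dependent variable
```

One such draw corresponds to a single replication; the study repeats this 500 times per (k, n, σ²) setting.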
The outcome y_i is coded as 0 or 1: the probability π(x_i) ≥ 0.5 gives y_i = 1, and the probability π(x_i) < 0.5 gives y_i = 0. After that, we obtain the estimator β̂ from the previous section, and the probability can be approximated by

π̂(x_i) = exp(x_i^T β̂) / (1 + exp(x_i^T β̂)).

This process can be seen in Figure 2. The performance of the predictive analytics is measured with the confusion matrix, which compares the predicted values with the actual data in Table 1.
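The 0.5 cutoff rule for turning estimated probabilities into class labels is a one-liner; a sketch with illustrative probabilities:

```python
import numpy as np

def classify(pi_hat, cutoff=0.5):
    """y_hat = 1 when pi_hat >= cutoff, else y_hat = 0."""
    return (pi_hat >= cutoff).astype(int)

pi_hat = np.array([0.10, 0.50, 0.49, 0.93])  # illustrative fitted probabilities
y_hat = classify(pi_hat)                      # -> [0, 1, 0, 1]
```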

Table 1. The confusion matrix of actual data (y_i) and predicted data (ŷ_i).

                    Predicted ŷ_i = 1       Predicted ŷ_i = 0
Actual y_i = 1      true positive (TP)      false negative (FN)
Actual y_i = 0      false positive (FP)     true negative (TN)

The confusion matrix is used when there are two or more classes as the output of the classifier. The predicted accuracy is computed from Table 1 as

Accuracy (%) = (TP + TN) / (TP + TN + FP + FN) × 100. (17)

The average percentage of accuracy and the number of selected variables of the lasso, adaptive lasso, elastic net, and adaptive elastic net under 20, 30, 40, and 50 variables are presented in Tables 2-5; both are computed as means over 500 replications. For the adaptive lasso and adaptive elastic net, the higher order is denoted by γ. For this process, we use the R program to simulate the data, together with the R packages glmnet and HDeconometrics, which support estimating the parameters of these methods. The code is presented in Appendix A.

From Tables 2-5, it can be seen that the adaptive elastic net shows the maximum average percentage of accuracy at small dispersion when the order is set to 2. When the dispersion is large, the adaptive lasso performs well depending on the high order, and the average percentage of accuracy of this method is closely related to the high variance. The average percentage of accuracy also increases as the number of independent variables increases.
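The accuracy criterion in (17) follows directly from the confusion matrix cells; a pure-NumPy sketch with illustrative labels:

```python
import numpy as np

def accuracy_pct(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN) * 100, from the confusion matrix."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
acc = accuracy_pct(y_true, y_pred)   # 3 of 5 correct -> 60.0
```

Averaging this quantity over the 500 replications gives the average percentage of accuracy reported in the tables.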
Concentrating on the average number of selected variables, increasing the variance has no effect on the number of selected variables. As the sample size increases, the average number of selected variables also increases. Furthermore, increasing the order of the adaptive lasso and adaptive elastic net makes only a slight difference in the number of selected variables.

Application in Real Data
To evaluate our proposed methods, gene expression monitoring (via DNA microarray) is used to classify 72 patients with acute myeloid leukemia and acute lymphoblastic leukemia. These data consist of 3571 genes from bone marrow samples and are described in detail by Golub et al. [15]. A subset of the 3571 genes forms the independent variables, sampled so that the number of independent variables is larger than the sample size (n), with 500 replications. The sets of independent variables are 20 (n = 15), 30 (n = 15, 20, 25), 40 (n = 20, 25, 30, 35), and 50 (n = 20, 25, 30, 35, 40), similar to the simulation data. The samples are selected from the 72 patients by simple random sampling. An example of the leukemia genes is shown in Figure 3, where the descriptive statistics can be seen in the box plot diagram. The multicollinearity of the genes is presented in Figure 4 in terms of the correlation matrix among 40 sample genes. The average percentage of accuracy and the average number of selected genes of the lasso, adaptive lasso, elastic net, and adaptive elastic net under 20, 30, 40, and 50 leukemia genes can be found in Tables 6-10.

In Figure 3, the bold line in each box plot marks the middle value of the dataset, the median, and the whiskers indicate the variability outside the first and third quartiles. The dotted lines drawn horizontally from the box represent the minimum and maximum of the data, and outliers are plotted as individual points in x1, x5, x10, x15, x25, x35, x55, x65, and x75. It can be seen that the medians are close to zero and the variances are close to one.

The correlation matrix is shown in different shades: a light shade presents a weak correlation, and a dark shade a strong correlation. From Figure 4, it can be seen that the weakly correlated (light) cells outnumber the dark ones, meaning the sample dataset has only slight correlation.
However, all genes are used in computing the parameters, and the slight correlation reduces the multicollinearity problem.
As can be seen from Tables 6-10, the higher-order adaptive elastic net achieves the maximum average percentage of accuracy for 50 leukemia genes. Furthermore, it is clear from the results that the adaptive elastic net outperforms the lasso and adaptive lasso in terms of classification accuracy on these datasets. Regarding variable selection, the elastic net selects more genes than the other methods for all datasets, whereas the adaptive lasso selects the fewest. The adaptive methods have the potential to select the genes that yield the highest accuracy. The classification of the type of leukemia patient can thus rely on some gene expressions only, which saves the time and budget required to collect large datasets. Other techniques for gene selection and classification have been proposed to detect the expression level of thousands of genes in a few experiments [16-18].

Discussion
From the simulated results in Tables 2-5, the factors influencing the average percentage of accuracy are the variance level of the independent variables, the sample size, and the order of the adaptive weights. An increase in the variance leads to a decrease in the average percentage of accuracy in most cases. For the sample size, if the sample size is increased, then the accuracy of all methods decreases in all cases. Moreover, an increase in the order causes an increase in the average percentage of accuracy for all numbers of variables.
For the results of gene expression monitoring in Tables 6-10, the sample size and the order of the adaptive weights have effects similar to those in the simulation data. Moreover, the adaptive elastic net shows the highest average percentage of accuracy. It is apparent that the DNA microarray real data exhibit small variance, as seen in Figure 1. To check this, we compute the mean of the 3571 genes, collected from the 72 patients, and present the histogram in Figure 5. A t-test is then used to confirm that the mean of all genes is equal to zero. From the hypothesis test, the p-value (0.0466) is higher than the significance level (0.01), so we fail to reject the null hypothesis H₀: µ = 0, and we conclude that the mean of the gene expression is equal to zero. The mean and variance are thus similar to the simulation dataset, with mean equal to zero and variance equal to one. Overall, it is clear that the adaptive elastic net gives good classification performance for the type of leukemia using some of the genes.
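The one-sample t-test of H₀: µ = 0 used for the gene means can be sketched without any statistics library; the data below are illustrative values, not the actual 3571 gene means:

```python
import numpy as np

def one_sample_t(x, mu0=0.0):
    """t = (x_bar - mu0) / (s / sqrt(n)) for H0: mu = mu0,
    with s the sample standard deviation (ddof = 1)."""
    n = len(x)
    return (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(n))

x = np.array([0.12, -0.20, 0.05, 0.00, 0.15, -0.09])  # illustrative gene means
t_stat = one_sample_t(x)
# |t| is far below the two-sided 1% critical value (about 4.03 for df = 5),
# so, as in the paper, we would fail to reject H0: mu = 0.
```

In practice one would compare the p-value of the statistic (e.g., via R's t.test) with the chosen significance level, as done in the text.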

Conclusions
We have applied the lasso, elastic net, adaptive lasso, and adaptive elastic net methods to estimate parameters for classification of binary data. In the empirical results on high dimensional data, the higher-order adaptive elastic net performs best in classification under small dispersion of the simulated data, whereas the higher-order adaptive lasso performs best under large dispersion. In terms of variable selection, these methods tend to reduce the large set of independent variables in order to perform well. As an application to actual data, gene expression data are used to classify the type of leukemia patients; the high dimensional setting arises here because the number of patients is smaller than the number of gene expressions. Simulations over the thousands of gene expressions are used to compute the percentage of accuracy, and the methods are also used to select the influential variables. The results show that the adaptive elastic net is effective in gene selection, depending on the dispersion of the data and the order of the adaptive weights. Therefore, we can conclude that the higher order is beneficial to classification.