A Robust Variable Selection Method for Sparse Online Regression via the Elastic Net Penalty

Variable selection has been a hot topic, with popular methods including the lasso, SCAD, and the elastic net. These penalized regression algorithms remain sensitive to noisy data. Furthermore, "concept drift" fundamentally distinguishes streaming-data learning from batch learning. This article presents a method for noise-resistant regularization and variable selection in noisy data streams with multicollinearity, dubbed the canal-adaptive elastic net, which is similar to the elastic net and encourages grouping effects. In comparison to the lasso, the canal-adaptive elastic net is especially advantageous when the number of predictors (p) is significantly larger than the number of observations (n) and the data are multicollinear. Numerous simulation experiments have confirmed that the canal-adaptive elastic net attains higher prediction accuracy than the lasso, ridge regression, and the elastic net on data with multicollinearity and noise.


Introduction
Most traditional algorithms are built on closed-world assumptions and use fixed training and test sets, which makes it difficult to cope with changeable scenarios, including the streaming data issue. However, most data in practical applications are provided as data streams. One of their common characteristics is that the data will continue to grow over time, and the uncertainty introduced by the new data will influence the original model. As a result, learning from streaming data has become more essential [1][2][3] in machine learning and data mining communities. In this article, we employ the online gradient descent (OGD) framework proposed by Zinkevich [4]. It is a real-time, streaming online technique that updates the model on top of the trained model once per piece of data, making the model time-sensitive. In this article, we will provide a novel noise-resistant variable selection approach for handling noisy data streams with multicollinearity.
Since the 1960s, the variable selection issue has generated much research literature. Since Hirotugu Akaike [5] introduced the AIC criterion, variable selection techniques have advanced, including classic methods such as subset selection and coefficient shrinkage [6]. Variable selection methods based on penalty functions were developed to improve computational efficiency and accuracy. Consider, as an illustration, a multivariate linear model with parameter vector β = [β_0, β_1, ..., β_p]^T, whose parameters are estimated by methods such as OLS or maximum likelihood. A penalty function that balances the complexity of the model is added to construct a new penalized objective function, which is then optimized (maximized or minimized) to obtain parameter estimates. Its general framework is

β̂ = arg min_β { R(β) + P_λ(|β|) },

where R(β) is a loss function and P_λ(|β|) is a penalty function. This strategy enables the S/E (Selection/Estimation) phases of subset selection to be performed concurrently by compressing some of the coefficients to zero, significantly lowering computing time and minimizing the chance of the subset selection method becoming unstable. The most often employed of these are bridge regression [7], ridge regression [8], and the lasso [9], with bridge regression having the penalty function

P_λ(|β|) = λ Σ_{j=1}^p |β_j|^q,

where λ is an adjustment parameter. Since ridge regression introduces the ℓ2-norm, it has a more stable regression effect and outperforms OLS in prediction. The lasso method, being an ordered, continuous procedure, offers low computing effort, quick calculation, continuity of parameter estimates, and adaptability to high-dimensional data. However, the lasso has several inherent disadvantages, one being the absence of the Oracle property [10]. The adaptive lasso approach was proposed by Zou [11].
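As an illustration of the penalized framework above, the lasso objective can be minimized with a simple proximal gradient (ISTA) loop. This is our own minimal sketch for exposition, not an algorithm used in this article; the function names are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 penalty: shrinks each coordinate toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - X beta||^2 / n + lam*||beta||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n          # gradient of the squared-error loss
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

The soft-thresholding step is exactly the mechanism described in the text: insignificant coefficients are compressed to zero while large estimates receive only a light shrinkage.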
Similar to the ADS (adaptive Dantzig selector) [12] for the DS (Dantzig selector) [13], the adaptive lasso is an improvement on the lasso method with the same level of coefficient compression, and it has the Oracle properties [11]. According to Zou [11], the greater the least squares estimate of a variable, the more probable it is to be a variable in the true model; hence, its penalty should be reduced. The adaptive lasso's penalty function is specified as

P_λ(|β|) = λ Σ_{j=1}^p |β_j| / |β̂_j|^θ,

where λ and θ are adjustment parameters and β̂ is an initial (e.g., least squares) estimate.
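The adaptive lasso's data-driven weights can be sketched as follows; the helper name and the small stabilizing constant are our own assumptions.

```python
import numpy as np

def adaptive_lasso_weights(X, y, theta=1.0):
    """Adaptive lasso weights w_j = 1 / |beta_init_j|^theta: variables with
    large initial estimates receive smaller penalties (hypothetical helper)."""
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]   # initial OLS estimate
    return 1.0 / (np.abs(beta_init) ** theta + 1e-12)  # guard against division by zero
```

The weights are then used to rescale the ℓ1 penalty on each coefficient, so strong signals are shrunk less than weak ones.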
With several sets of explanatory variables known to be strongly correlated, the lasso method is powerless if the researcher wants to keep or remove a certain group of variables. Therefore, Yuan and Lin [14] proposed the group lasso method in 2006. The basic idea is to assume that there are J groups of strongly correlated variables, namely G_1, ..., G_J, with p_1, ..., p_J variables in each group, and β_{G_j} = (β_j)_{j∈G_j} the corresponding sub-vector of coefficients. The penalty function for the group lasso method is

P_λ(β) = λ Σ_{j=1}^J ||β_{G_j}||_{K_j},

where ||·||_{K_j} is the elliptic norm determined by the positive definite matrix K_j. Chesneau and Hebiri [15] proposed the grouped variables lasso method and investigated its theory in 2008; they proved that its bound is better in some situations than the one achieved by the lasso and the Dantzig selector. The grouped variables lasso exploits the sparsity of the model more effectively. Percival [16] developed the overlapping group lasso approach, demonstrating that permitting overlap does not remove many of the theoretical features and benefits of the lasso and group lasso. This method can encode various structures as collections of groups, extending the group lasso method. Li, Nan, and Zhu [17] proposed the MSGLasso (Multivariate Sparse Group Lasso) method, which can effectively remove unimportant groups and unimportant individual coefficients within important groups, especially for the p ≫ n problem, and can flexibly handle a variety of complex group structures, such as overlapping, nested, or multi-level hierarchies.
The prediction accuracy of the lasso drops drastically when confronted with multicollinear data. A novel regularization approach dubbed the elastic net [18] has been presented to address these issues. Elastic net estimation may be conceived as a combination of lasso [9] and ridge regression [8] estimation. Compared to the lasso, the elastic net approach performs better on p ≫ n data with many collinearities between variables.
However, a basic elastic net is incapable of handling noisy data. To address the difficulties above, we propose the canal-adaptive elastic net method in this article. This technique offers four significant advantages:
1. The model is efficient at handling streaming data. The proposed canal-adaptive elastic net dynamically updates the regression coefficients β of the regularized linear model in real time: each time a batch of data arrives, the OGD framework updates the existing model, so streaming data are handled more effectively.
2. The model has a sparse representation. As illustrated in Figure 1, only a small subset of samples, those with residuals in the ranges (−ε − δ, −ε) and (ε, ε + δ), is used to adjust the regression parameters. As a result, the model scales well and computing costs are reduced.
3. The improved loss function confers a significant level of noise resistance on the model. By dynamically modifying the δ parameter, noisy data with absolute errors larger than the threshold ε + δ are recognized and excluded from updating the regression coefficients.
4. Both the ℓ1-norm and the ℓ2-norm are employed, so the p ≫ n scenario is handled more effectively. The method performs automatic variable selection and continuous shrinkage simultaneously, can select groups of related variables, and overcomes the effects of multicollinearity in the data.
The rest of this paper is structured as follows. Section 2 reviews studies on variable selection, noise-tolerant loss functions, data multicollinearity, and streaming data. Section 3 summarizes previous work on the penalized objective function and then introduces the noise-resistant online learning technique for linear regression. In Section 4, we conduct numerical simulations and tests on benchmark datasets to compare the canal-adaptive elastic net presented in this article with the lasso, ridge regression, and the elastic net. Finally, Section 5 presents a concise discussion to conclude the paper.

Related Works
Variable selection has always been an important issue in building regression models. It has been one of the hot topics in statistical research since it was proposed in the 1960s, generating much literature on variable selection methods. For example, the Japanese scholar Akaike [5] proposed the AIC criterion based on the maximum likelihood method, which can be used both for selecting independent variables and for setting the order of autoregressive models in time series analysis. Schwarz [19] proposed the BIC criterion based on the Bayes method; compared to AIC, BIC strengthens the penalty and is thus more cautious in selecting variables into the model. All the above methods achieve variable selection through a two-step S/E (Selection/Estimation) process: first, a subset of variables is selected from the existing sample according to a criterion (Selection); then the unknown coefficients are estimated from the sample (Estimation). Because the correct variables are unknown in advance, the S-step is biased, which increases the risk of the E-step. To overcome this drawback, Seymour Geisser [20] proposed cross-validation. Later, variable selection methods based on penalty functions emerged. Tibshirani [9] proposed the LASSO (Least Absolute Shrinkage and Selection Operator), inspired by the NG (Nonnegative Garrote) method; the lasso avoids the NG method's drawback of over-reliance on the original least squares estimates. Fan and Li [21] pointed out that the lasso does not possess the Oracle property; they thus proposed a new variable selection method, SCAD (Smoothly Clipped Absolute Deviation), and proved that it has the Oracle property. Zou [11] proposed the adaptive lasso method based on the lasso.
The variable selection methods with structured penalties (e.g., features are dependent and/or there are group structures between features) have become more popular because of the ever-increasing need to handle complex data, such as elastic net and group lasso [14].
While investigating noise-resistant loss functions, we became interested in the truncated loss function. Truncated losses produce learning models that are robust to substantial quantities of noise. Xu et al. [22] demonstrated that truncation can tolerate much higher noise while retaining consistency than no truncation. Robust variable selection is a recent concept that incorporates robust losses from the robust statistics area into the model, yielding models that perform well empirically in noisy situations [23][24][25].
The concept of multicollinearity refers to linear relationships among the independent variables in multiple regression analysis. Multicollinearity occurs when the regression model incorporates variables that are highly correlated not only with the dependent variable but also with each other [26]. Some research has explored the challenges associated with multicollinearity in regression models, emphasizing that its primary consequences are unstable and biased standard errors and unworkable interpretations of the results [27,28]. There are many strategies for handling multicollinearity, one of which is ridge regression [29,30].
Many studies have been conducted over the last few decades on inductive learning approaches such as the lasso [9], artificial neural networks [31,32], and support vector regression [33], among others. These methods have been applied successfully to a variety of real-world problems. However, their usual implementation requires the simultaneous availability of all training data [34], making them unsuitable for large-scale data mining applications and streaming data mining tasks [35,36]. Compared to the traditional batch learning framework, the online learning algorithm (shown in Figure 2) is another framework, which learns from samples in a streaming fashion and has the advantages of scalability and real-time operation. In recent years, great attention has been paid to developing online learning methods in the machine learning community, such as online ridge regression [37,38], adaptive regularization for the lasso [39], projection [40], and the bounded online gradient descent algorithm [41].
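The online learning setting described above can be sketched as a minimal OGD loop for linear regression under squared loss. This is our own illustration; the function name and the decaying step-size schedule (chosen to satisfy the usual constraints Σ η_t = ∞ and Σ η_t² < ∞) are assumptions, not the paper's algorithm.

```python
import numpy as np

def ogd_linear(stream, p, eta0=0.2):
    """Online gradient descent for linear regression under squared loss.
    `stream` yields (x_t, y_t) pairs one at a time; the model is updated per sample."""
    beta = np.zeros(p)
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / t ** 0.6           # decaying step size: sum eta_t = inf, sum eta_t^2 < inf
        grad = (x @ beta - y) * x       # gradient of 0.5 * (x . beta - y)^2
        beta -= eta * grad
    return beta
```

Each sample is seen once and discarded, which is what gives online methods their scalability and real-time character compared with batch learners.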

Method
Most currently available online regression algorithms learn from clean data. Because of flaws in the human labeling process and sensor failures, noisy data are unavoidable and damaging. In this section, we propose a noise-tolerant online learning algorithm for the linear regression of streaming data. We employ a noise-resilient loss function, dubbed the canal loss, for regression; it is based on the well-known ε-insensitive loss and inspired by the ramp loss designed for classification problems. In addition, we use a novel method to dynamically adjust the canal loss parameters ε and δ.

Canal-Adaptive Elastic Net
For a given set of n observations {(x_i, y_i)}_{i=1}^n, y_i ∈ R, we consider a simple linear regression model

y = Xβ + e,

where y = (y_1, y_2, ..., y_n)^T is the response, X = [x_1, x_2, ..., x_p] is the model's full-column-rank design matrix, x_j = (x_1j, x_2j, ..., x_nj)^T, j = 1, 2, ..., p, is the n-dimensional explanatory variable, β = (β_1, β_2, ..., β_p)^T is the associated vector of regression coefficients, and e = (e_1, ..., e_n)^T are i.i.d. random errors with mean 0. Without loss of generality, we assume the response is centered and the predictors are standardized after a location and scale transformation. However, if X is not of full column rank, or if the linear correlation between some columns is strong, the determinant of X^T X is close to 0, i.e., X^T X is nearly singular, and the traditional OLS method lacks stability and reliability. To solve this problem, Hoerl and Kennard [8] proposed ridge regression:

β̂(ridge) = arg min_β { ||y − Xβ||² + λ ||β||² }.

The penalty technique improves OLS by transforming an ill-posed problem into a well-posed one; it sacrifices the unbiasedness of OLS in exchange for greater numerical stability and higher computational accuracy. Although ridge regression can effectively overcome high correlation between variables and improve prediction accuracy, model selection cannot be performed with ridge regression alone. Therefore, Tibshirani [9] proposed the primary lasso criterion:

β̂(lasso) = arg min_β { ||y − Xβ||² + λ Σ_{j=1}^p |β_j| },

where λ > 0 is a fixed adjustment parameter. The lasso is a penalized ordinary least squares method. Owing to the singularity of the penalty function's derivative at zero, the coefficients of insignificant variables are compressed to zero, while significant independent variables with larger estimates receive lighter compression, ensuring the accuracy of the parameter estimates.
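For illustration, the stabilizing effect of the ridge penalty on a near-singular X^T X can be sketched with the closed-form ridge solution. This is a hypothetical helper of our own, not the paper's implementation.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression in closed form: adding lam * I to X^T X keeps the
    normal equations well-conditioned even when columns of X are nearly collinear."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With two almost identical columns, OLS coefficients can blow up in opposite directions, while the ridge solution stays bounded and still predicts well.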
However, the lasso also has some inherent drawbacks: it does not have the Oracle property; it selects at most n variables when the data have p > n; when numerous features are interrelated, the lasso selects only one of them; and the lasso is less effective than ridge regression when handling independent variables with multicollinearity. Therefore, Zou and Hastie proposed the elastic net:

β̂(en) = arg min_β { ||y − Xβ||² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^p β_j² }.

The elastic net uses both the ℓ1-norm and the ℓ2-norm as a priori regularization terms in the linear regression model. It combines the advantages of the lasso and ridge regression, and it is a method for solving group variable selection when the variable grouping is unknown. Compared with the lasso, the elastic net also handles data with p > n and data with multicollinearity among the variables better. Unfortunately, because of its loss function's shortcomings, the elastic net cannot erase the effects of noisy data.
To obtain a noise-resilient elastic net-type estimator, we start from the classical ε-insensitive loss function l_ε(z) = max{0, |z| − ε} and propose the canal loss with noise-resilience parameter δ:

l_{ε,δ}(z) = min{ δ, max{0, |z| − ε} },

where ε > 0 and δ > 0 are threshold tuning parameters. The canal loss function's upper bound is a constant, δ, which considerably reduces the negative influence of outliers and makes it a noise-resistant loss function. Exploiting the advantages of the canal loss, we modify the elastic net and propose the canal-adaptive elastic net as a new method, defined as

β̂ = arg min_β { Σ_{t=1}^n l_{ε,δ}(y_t − x_t^T β) + λ_1 ||β||_1 + λ_2 ||β||_2² }.   (1)

The canal loss approximates the absolute loss as ε → 0 and δ → +∞, i.e., lim_{ε→0, δ→+∞} l_{ε,δ}(z) = |z|. The proposed canal-adaptive elastic net is expected to be robust to outliers and to have the property of sparse representation.
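Under the definition above (the ε-insensitive loss capped at δ), the canal loss and its subgradient can be sketched as follows; the function names are ours.

```python
import numpy as np

def canal_loss(z, eps, delta):
    """Canal loss: the eps-insensitive loss capped at delta, so residuals with
    |z| > eps + delta contribute a constant loss (and hence a zero gradient)."""
    return np.minimum(np.maximum(np.abs(z) - eps, 0.0), delta)

def canal_grad(z, eps, delta):
    """(Sub)gradient of the canal loss w.r.t. z: nonzero only on the two
    'canal walls' eps < |z| < eps + delta."""
    active = (np.abs(z) > eps) & (np.abs(z) < eps + delta)
    return np.sign(z) * active
```

The zero gradient outside the walls is what makes the estimator both sparse in samples (small residuals are ignored) and noise-resistant (gross outliers are ignored).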

Online Learning Algorithm for Canal-Adaptive Elastic Net
We employ the online gradient descent (OGD) algorithm and present a minimization strategy to solve the canal-adaptive elastic net model efficiently. First, the literature offers numerous methods for estimating the regularization parameters, including cross-validation, AIC, and BIC. We minimize a BIC-type objective function to select the regularization parameters, which makes the calculation quicker and ensures consistency in variable selection. Second, although Equation (1) is not a convex optimization problem, it can be restated as a difference-of-convex (DC) program and solved using the Concave-Convex Procedure (CCCP). However, because CCCP is a batch learning algorithm, it does not meet real-time processing requirements when handling streaming data. We instead use the well-known OGD framework to arrive at a near-optimal solution; this is a compromise between accuracy and scalability. To minimize Equation (1) by OGD, we reformulate it as the per-sample objective

J_t(β) = l_{ε,δ}(y_t − x_t^T β) + λ_1 ||β||_1 + λ_2 ||β||_2²,

and then, following the basic structure of the OGD algorithm, update

β^(t) = β^(t−1) − η_t ∇_β J_t(β)|_{β=β^(t−1)},   (2)

where η_t is the step size at step t, satisfying ∑_t η_t² < ∞ and ∑_t η_t = ∞ as n → ∞ [42]. Unlike the exact computation of the full gradient of L(λ_1, λ_2, β), the notation ∇_β J_t(β)|_{β=β^(t−1)} denotes the derivative of J_t(β) evaluated at β = β^(t−1). With z_t = y_t − x_t^T β^(t−1), the loss part of this gradient is −sign(z_t) x_t when ε < |z_t| < ε + δ and 0 otherwise; substituting this gradient, Equation (3), into Equation (2) yields the update rule (4). Finally, as shown in Equation (3), the proposed canal-adaptive elastic net contains a sparsity parameter ε ≥ 0 and a noise-resilience parameter δ ≥ 0. The parameter ε determines the sparsity of the proposed model, whereas δ indicates the level of noise resilience. Proposing a strategy for adjusting the canal loss parameters, so that ε and δ are iterated automatically, is a pressing issue.
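Putting the pieces together, the per-sample OGD update for the canal-adaptive elastic net can be sketched as follows. This is our own reconstruction from the description of Equations (1)-(4); the function name, step-size schedule, and discard counter are assumptions.

```python
import numpy as np

def canal_elastic_net_ogd(stream, p, lam1, lam2, eps, delta, eta0=0.2):
    """One-pass OGD sketch for the canal-adaptive elastic net: per-sample
    objective  J_t(beta) = canal_loss(y_t - x_t . beta) + lam1*||beta||_1 + lam2*||beta||_2^2."""
    beta = np.zeros(p)
    discarded = 0
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / t ** 0.6                    # sum eta_t = inf, sum eta_t^2 < inf
        z = y - x @ beta                         # residual under the current model
        if eps < abs(z) < eps + delta:           # only 'canal wall' samples carry gradient
            loss_grad = -np.sign(z) * x
        else:
            loss_grad = np.zeros(p)
            if abs(z) >= eps + delta:
                discarded += 1                   # treated as noise: the loss is flat here
        grad = loss_grad + lam1 * np.sign(beta) + 2.0 * lam2 * beta
        beta -= eta * grad
    return beta, discarded
```

Samples with residuals beyond ε + δ leave the coefficients untouched, which is the mechanism behind the discard counts reported in the experiments.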
In this study, we parameterize ε and δ in terms of ζ and γ, so that adjusting the ε and δ parameters is equivalent to adjusting ζ and γ. When γ is set to 0, the algorithm does not learn from any example in {(x_t, y_t)}_{t=1}^n and instead updates β according to the regularization term alone. If γ is sufficiently large, our canal-adaptive elastic net will withstand noisy data. The proposed canal-adaptive elastic net algorithm is summarized as Algorithm 1.

Experiments
In this part, we perform experiments to evaluate how the canal-adaptive elastic net algorithm performs. First, simulation studies on synthetic data with multicollinearity and noise are used to verify the method's efficiency. Second, the model's resistance to noise and its variable selection accuracy are evaluated using data sets with different noise proportions. Finally, we run thorough tests to assess the proposed algorithm's performance on four benchmark prediction tasks. The benchmark datasets used in the experiments are available from the UCI Machine Learning Repository and the LIBSVM website.

Simulation Settings
We evaluate the proposed noise-resilient online regression algorithm on synthetic data sets with noise and multicollinearity. We examine the proposed canal-adaptive elastic net method's effectiveness in handling noisy and multi-collinear input and output data. In addition, we evaluated the canal-adaptive elastic net method's performance in simulation trials with and without multicollinearity datasets. The simulation experiment is described in detail below.

The Case of Both Multicollinearity and Noise
This experiment indicates that the canal-adaptive elastic net on streaming data outperforms the lasso, ridge regression, and the elastic net in handling multicollinear data and is a more suitable variable selection procedure for handling noisy data than the other three methods.
We simulate 200 observations in each example and set the feature dimension d as 10. We let β j = 0.85 for all j. The correlation coefficients between x i and x j were greater than 0.8 in their absolute values. We trained the model on 70% of the data and then tested it on 30% of the data. We conducted 20 randomized trials and determined the MAE, RMSE, the number of discards, discard rate, and average computation time of the model on the test data set using varied noise proportions for x and y.
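The MAE and RMSE metrics reported throughout the experiments are the standard definitions; for completeness, a minimal sketch (function names are ours):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```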
We generated the simulated data from the true model, where ρ = 3 and the error term is generated from the normal distribution N(0, 1). For a given t, the covariate x_t is constructed using a standard d-dimensional multivariate normal distribution, which ensures that the components of x_t are independent and standard normal. Here, we vary the noise ratio σ over {0, 0.1, 0.2, 0.3}. To be more precise, we randomly select samples {x_t, y_t} at the ratio σ, set the 6th explanatory variable of x_t to 0 in the training set, and then evaluate the learning model on real test datasets. Table 1 contains the results. Furthermore, to find out the effect of a noisy response variable y, we randomly altered the response variable y to 0 in the training set at a rate of σ and then tested the learning model with the real test sample. Table 2 summarizes the corresponding findings.
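A hypothetical reconstruction of this data-generating process; the function name, seed handling, and the zero-based index of the corrupted feature are our own assumptions.

```python
import numpy as np

def make_noisy_stream(n, d, beta, rho, sigma, noisy_feature=5, seed=0):
    """Standard-normal covariates, linear response with N(0,1) errors scaled
    by rho, and a fraction sigma of samples whose chosen explanatory variable
    is zeroed out (the corruption applied to the training set)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = X @ beta + rho * rng.standard_normal(n)
    corrupt = rng.random(n) < sigma       # noise ratio sigma
    X[corrupt, noisy_feature] = 0.0       # corrupt the 6th explanatory variable
    return X, y
```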
For comparison, the compared models (lasso, ridge regression, elastic net, and canal-adaptive elastic net) are all solved using the online gradient descent (OGD) method. In the simulated experiments, we set the two hyper-parameters to ε = 0.1 and δ = 2.0. First, we show that the canal-adaptive elastic net can resist interference in the explanatory variable x. Analysis of the RMSE and MAE indicates that the lasso deviates significantly from the true value, whereas the canal-adaptive elastic net and ridge regression perform admirably. The lasso is sensitive to multicollinearity in the data because of its nature. In the presence of noise, the proposed canal-adaptive elastic net method outperforms the other three competing methods; in particular, it significantly outperforms the lasso, ridge regression, and the elastic net at a high noise level (σ = 0.3). The elastic net's loss function can reduce the impact of noisy data only to some extent; the negative impact of noisy data remains serious. Figure 3a illustrates the predictive performance of the different algorithms. It can be observed that when data are multicollinear and noisy, the canal-adaptive elastic net outperforms the lasso, ridge regression, and the elastic net. This shows that the canal-adaptive elastic net is a method capable of overcoming multicollinearity and resisting noise. Each coefficient may be affected when noisy data are present in the response variable y. In the presence of multicollinearity in the data, the proposed canal-adaptive elastic net significantly outperforms the lasso, ridge regression, and the elastic net. By its nature, the lasso does not predict data containing multicollinearity very well; its estimate of β deviates farther from the true coefficients than those of the other three methods. Compared with the lasso, ridge regression and the elastic net effectively overcome the effects of multicollinear data.
However, their performance suffers when a certain level of noise is introduced.
The prediction performance of the different models is provided in Figure 3b for a more detailed comparison of the models. Canal-adaptive elastic net outperforms the other three approaches in the presence of noisy data and data with multicollinearity. The results show that the canal-adaptive elastic net is a successful technique for overcoming multicollinearity and handling noisy data when the response variable y contains considerable noise.

The Case of Noise
In this subsection, we present simulation experiments comparing the performance of the canal-adaptive elastic net with three competing approaches (lasso, ridge regression, and elastic net) on streams of noisy data with sample sizes n = 5000 and 10,000. This experiment explores the performance of the four methods for group variable selection with unknown variable grouping; because of its nature, the group lasso cannot be included in this experiment. In addition, we set β to (1, −2, 3, −4, 5, −6, 0, 0, . . . , 0), where the feature dimension d is 50; the first six regression coefficients are significant, whereas the remaining 44 are insignificant. The covariate x_t is created for a given t using a standard d-dimensional multivariate normal distribution, which ensures that the components of x_t are independent and standard normal. The response variables are generated according to Equation (6), where ρ = 0.5 and the error terms are generated from the normal distribution N(0, 1).
Also, to investigate the effects of noise in the response variable y and the explanatory variable x, we applied a certain percentage of noise to the training dataset in the same way as in Section 4.1.1 and then tested the learning model with real test samples. Tables 3 and 4 report the related results. For each parameter setting, 20 random experiments were conducted to evaluate the average performance on datasets with sample sizes n equal to 5000 and 10,000, respectively. For comparison, the lasso, ridge regression, elastic net, and canal-adaptive elastic net models were solved using the online gradient descent (OGD) method. The performance of these approaches is then compared by determining the MAE, RMSE, the number of discards, the discard rate, and the average computing time for the models on the test data. We pre-set the parameters to ε = 0.01 and δ = 0.8.
To begin, we show that the canal-adaptive elastic net is unaffected by noise in the explanatory variable x. All four methods perform well at a noise rate of σ = 0. As illustrated in Table 3, the performance of the lasso, elastic net, and ridge regression under noisy data deviates significantly from the true value. In particular, when σ = 0.2 or 0.3, the canal-adaptive elastic net significantly outperforms the other three competing approaches. Because of the shortcomings of their loss functions, the lasso, elastic net, and ridge regression are susceptible to noisy data. Figure 4 provides a complete comparison of the prediction performance of the algorithms: in the presence of noisy input data, the canal-adaptive elastic net beats the lasso, ridge regression, and the elastic net on average. Each coefficient may be affected if the response variable y contains noisy data. As illustrated in Table 4, the proposed canal-adaptive elastic net method outperforms the other three competing methods when dealing with noisy data. Because of the least-squares deviation, the lasso, ridge regression, and the elastic net are highly sensitive to noise. To compare the models more comprehensively, the prediction performance of the different models is presented in Figure 5. It can be observed that the proposed canal-adaptive elastic net method significantly outperforms the other three competing methods in all aspects of the noisy-output case. As seen in Tables 5 and 6, the canal-adaptive elastic net generates sparse solutions and behaves similarly to the "Oracle". The additional "grouping effect" capability makes elastic net-type methods better variable selection methods than lasso-type methods.

Benchmark Data Sets
We undertake thorough tests in this section to evaluate the proposed canal-adaptive elastic net algorithm's performance on real-world tasks. Four benchmark datasets were employed for experimental evaluation: "Kin", "Abalone", "Pendigits", and "Letters". The first two datasets are selected from the UCI datasets [43], and the last two from Chang and Lin [44]. Table 7 summarizes the four benchmark datasets. To demonstrate the statistical features of the various datasets, we created box plots, as illustrated in Figure 6. To simulate the streaming-data setup, we replicate the samples three times. Before conducting the experiments, domain experts need to analyze and specify the parameter sensitivities of our models; Table 8 displays the parameter settings for the four benchmark datasets. Each experiment is repeated 20 times randomly, and the average performance is recorded. On the benchmark datasets, Tables 9 and 10 summarize the RMSE, MAE, discarded samples, discard rate, and average running time of the four comparative methods: lasso, ridge regression, elastic net, and canal-adaptive elastic net. The regression accuracy (RMSE and MAE) results demonstrate that when the data are clean (σ = 0), the performance of the four comparison methods is comparable. However, on noisy data (σ ≥ 0.1), the proposed canal-adaptive elastic net significantly outperforms the other three approaches in terms of noise immunity. As seen in the seventh column, the discard rate increases as the noise σ grows. For a more comprehensive comparison, we give the average RMSE in Figures 7 and 8. We can see that the canal-adaptive elastic net proposed in this paper is the most stable on all four datasets.

Conclusions
This article presents a novel linear regression model called the canal-adaptive elastic net to address the challenge of online learning with noisy and multicollinear data. The canal-adaptive elastic net generates a sparse model with high prediction accuracy while promoting grouping. Additionally, the canal-adaptive elastic net is solved using an efficient approach based on an online gradient descent framework. The simulations and empirical results demonstrate the canal-adaptive elastic net's outstanding performance and superiority over the other three approaches (i.e., the lasso, ridge regression, and the elastic net). Future studies will focus on extending the linear regression model to a non-linear regression model through the use of the kernel technique [45].