Article

A Robust Variable Selection Method for Sparse Online Regression via the Elastic Net Penalty

Wentao Wang, Jiaxuan Liang, Rong Liu, Yunquan Song and Min Zhang
School of Science, China University of Petroleum, Qingdao 266580, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(16), 2985; https://doi.org/10.3390/math10162985
Submission received: 22 June 2022 / Revised: 12 August 2022 / Accepted: 16 August 2022 / Published: 18 August 2022
(This article belongs to the Special Issue Mathematical and Computational Statistics and Their Applications)

Abstract: Variable selection has been a hot topic, with popular methods including the lasso, SCAD, and elastic net; however, these penalized regression algorithms remain sensitive to noisy data. Furthermore, “concept drift” fundamentally distinguishes streaming-data learning from batch learning. This article presents a method for noise-resistant regularization and variable selection in noisy data streams with multicollinearity, dubbed the canal-adaptive elastic net, which is similar to the elastic net and encourages grouping effects. In comparison to the lasso, the canal-adaptive elastic net is especially advantageous when the number of predictors (p) is significantly larger than the number of observations (n) and the data are multi-collinear. Numerous simulation experiments confirm that the canal-adaptive elastic net has higher prediction accuracy than the lasso, ridge regression, and elastic net on data with multicollinearity and noise.
MSC:
62F12; 62G08; 62G20; 62J07T07

1. Introduction

Most traditional algorithms are built on closed-world assumptions and use fixed training and test sets, which makes it difficult to cope with changeable scenarios, including the streaming data issue. However, most data in practical applications are provided as data streams. One of their common characteristics is that the data will continue to grow over time, and the uncertainty introduced by the new data will influence the original model. As a result, learning from streaming data has become more essential [1,2,3] in machine learning and data mining communities. In this article, we employ the online gradient descent (OGD) framework proposed by Zinkevich [4]. It is a real-time, streaming online technique that updates the model on top of the trained model once per piece of data, making the model time-sensitive. In this article, we will provide a novel noise-resistant variable selection approach for handling noisy data streams with multicollinearity.
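As a brief, hedged illustration of the OGD framework used throughout this article (a generic sketch under our own naming conventions, not the authors' code), the loop below makes a prediction on each incoming example and then takes a single gradient step on that example's loss:

```python
import numpy as np

def online_gradient_descent(stream, grad, beta0, eta=lambda t: 0.1 / t):
    """Generic OGD loop in the spirit of Zinkevich [4]: the current model is used
    on each incoming example first, then updated with one gradient step on that
    example's loss. `stream` yields (x_t, y_t); `grad(beta, x, y)` returns the
    per-example gradient (both supplied by the caller)."""
    beta = np.asarray(beta0, dtype=float)
    preds = []
    for t, (x, y) in enumerate(stream, start=1):
        preds.append(float(np.dot(x, beta)))        # prediction made before the update
        beta = beta - eta(t) * grad(beta, x, y)     # single-example gradient step
    return beta, preds
```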
Since the 1960s, the variable selection problem has generated a large research literature. Since Hirotugu Akaike [5] introduced the AIC criterion, variable selection techniques have advanced, including classic methods such as subset selection and coefficient shrinkage [6]. Variable selection methods based on penalty functions were developed to improve computational efficiency and accuracy. Consider, as an illustration, the multiple linear model
$$y = \beta_0 + x_1\beta_1 + \cdots + x_p\beta_p + \epsilon,$$
where the parameter vector is $\beta = [\beta_0, \beta_1, \ldots, \beta_p]^T$ and the parameters are estimated by methods such as OLS and maximum likelihood. A penalty function that balances the complexity of the model is added to the loss to construct a new penalized objective function, which is then optimized (maximized or minimized) to obtain the parameter estimates. Its general framework is:
$$R(\beta) + P_\lambda(|\beta|),$$
where $R(\beta)$ is the loss function and $P_\lambda(|\beta|)$ is a penalty function. This strategy allows the S/E (Selection/Estimation) steps of the subset selection method to be carried out simultaneously by compressing some of the coefficients to zero, significantly lowering computing time and reducing the risk of instability in subset selection. The most often employed of these are bridge regression [7], ridge regression [8], and the lasso [9], with bridge regression having the following penalty function:
$$P_\lambda(|\beta|) = \lambda \sum_{j=1}^{p} |\beta_j|^m, \qquad \lambda, m > 0,$$
where $\lambda$ is an adjustment parameter. Since ridge regression introduces the $\ell_2$-norm, it yields a more stable regression and outperforms OLS in prediction. The lasso, an ordered, continuous procedure, offers the advantages of low computing effort, quick calculation, continuity of the parameter estimates, and adaptability to high-dimensional data. However, the lasso has several inherent disadvantages, one being the absence of the Oracle property [10]. The adaptive lasso approach was proposed by Zou [11]. Similar to the ADS (adaptive Dantzig selector) [12] for the DS (Dantzig selector) [13], the adaptive lasso is an improvement on the lasso with the same level of coefficient compression, and it possesses the Oracle property [11]. According to Zou [11], the greater the least squares estimate of a variable, the more probable it is to be a variable in the genuine model; hence, the penalty applied to it should be reduced. The adaptive lasso penalty function is specified as:
$$P_\lambda(|\beta|) = \lambda \sum_{j=1}^{p} \frac{1}{|\hat{\beta}_j|^{\theta}} |\beta_j|, \qquad \lambda, \theta > 0,$$
where λ and θ are adjustment parameters.
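A hedged sketch of this idea (our own illustration, not the authors' implementation): the adaptive lasso can be fitted by absorbing the data-driven weights into the design matrix and running an ordinary lasso solver. The pilot OLS estimate, `gamma`, and `lam` below are placeholder choices, and centered/standardized data are assumed.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, gamma=1.0, lam=0.1):
    """Adaptive lasso via column rescaling: penalizing |beta_j| with weight w_j
    is equivalent to an ordinary lasso on a design whose j-th column is divided by w_j."""
    pilot = LinearRegression().fit(X, y).coef_      # pilot estimate used to build the weights
    w = 1.0 / (np.abs(pilot) ** gamma + 1e-8)       # w_j = 1 / |beta_hat_j|^gamma
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X / w, y)
    return fit.coef_ / w                            # map coefficients back to the original scale
```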
When several groups of explanatory variables are known to be strongly correlated, the lasso is powerless if the researcher wants to keep or remove a particular group of variables as a whole. Therefore, Yuan and Lin [14] proposed the group lasso method in 2006. The basic idea is to assume that there are $J$ groups of strongly correlated variables, namely $G_1, \ldots, G_J$, with $p_1, \ldots, p_J$ variables in each group, and to write $\beta_{G_j} = (\beta_j)_{j \in G_j}$ for the corresponding sub-vector of coefficients. The penalty function of the group lasso method is:
$$P_\lambda(|\beta|) = \lambda \sum_{j=1}^{J} \|\beta_{G_j}\|_{K_j},$$
where $\|\beta_{G_j}\|_{K_j} = \big(\beta_{G_j}^T K_j \beta_{G_j}\big)^{1/2}$ is the elliptic norm determined by the positive definite matrix $K_j$.
Chesneau and Hebiri [15] proposed the grouped variables lasso method and investigated its theory in 2008, proving that its bound is better in some situations than the one achieved by the lasso and the Dantzig selector; the grouped variables lasso exploits the sparsity of the model more effectively. Percival [16] developed the overlapping group lasso approach, demonstrating that permitting overlap does not remove many of the theoretical features and benefits of the lasso and group lasso; this method can encode various structures as collections of groups, extending the group lasso. Li, Nan, and Zhu [17] proposed the MSGLasso (Multivariate Sparse Group Lasso) method, which can effectively remove unimportant groups and unimportant individual coefficients within important groups, especially for the $p \gg n$ problem, and can flexibly handle a variety of complex group structures, such as overlapping, nested, or multi-level hierarchies.
The prediction accuracy of the lasso drops drastically when confronted with multi-collinear data. A regularization approach dubbed the elastic net [18] has been proposed to address these issues. The elastic net estimator may be conceived as a combination of the lasso [9] and ridge regression [8] estimators. Compared to the lasso, the elastic net performs better on data with $p \gg n$ and several collinearities between variables. However, the basic elastic net is incapable of handling noisy data. To address the difficulties above, we propose the canal-adaptive elastic net method in this article. The technique offers four significant advantages:
  • The model is efficient at handling streaming data. The proposed canal-adaptive elastic net dynamically updates the regression coefficients $\beta$ of the regularized linear model in real time: each time a batch of data is fetched, the OGD framework updates the existing model, so streaming data can be handled more effectively.
  • The model has a sparse representation. As illustrated in Figure 1, only a tiny subset of samples, those with residuals in the ranges $(-\epsilon-\delta, -\epsilon)$ and $(\epsilon, \epsilon+\delta)$, are used to adjust the regression parameters. As a result, the model scales well and computing costs are reduced.
  • The improved loss function confers a significant level of noise resistance on the model. By dynamically modifying the $\delta$ parameter, noisy data with absolute errors larger than the threshold $\epsilon + \delta$ are recognized and excluded from being used to adjust the regression coefficients.
  • Both the $\ell_1$-norm and the $\ell_2$-norm are employed, so the model handles the $p \gg n$ scenario more effectively, performs automatic variable selection and continuous shrinkage simultaneously, can select groups of related variables, and overcomes the effects of multicollinearity in the data.
The rest of this paper is structured in the following manner. Section 2 reviews related work on variable selection, noise-tolerant loss functions, data multicollinearity, and streaming data. Section 3 summarizes previous work on the penalized objective function and then introduces the noise-resistant online learning technique for linear regression. In Section 4, we conduct numerical simulations and tests on benchmark datasets to compare the canal-adaptive elastic net presented in this research to the lasso, ridge regression, and elastic net. Finally, Section 5 presents a concise discussion to conclude the paper.

2. Related Works

Variable selection has always been an important issue in building regression models. It has been one of the hot topics in statistical research since it was proposed in the 1960s, generating a large literature on variable selection methods. For example, the Japanese scholar Akaike [5] proposed the AIC criterion based on the maximum likelihood method, which can be used both for selecting independent variables and for setting the order of autoregressive models in time series analysis. Schwarz [19] proposed the BIC criterion based on the Bayes method; compared to AIC, BIC strengthens the penalty and is thus more cautious in selecting variables into the model. All the above methods achieve variable selection through a two-step S/E (Selection/Estimation) process: first, a subset of variables is selected from the existing sample according to a criterion (Selection), and then the unknown coefficients are estimated from the sample (Estimation). Because the correct set of variables is unknown in advance, the S-step is biased, which increases the risk of the E-step. To overcome this drawback, Seymour Geisser [20] proposed cross-validation. Later on, variable selection methods based on penalty functions emerged. Tibshirani [9] proposed the LASSO (Least Absolute Shrinkage and Selection Operator), inspired by the NG (Nonnegative Garrote) method; the lasso avoids the NG method's over-reliance on the original least squares estimate. Fan and Li [21] pointed out that the lasso does not possess the Oracle property and thus proposed a new variable selection method, the SCAD (Smoothly Clipped Absolute Deviation) method, and proved that it has the Oracle property. Zou [11] proposed the adaptive lasso method based on the lasso. Variable selection methods with structured penalties (e.g., when features are dependent and/or there are group structures among features), such as the elastic net and group lasso [14], have become more popular because of the ever-increasing need to handle complex data.
While investigating noise-resistant loss functions, we became interested in truncated loss functions. Such losses yield learning models that are robust to substantial amounts of noise; Xu et al. [22] demonstrated that truncation can tolerate much higher noise levels while retaining consistency than no truncation. Robust variable selection is a related concept that incorporates robust losses from the robust statistics literature into the model, yielding models that perform well empirically in noisy situations [23,24,25].
Multicollinearity refers to linear relationships among the independent variables in multiple regression analysis. It occurs when the regression model incorporates variables that are highly correlated not only with the dependent variable but also with each other [26]. Several studies have explored the challenges associated with multicollinearity in regression models, emphasizing that its primary consequences are unstable and biased standard errors and unreliable interpretation of the results [27,28]. There are many strategies for handling multicollinearity, one of which is ridge regression [29,30].
Many studies have been conducted over the last few decades on inductive learning approaches such as the lasso [9], artificial neural networks [31,32], and support vector regression [33], among others. These methods have been applied successfully to a variety of real-world problems. However, their usual implementation requires the simultaneous availability of all training data [34], making them unsuitable for large-scale data mining applications and streaming data mining tasks [35,36]. Compared to the traditional batch learning framework, online learning (illustrated in Figure 2) is another framework that learns from samples in a streaming fashion and has the advantages of scalability and real-time operation. In recent years, great attention has been paid to developing online learning methods in the machine learning community, such as online ridge regression [37,38], adaptive regularization for lasso models [39], the projectron [40], and bounded online gradient descent algorithms [41].

3. Method

Most currently available online regression algorithms learn from clean data. Because of flaws in the human labeling process and sensor failures, noisy data are unavoidable and damaging. In this section, we propose a noise-tolerant online learning algorithm for the linear regression of streaming data. We employ a noise-resilient loss function for regression, dubbed the canal loss, based on the well-known $\epsilon$-insensitive loss and inspired by the ramp loss designed for classification problems. In addition, we use a novel method to dynamically adjust the canal loss parameters $\epsilon$ and $\delta$.

3.1. Canal-Adaptive Elastic Net

For a given set of $n$ observations $\{(x_i, y_i)\}_{i=1}^{n}$, $y_i \in \mathbb{R}$, we consider a simple linear regression model:
$$y = X\beta + \epsilon,$$
where $y = (y_1, y_2, \ldots, y_n)^T$ is the response, $X = (x_1, x_2, \ldots, x_p)$ is the model's column-full-rank design matrix with $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$, $j = 1, 2, \ldots, p$, the $n$-dimensional $j$-th explanatory variable, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the associated vector of regression coefficients, and $\epsilon$ is a vector of i.i.d. random errors with mean 0. Without loss of generality, we can assume that the response is centered and the predictors are standardized after a location and scale transformation,
$$\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad j = 1, 2, \ldots, p.$$
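For concreteness, a minimal sketch of this centering and standardization step (our own illustration; the function and variable names are ours):

```python
import numpy as np

def center_and_standardize(X, y):
    """Center the response and scale each predictor column to unit length so that
    sum_i y_i = 0, sum_i x_ij = 0 and sum_i x_ij^2 = 1 for every column j."""
    y_centered = y - y.mean()
    X_centered = X - X.mean(axis=0)
    X_scaled = X_centered / np.sqrt((X_centered ** 2).sum(axis=0))
    return X_scaled, y_centered
```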
However, if $X$ is not of full column rank, or if the linear correlation between some columns is significant, the determinant of $X^T X$ is close to 0, i.e., $X^T X$ is nearly singular, and the traditional OLS method lacks stability and reliability. To solve this problem, Hoerl and Kennard [8] proposed ridge regression:
$$\text{Ridge Regression:} \quad L(\beta) = \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$
The penalty technique improves on OLS by turning an ill-posed fitting problem into a well-posed one: it gives up the unbiasedness of OLS in exchange for higher numerical stability and obtains higher computational accuracy. Although ridge regression can effectively overcome high correlation between variables and improve prediction accuracy, model selection cannot be performed with ridge regression alone. Therefore, Tibshirani [9] proposed the original lasso criterion:
$$\text{Lasso:} \quad L(\beta) = \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$
where $\lambda > 0$ is a fixed adjustment parameter. The lasso is a penalized ordinary least squares method. Because the derivative of the penalty function is singular at zero, the coefficients of insignificant variables are compressed to zero, while significant independent variables with larger estimates receive lighter compression, which preserves the accuracy of the parameter estimates.
However, the lasso also has some inherent drawbacks: it does not have the Oracle property; when $p > n$, it selects at most $n$ variables; and when numerous characteristics are interrelated, it tends to select only one of them. The lasso is also less effective than ridge regression when handling independent variables with multicollinearity. Therefore, Zou and Hastie proposed the elastic net:
$$\text{Elastic Net:} \quad L(\lambda_1, \lambda_2, \beta) = \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j|.$$
The elastic net uses both the $\ell_1$-norm and the $\ell_2$-norm as a priori regularization terms in the linear regression model. It combines the advantages of the lasso and ridge regression and solves the group variable selection problem when the variable grouping is unknown. Compared with the lasso, the elastic net also handles data with $p > n$ and data with multicollinearity among the variables better. Unfortunately, because of the shortcomings of its loss function, the elastic net cannot erase the effects caused by noisy data.
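As a hedged illustration of how the elastic net keeps correlated predictors together (our own toy example; scikit-learn parameterizes the penalty through `alpha` and `l1_ratio` rather than separate $\lambda_1, \lambda_2$, and the values below are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 60, 200                                      # p >> n setting discussed above
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # two nearly collinear predictors
y = X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(enet.coef_[:2])    # the elastic net tends to keep both correlated predictors
print(lasso.coef_[:2])   # the lasso tends to pick only one of them
```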
To obtain a noise-resilient elastic net-type estimator, we start from the classical $\epsilon$-insensitive loss function $l_\epsilon(z) = \max(0, |z| - \epsilon)$ and propose the canal loss with noise-resilience parameter $\delta$:
$$l_\epsilon^\delta(z_i) = \min\big(\delta, \max(0, |z_i| - \epsilon)\big),$$
where $z_i = y_i - x_i\beta$, and $\epsilon > 0$ and $\delta > 0$ are threshold tuning parameters. The canal loss is bounded above by the constant $\delta$, which considerably reduces the negative influence of outliers and makes it a noise-resistant loss function. Exploiting the advantages of the canal loss, we modify the elastic net and propose the canal-adaptive elastic net as a new method, defined as:
$$\text{Canal-Adaptive Elastic Net:} \quad L(\lambda_1, \lambda_2, \beta) = \sum_{i=1}^{n} l_\epsilon^\delta(z_i) + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j|,$$
where $\lambda_1, \lambda_2 > 0$ and $\hat{w}_j = |\hat{\beta}_j^{(en)}|^{-\gamma}$ for $j = 1, \ldots, p$, with $\gamma$ a positive constant; we also define $\hat{w}_j = \infty$ when $\hat{\beta}_j^{(en)} = 0$. Here $\hat{\beta}_j^{(en)}$ is the weight used to correct the regression coefficient $\beta_j$, and we define $\hat{\beta}^{(en)}$ as:
$$\hat{\beta}^{(en)} = \arg\min_{\beta} \sum_{i=1}^{n} l_\epsilon^\delta(z_i),$$
where $\hat{\beta}^{(en)} = \big(\hat{\beta}_1^{(en)}, \hat{\beta}_2^{(en)}, \ldots, \hat{\beta}_p^{(en)}\big)^T$.
The canal loss approaches the absolute loss as $\epsilon \to 0$ and $\delta \to +\infty$, which is expressed more clearly as:
$$\lim_{\epsilon \to 0,\, \delta \to +\infty} l_\epsilon^\delta(z_i) = \lim_{\epsilon \to 0,\, \delta \to +\infty} \min\big(\delta, \max(0, |z_i| - \epsilon)\big) = |z_i|.$$
The proposed canal-adaptive elastic net is therefore expected to be robust to outliers and to have the property of sparse representation.
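A minimal sketch of the canal loss itself (our own code, with illustrative threshold values):

```python
import numpy as np

def canal_loss(z, eps, delta):
    """Canal loss l_eps^delta(z) = min(delta, max(0, |z| - eps)): zero inside the
    eps-tube, linear in the 'canal', and capped at delta for gross outliers."""
    return np.minimum(delta, np.maximum(0.0, np.abs(z) - eps))

residuals = np.array([0.05, 0.8, 50.0])
# clean point contributes 0, moderate error contributes 0.7, outlier is capped at delta = 2.0
print(canal_loss(residuals, eps=0.1, delta=2.0))
```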

3.2. Online Learning Algorithm for Canal-Adaptive Elastic Net

We employ the online gradient descent (OGD) algorithm and present a minimization strategy to solve the canal-adaptive elastic net model efficiently,
$$L(\lambda_1, \lambda_2, \beta) = \sum_{t=1}^{n} l_\epsilon^\delta(z_t) + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \qquad (1)$$
where $z_t = y_t - x_t\beta$.
First, the literature has proposed numerous methods for estimating the regularization parameters, including cross-validation, AIC, and BIC. We minimize a BIC-type objective function to optimize the regularization parameters, which makes the calculation quicker and ensures consistency in variable selection, i.e.,
$$\min_{\lambda} \; \sum_{t=1}^{n} l_\epsilon^\delta(z_t) + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j| + \log(\lambda_1 + \lambda_2)\log(n).$$
Second, although Equation (1) is not a convex optimization problem, it can be restated as a difference-of-convex (DC) programming problem, which may be solved using the Concave-Convex Procedure (CCCP). However, because CCCP is a batch learning algorithm, it does not meet real-time processing requirements when handling streaming data. We therefore use the well-known OGD framework to arrive at a near-optimal solution, a compromise between accuracy and scalability. To minimize Equation (1) by OGD, we reformulate it as:
$$\arg\min_{\beta} L(\beta) \;\Longleftrightarrow\; \arg\min_{\beta} \sum_{t=1}^{n} \underbrace{\Big( l_\epsilon^\delta(z_t) + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j| \Big)}_{J_t(\beta)},$$
and then, based on the basic structure of the OGD algorithm, we solve this optimization problem,
$$\beta^{(t)} = \beta^{(t-1)} - \eta_t \nabla_\beta J_t(\beta)\big|_{\beta = \beta^{(t-1)}}. \qquad (2)$$
Here, $\eta_t$ is the step size at step $t$, satisfying the constraints $\sum_{t=1}^{n} \eta_t^2 < \infty$ and $\sum_{t=1}^{n} \eta_t = \infty$ as $n \to \infty$ [42]. Unlike the exact computation of the full gradient of $L(\lambda_1, \lambda_2, \beta)$, the notation $\nabla_\beta J_t(\beta)|_{\beta = \beta^{(t-1)}}$ denotes the derivative of $J_t(\beta)$ with respect to $\beta$ evaluated at $\beta = \beta^{(t-1)}$. We can derive $\nabla_\beta J_t(\beta)|_{\beta = \beta^{(t-1)}}$ as follows:
$$\nabla_\beta J_t(\beta)\big|_{\beta = \beta^{(t-1)}} =
\begin{cases}
-x_t + 2\lambda_2 \beta^{(t-1)} + \lambda_1 \big( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) \big), & \text{if } -\epsilon - \delta < z_t < -\epsilon, \\
x_t + 2\lambda_2 \beta^{(t-1)} + \lambda_1 \big( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) \big), & \text{if } \epsilon < z_t < \epsilon + \delta, \\
2\lambda_2 \beta^{(t-1)} + \lambda_1 \big( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) \big), & \text{otherwise},
\end{cases} \qquad (3)$$
where $z_t = y_t - x_t\beta^{(t-1)}$. Substituting the gradient in Equation (3) into Equation (2) yields:
$$\beta^{(t)} =
\begin{cases}
\beta^{(t-1)} - \eta_t \Big( x_t \operatorname{sign}(z_t) + 2\lambda_2 \beta^{(t-1)} + \lambda_1 \big( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) \big) \Big), & \text{if } \epsilon < |z_t| < \epsilon + \delta, \\
\beta^{(t-1)} - \eta_t \Big( 2\lambda_2 \beta^{(t-1)} + \lambda_1 \big( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) \big) \Big), & \text{otherwise}.
\end{cases} \qquad (4)$$
Finally, as shown in Equation (3), the proposed canal-adaptive elastic net contains a sparsity parameter $\epsilon \geq 0$ and a noise-resilience parameter $\delta \geq 0$. The parameter $\epsilon$ determines the sparsity of the proposed model, whereas $\delta$ indicates the level of noise resilience, so a strategy for adjusting the canal loss parameters is a pressing issue. In our scheme, $\epsilon$ and $\delta$ are updated automatically at each iteration; in this study, we set the parameters as:
$$\epsilon = \zeta \times \operatorname{mean}(\hat{y}_t, y_t), \qquad \delta = \gamma \times \operatorname{mean}(\hat{y}_t, y_t). \qquad (5)$$
Adjusting the $\epsilon$ and $\delta$ parameters is thus equivalent to adjusting $\zeta$ and $\gamma$. When $\gamma$ is set to 0, the algorithm does not learn from any example $\{(x_t, y_t)\}_{t=1}^{n}$ and instead updates $\beta$ according to the regularization term only. If $\gamma$ is sufficiently large, our canal-adaptive elastic net will withstand noisy data. The proposed canal-adaptive elastic net algorithm is summarized as Algorithm 1.
Algorithm 1 Noise-Resilient Online Canal-Adaptive Elastic Net Algorithm.
Input: Initial $\beta^{(0)} = [1, 1, \ldots, 1]_{d+1}^T$, estimated number of examples $n$, and instance sequence $x_t$ ($t = 1, \ldots$).
Output: Predictions $\hat{y}_t$ ($t = 1, \ldots$).
1: $X_t = (1, x_t^T)^T = (1, x_{1t}, x_{2t}, \ldots, x_{dt})^T$
2: for $t = 1, \ldots$ do
3:      Receive instance $X_t$
4:      Predict value $\hat{y}_t = X_t^T \beta^{(t-1)}$
5:      Receive true value $y_t$
6:      Update canal loss parameters $\epsilon$ and $\delta$ according to Equation (5)
7:      Compute residual error $z_t = \hat{y}_t - y_t$
8:      if $\epsilon \leq |z_t| < \epsilon + \delta$ then
9:         Update $\beta^{(t)} = \beta^{(t-1)} - \eta_t \big( x_t \operatorname{sign}(z_t) + 2\lambda_2 \beta^{(t-1)} + \lambda_1 ( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) ) \big)$, according to Equation (4).
10:      else
11:         Update $\beta^{(t)} = \beta^{(t-1)} - \eta_t \big( 2\lambda_2 \beta^{(t-1)} + \lambda_1 ( \gamma |\beta^{(t-1)}|^{\gamma-1} \operatorname{sign}(\beta^{(t-1)}) |\beta^{(t-1)}| + |\beta^{(t-1)}|^{\gamma} \operatorname{sign}(\beta^{(t-1)}) ) \big)$, according to Equation (4).
12:      end if
13: end for
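A runnable sketch of Algorithm 1 in Python follows. It is our own illustration under stated assumptions: the hyper-parameter values are placeholders, the paper's overloaded $\gamma$ is split into `gamma_w` (weight exponent) and `gamma_delta` ($\delta$ multiplier), Equation (5) is read here as scaling a running mean of absolute prediction errors, and the intercept is penalized for simplicity.

```python
import numpy as np

def canal_adaptive_elastic_net_ogd(stream, d, lam1=0.001, lam2=0.001,
                                   gamma_w=1.0, zeta=0.1, gamma_delta=2.0, eta0=0.1):
    """Sketch of Algorithm 1: online canal-adaptive elastic net via OGD.
    `stream` yields (x_t, y_t) with x_t a length-d array; returns (beta, predictions)."""
    beta = np.ones(d + 1)                  # beta^(0) = [1, 1, ..., 1]^T (intercept included)
    running_abs_err = 1.0                  # assumed proxy for the mean absolute error in Eq. (5)
    preds = []
    for t, (x, y) in enumerate(stream, start=1):
        X_t = np.concatenate(([1.0], np.asarray(x, dtype=float)))   # X_t = (1, x_t^T)^T
        y_hat = float(X_t @ beta)          # step 4: predict before updating
        preds.append(y_hat)
        z = y_hat - y                      # step 7: residual error z_t
        running_abs_err = 0.9 * running_abs_err + 0.1 * abs(z)
        eps, delta = zeta * running_abs_err, gamma_delta * running_abs_err  # step 6, Eq. (5)
        eta = eta0 / t                     # step size satisfying the Robbins-Monro conditions
        # gradient of the penalty part shared by both branches of Eq. (4)
        pen = (2.0 * lam2 * beta
               + lam1 * (gamma_w * np.abs(beta) ** (gamma_w - 1.0) * np.sign(beta) * np.abs(beta)
                         + np.abs(beta) ** gamma_w * np.sign(beta)))
        if eps <= abs(z) < eps + delta:    # step 8: residual falls inside the "canal"
            beta = beta - eta * (X_t * np.sign(z) + pen)             # step 9
        else:                              # inside the eps-tube or discarded as an outlier
            beta = beta - eta * pen        # step 11
    return beta, preds
```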

4. Experiments

In this part, we perform experiments to evaluate how the canal-adaptive elastic net algorithm performs. First, simulation studies on synthetic data with multicollinearity and noise are used to verify the method's efficiency. Second, the model's resistance to noise and its variable selection accuracy are evaluated using datasets with different noise proportions. Finally, we run thorough tests to assess the proposed algorithm's performance on four benchmark prediction tasks. The benchmark datasets used in the experiments are available from the UCI Machine Learning Repository and the LIBSVM website.

4.1. Simulation Settings

We evaluate the proposed noise-resilient online regression algorithm on synthetic datasets with noise and multicollinearity, examining the proposed canal-adaptive elastic net method's effectiveness in handling noisy and multi-collinear input and output data. In addition, we evaluate the method's performance in simulation trials on datasets with and without multicollinearity. The simulation experiments are described in detail below.

4.1.1. The Case of Both Multicollinearity and Noise

This experiment indicates that, on streaming data, the canal-adaptive elastic net outperforms the lasso, ridge regression, and elastic net in handling multi-collinear data and is a more suitable variable selection procedure for noisy data than the other three methods.
We simulate 200 observations in each example and set the feature dimension $d$ to 10. We let $\beta_j = 0.85$ for all $j$. The correlation coefficients between $x_i$ and $x_j$ were greater than 0.8 in absolute value. We trained the model on 70% of the data and tested it on the remaining 30%. We conducted 20 randomized trials and computed the MAE, RMSE, number of discarded samples, discard rate, and average computation time of the model on the test data set under varied noise proportions for $x$ and $y$.
We generated the simulated data from the true model,
$$y_t = x_t\beta + \rho\epsilon_t, \qquad \epsilon_t \sim N(0, 1), \qquad (6)$$
where $\rho = 3$ and $\epsilon_t$ is generated from the normal distribution $N(0, 1)$. For a given $t$, the covariate $x_t$ is constructed using a standard $d$-dimensional multivariate normal distribution, which assures that the components of $x_t$ are independent and standard normal. Here, we vary the noise ratio $\sigma$ over $\{0, 0.1, 0.2, 0.3\}$. To be more precise, we randomly select samples $\{x_t, y_t\}$ at the ratio $\sigma$, change the 6th explanatory variable of $x_t$ to 0 in the training set, and then evaluate the learning model on the real test data; Table 1 contains the results. Furthermore, to examine the effect of a noisy response variable $y$, we randomly altered the response variable $y$ to 0 in the training set at a rate of $\sigma$ and then tested the learning model with the real test samples; Table 2 summarizes the corresponding findings. For comparison, the compared models, i.e., the lasso, ridge regression, elastic net, and canal-adaptive elastic net, are all solved using the online gradient descent (OGD) method. In the simulated experiments, we set the two hyper-parameters $\epsilon = 0.1$ and $\delta = 2.0$.
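For illustration, a hedged sketch of this noise-injection setup (our own code; the exact construction of the correlated design is not fully specified in the text, so the sketch only reproduces the data-generating model and the corruption step, and the seed and names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, rho, sigma = 200, 10, 3.0, 0.2                 # sigma is the corruption ratio

X = rng.standard_normal((n, d))                      # covariates x_t
beta_true = np.full(d, 0.85)                         # beta_j = 0.85 for all j
y = X @ beta_true + rho * rng.standard_normal(n)     # Eq. (6) with eps_t ~ N(0, 1)

# corrupt a sigma-fraction of the training rows: zero out the 6th explanatory variable
idx = rng.choice(n, size=int(sigma * n), replace=False)
X_noisy = X.copy()
X_noisy[idx, 5] = 0.0                                # column index 5 is the 6th variable
```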
First, we show that the canal-adaptive elastic net can withstand interference in the explanatory variable $x$. Analysis of the RMSE and MAE indicates that the lasso deviates significantly from the true values, whereas the canal-adaptive elastic net and ridge regression perform admirably; the lasso is by nature sensitive to multicollinearity in the data. In the presence of noise, the proposed canal-adaptive elastic net method outperforms the other three competing methods; in particular, it significantly outperforms the lasso, ridge regression, and elastic net at a high noise level ($\sigma = 0.3$). Because of the inherent drawbacks of the elastic net, its loss function can reduce the impact of noisy data only to some extent, and the negative impact of noisy data remains serious. Figure 3a illustrates the predictive performance of the different algorithms. It can be observed that when the data are multi-collinear and noisy, the canal-adaptive elastic net outperforms the lasso, ridge regression, and elastic net. This shows that the canal-adaptive elastic net is a method capable of overcoming multicollinearity and is noise-resistant.
Every coefficient may be affected when noisy data are present in the response variable $y$. In the presence of multicollinearity in the data, the proposed canal-adaptive elastic net significantly outperforms the lasso, ridge regression, and elastic net. The lasso, by its nature, does not predict data containing multicollinearity very well, and its estimate of $\beta$ deviates further from the true coefficients than the other three methods. Compared with the lasso, ridge regression and the elastic net effectively overcome the effects of multicollinearity, but their performance suffers once a certain level of noise is introduced. The prediction performance of the different models is provided in Figure 3b for a more detailed comparison. The canal-adaptive elastic net outperforms the other three approaches in the presence of noisy and multi-collinear data. The results show that the canal-adaptive elastic net is a successful technique for overcoming multicollinearity and handling noisy data when the response variable $y$ contains considerable noise.

4.1.2. The Case of Noise

In this subsection, we present simulation experiments comparing the performance of the canal-adaptive elastic net with three competing approaches (lasso, ridge regression, and elastic net) on limited samples of streaming noisy data (n = 5000 and 10,000). This experiment explores the performance of the four methods for group variable selection when the variable grouping is unknown; because of its nature, the group lasso cannot be included in this experiment. In addition, we set $\beta = (1, 2, 3, 4, 5, 6, 0, 0, \ldots, 0)$ with feature dimension $d = 50$, so the first six regression coefficients are significant whereas the remaining 44 are insignificant. For a given $t$, the covariate $x_t$ is created using a standard $d$-dimensional multivariate normal distribution, which ensures that the components of $x_t$ are independent and standard normal. The response variables are generated according to Equation (6), where $\rho = 0.5$ and $\epsilon_t$ is generated from the normal distribution $N(0, 1)$.
Also, to investigate the effects of a noisy response variable $y$ and a noisy explanatory variable $x$, we applied a certain percentage of noise to the training dataset in the same way as in Section 4.1.1 and then tested the learning model with real test samples. Table 3 and Table 4 report the related results. For each parameter setting, 20 random experiments were conducted to evaluate the average performance on datasets with sample sizes $n$ equal to 5000 and 10,000, respectively. For comparison, the lasso, ridge regression, elastic net, and canal-adaptive elastic net models were all solved using the online gradient descent (OGD) method. The approaches are then compared by computing the MAE, RMSE, number of discarded samples, discard rate, and average computing time on the test data. We pre-set the parameters to $\epsilon = 0.01$ and $\sigma = 0.8$.
To begin, we show that the canal-adaptive elastic net is unaffected by noise in the explanatory variable $x$. All four methods perform well when the noise rate is $\sigma = 0$. As illustrated in Table 3, the lasso, elastic net, and ridge regression deviate significantly from the true values under noisy data; in particular, when $\sigma = 0.2$ or $0.3$, the canal-adaptive elastic net significantly outperforms the other three competing approaches. Because of the shortcomings of their loss functions, the lasso, elastic net, and ridge regression are susceptible to noisy data. Figure 4 provides a complete comparison of the prediction performance of the different algorithms. In the presence of noisy input data, the results show that the canal-adaptive elastic net beats the lasso, ridge regression, and elastic net on average.
Every coefficient may be affected if the response variable $y$ contains noisy data. As illustrated in Table 4, the proposed canal-adaptive elastic net method outperforms the other three competing methods when dealing with noisy data. Because of their least squares loss, the lasso, ridge regression, and elastic net are highly sensitive to noise. To compare the models more comprehensively, the prediction performance of the different models is presented in Figure 5. It can be observed that the proposed canal-adaptive elastic net significantly outperforms the other three competing methods in all aspects of the noisy output case.
As seen in Table 5 and Table 6, the canal-adaptive elastic net generates sparse solutions and behaves similarly to an “Oracle”. The additional grouping-effect capability makes elastic-net-type methods better variable selection methods than lasso-type methods.

4.2. Benchmark Data Sets

We undertake thorough tests in this section to evaluate the proposed canal-adaptive elastic net algorithm's performance on real-world tasks. Four benchmark datasets were employed for the experimental evaluation: “Kin”, “Abalone”, “Pendigits”, and “Letters”. The first two datasets are selected from the UCI repository [43], and the last two from Chang and Lin [44]. Table 7 summarizes the four benchmark datasets. To illustrate the statistical features of the various datasets, we created box plots, as shown in Figure 6. To simulate the streaming-data setting, we replicate the samples three times. Before conducting the experiments, domain experts need to analyze and specify the parameter sensitivities of our models; Table 8 displays the parameter settings for the four benchmark datasets. Each experiment is repeated 20 times with random splits, and the average performance is recorded.
On the benchmark datasets, Table 9 and Table 10 summarize the RMSE, MAE, number of discarded samples, discard rate, and average running time of the four comparative methods: lasso, ridge regression, elastic net, and canal-adaptive elastic net. The regression accuracy (RMSE and MAE) results demonstrate that when the data are clean ($\sigma = 0$), the performance of the four methods is comparable. However, for noisy data ($\sigma \geq 0.1$), the proposed canal-adaptive elastic net significantly outperforms the other three approaches in terms of noise immunity. As seen in the seventh column, the discard rate increases as the noise level $\sigma$ grows. For a more comprehensive comparison, we give the average RMSE in Figure 7 and Figure 8. We can see that the canal-adaptive elastic net proposed in this paper is the most stable method on all four datasets.

5. Conclusions

This article presents a novel linear regression model, the canal-adaptive elastic net, to address the challenge of online learning with noisy and multi-collinear data. The canal-adaptive elastic net generates a sparse model with high prediction accuracy while promoting a grouping effect. Additionally, the canal-adaptive elastic net is solved efficiently within an online gradient descent framework. The simulations and empirical results demonstrate the canal-adaptive elastic net's outstanding performance and superiority over the other three approaches (lasso, ridge regression, and elastic net). Future studies will focus on extending the linear regression model to a non-linear regression model through the use of the kernel technique [45].

Author Contributions

Data curation, J.L.; Investigation, R.L.; Methodology, W.W. and M.Z.; Writing—original draft, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Fundamental Research Funds for the Central Universities (No. 20CX05012A), the NSF project (ZR2019MA016) of Shandong Province of China, and the Statistical Research Project (KT028) of Shandong Province of China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this paper are all available at the following links: 1. http://archive.ics.uci.edu/ml/ (accessed on 20 June 2022); 2. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (accessed on 20 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gama, J. Knowledge discovery from data streams. Intell. Data Anal. 2009, 13, 403–404.
  2. Jian, L.; Gao, F.; Ren, P.; Song, Y.; Luo, S. A noise-resilient online learning algorithm for scene classification. Remote Sens. 2018, 10, 1836.
  3. Jian, L.; Li, J.; Liu, H. Toward online node classification on streaming networks. Data Min. Knowl. Discov. 2018, 32, 231–257.
  4. Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 928–936.
  5. Aiken, L.S.; West, S.G. Multiple Regression: Testing and Interpreting Interactions; Sage: Newbury Park, CA, USA, 1991.
  6. Wang, D.; Zhang, Z. Summary of variable selection methods in linear regression models. Math. Stat. Manag. 2010, 29, 615–627.
  7. Frank, I.E.; Friedman, J.H. A statistical view of some chemometrics regression tools. Technometrics 1993, 35, 109–148.
  8. Hoerl, A.; Kennard, R. Ridge regression. In Encyclopedia of Statistical Sciences; Wiley: New York, NY, USA, 1988; Volume 8, pp. 129–136.
  9. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 1996, 58, 267–288.
  10. Huang, J.; Ma, S.G.; Zhang, C.H. Adaptive lasso for sparse high-dimensional regression models. Stat. Sin. 2008, 374, 1603–1618.
  11. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
  12. Dicker, L.; Lin, X. Parallelism, uniqueness, and large-sample asymptotics for the Dantzig selector. Can. J. Stat. 2013, 41, 23–35.
  13. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351.
  14. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 2006, 68, 49–67.
  15. Chesneau, C.; Hebiri, M. Some theoretical results on the Grouped Variables Lasso. Math. Methods Stat. 2008, 17, 317–326.
  16. Percival, D. Theoretical properties of the overlapping groups lasso. Electron. J. Stat. 2012, 6, 269–288.
  17. Li, Y.; Nan, B.; Zhu, J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics 2015, 71, 354–363.
  18. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
  19. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 15–18.
  20. Geisser, S.; Eddy, W.F. A predictive approach to model selection. J. Am. Stat. Assoc. 1979, 74, 153–160.
  21. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  22. Xu, Y.; Zhu, S.; Yang, S.; Zhang, C.; Jin, R.; Yang, T. Learning with non-convex truncated losses by SGD. arXiv 2018, arXiv:1805.07880.
  23. Chang, L.; Roberts, S.; Welsh, A. Robust lasso regression using Tukey's biweight criterion. Technometrics 2018, 60, 36–47.
  24. Xu, S.; Zhang, C.-X. Robust sparse regression by modeling noise as a mixture of Gaussians. J. Appl. Stat. 2019, 46, 1738–1755.
  25. Wang, X.; Jiang, Y.; Huang, M.; Zhang, H. Robust variable selection with exponential squared loss. J. Am. Stat. Assoc. 2013, 108, 632–643.
  26. Young, D.S. Handbook of Regression Methods; CRC Press: Boca Raton, FL, USA, 2017; pp. 109–136.
  27. Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings of the Second International Symposium on Information Theory; Petrov, B.N., Csaki, F., Eds.; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281.
  28. Gunst, R.F.; Webster, J.T. Regression analysis and problems of multicollinearity. Commun. Stat. 1975, 4, 277–292.
  29. Guilkey, D.K.; Murphy, J.L. Directed ridge regression techniques in cases of multicollinearity. J. Am. Stat. Assoc. 1975, 70, 767–775.
  30. El-Dereny, M.; Rashwan, N. Solving multicollinearity problem using ridge regression models. Sciences 2011, 12, 585–600.
  31. Bhadeshia, H. Neural networks and information in materials science. Stat. Anal. Data Min. ASA Data Sci. J. 2009, 1, 296–305.
  32. Zurada, J.M. Introduction to Artificial Neural Systems; West Publishing Company: St. Paul, MN, USA, 1992; Volume 8.
  33. Gunn, S.R. Support vector machines for classification and regression. ISIS Tech. Rep. 1998, 14, 5–16.
  34. Wang, Z.; Vucetic, S. Online training on a budget of support vector machines using twin prototypes. Stat. Anal. Data Min. ASA Data Sci. J. 2010, 3, 149–169.
  35. Aggarwal, C.C. Data Mining: The Textbook; Springer: Berlin/Heidelberg, Germany, 2015.
  36. Bottou, L. Online learning and stochastic approximations. On-Line Learn. Neural Netw. 1998, 17, 142.
  37. Gao, F.; Song, X.; Jian, L.; Liang, X. Toward budgeted online kernel ridge regression on streaming data. IEEE Access 2019, 7, 26136–26145.
  38. Arce, P.; Salinas, L. Online ridge regression method using sliding windows. In Proceedings of the Chilean Computer Science Society (SCCC), Washington, DC, USA, 12–16 November 2012; pp. 87–90.
  39. Monti, R.P.; Anagnostopoulos, C.; Montana, G. Adaptive regularization for lasso models in the context of nonstationary data streams. Stat. Anal. Data Min. ASA Data Sci. J. 2018, 11, 237–247.
  40. Orabona, F.; Keshet, J.; Caputo, B. The projectron: A bounded kernel-based perceptron. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 720–727.
  41. Zhao, P.; Wang, J.; Wu, P.; Jin, R.; Hoi, S.C. Fast bounded online gradient descent algorithms for scalable kernel-based online learning. arXiv 2012, arXiv:1206.4633.
  42. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 1, 400–407.
  43. Dheeru, D.; Karra Taniskidou, E. UCI Machine Learning Repository; School of Information and Computer Science: Irvine, CA, USA, 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 20 June 2022).
  44. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
  45. Liu, W.; Pokharel, P.P.; Principe, J.C. The kernel least-mean-square algorithm. IEEE Trans. Signal Process. 2008, 56, 543–554.
Figure 1. I. absolute loss; II. ϵ-insensitive loss; III. canal loss.
Figure 2. An illustration schematic of the online regression learning procedure.
Figure 3. Results of simulations of noisy explanatory variable x and noise response variables y in the presence of data multicollinearity. (a) The noise explanatory variable x. (b) The noise response variable y.
Figure 4. Simulation results for noisy explanatory variable x. (a) n = 5000. (b) n = 10,000.
Figure 5. Simulation results for noisy response y. (a) n = 5000. (b) n = 10,000.
Figure 6. Box plots of four benchmark datasets. (a) Letters. (b) Kin. (c) Abalone. (d) Pendigits.
Figure 7. Experimental results on the dataset “Kin” and dataset “Abalone”. (a) Kin. (b) Abalone.
Figure 8. Experimental results on the dataset “Letters” and dataset “Pendigits”. (a) Letters. (b) Pendigits.
Table 1. Results of simulations of noisy explanatory variable x in the presence of data multicollinearity.
σ | Method | RMSE | MAE | Discarded Samples | Discarded Rate | Time (s)
0 | Lasso | 1.7274 ± 0.2074 | 9.6486 ± 2.2035 | 0 | 0.00% | 0.0011
 | Elastic Net | 1.7739 ± 0.2324 | 9.8651 ± 3.6694 | 0 | 0.00% | 0.0013
 | Ridge Regression | 1.6251 ± 0.2599 | 7.7511 ± 1.6699 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.6109 ± 0.1826 | 7.0687 ± 1.8752 | 0 | 0.00% | 0.0014
0.1 | Lasso | 2.1554 ± 0.3275 | 15.2898 ± 4.9195 | 0 | 0.00% | 0.0011
 | Elastic Net | 1.7073 ± 0.2796 | 9.3282 ± 2.8530 | 0 | 0.00% | 0.0013
 | Ridge Regression | 1.6610 ± 0.2693 | 8.2822 ± 2.5765 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.5942 ± 0.2684 | 7.2797 ± 2.5177 | 26.000 ± 2.000 | 13.00% | 0.0015
0.2 | Lasso | 2.3174 ± 0.2899 | 18.0449 ± 4.7368 | 0 | 0.00% | 0.0012
 | Elastic Net | 2.1440 ± 0.3066 | 15.3115 ± 4.7714 | 0 | 0.00% | 0.0013
 | Ridge Regression | 1.5664 ± 0.1476 | 7.7189 ± 1.4264 | 0 | 0.00% | 0.0011
 | Canal-Adaptive Elastic Net | 1.4850 ± 0.1959 | 7.0048 ± 1.3591 | 44.000 ± 4.000 | 22.00% | 0.0015
0.3 | Lasso | 2.3753 ± 0.3360 | 18.8495 ± 4.6064 | 0 | 0.00% | 0.0012
 | Elastic Net | 2.2583 ± 0.2608 | 16.8451 ± 3.7473 | 0 | 0.00% | 0.0013
 | Ridge Regression | 1.7206 ± 0.1009 | 9.0645 ± 1.7891 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.6157 ± 0.1340 | 8.0100 ± 2.1617 | 72.000 ± 7.000 | 36.00% | 0.0015
Table 2. Results of simulations of noise response variables y in the presence of data multicollinearity.
σ | Method | RMSE | MAE | Discarded Samples | Discarded Rate | Time (s)
0 | Lasso | 2.1554 ± 0.3275 | 15.2898 ± 4.9195 | 0 | 0.00% | 0.0011
 | Elastic Net | 1.7739 ± 0.2324 | 9.8651 ± 3.6694 | 0 | 0.00% | 0.0013
 | Ridge Regression | 1.6251 ± 0.2599 | 8.0043 ± 1.6699 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.6109 ± 0.1826 | 7.6687 ± 1.8752 | 0 | 0.00% | 0.0014
0.1 | Lasso | 2.2082 ± 0.3966 | 14.8886 ± 5.2407 | 0 | 0.00% | 0.0012
 | Elastic Net | 1.8817 ± 0.2330 | 10.2875 ± 3.9768 | 0 | 0.00% | 0.0015
 | Ridge Regression | 1.7057 ± 0.1853 | 9.0843 ± 2.6754 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.6013 ± 0.1743 | 7.0020 ± 2.3720 | 76.000 ± 7.000 | 38.00% | 0.0016
0.2 | Lasso | 2.3659 ± 0.3966 | 18.9535 ± 3.2407 | 0 | 0.00% | 0.0012
 | Elastic Net | 1.9372 ± 0.2960 | 13.1065 ± 4.0957 | 0 | 0.00% | 0.0013
 | Ridge Regression | 1.8668 ± 0.2369 | 9.8637 ± 2.8419 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.6585 ± 0.2178 | 7.7834 ± 2.0399 | 84.000 ± 8.000 | 42.00% | 0.0015
0.3 | Lasso | 2.4068 ± 0.2157 | 19.7197 ± 3.5115 | 0 | 0.00% | 0.0011
 | Elastic Net | 2.0314 ± 0.2787 | 14.2434 ± 4.9863 | 0 | 0.00% | 0.0014
 | Ridge Regression | 2.0668 ± 0.1779 | 14.4463 ± 1.6615 | 0 | 0.00% | 0.0012
 | Canal-Adaptive Elastic Net | 1.7624 ± 0.3057 | 8.6620 ± 2.7414 | 90.000 ± 0.800 | 45.00% | 0.0015
Table 3. Simulation results for noisy explanatory variable x.
n | σ | Method | RMSE | MAE | Discarded Samples | Discarded Rate | Time (s)
5000 | 0 | Lasso | 0.1618 ± 0.0018 | 0.8085 ± 0.0170 | 0 | 0.00% | 0.1787
 | | Elastic Net | 0.1627 ± 0.0014 | 0.8165 ± 0.0153 | 0 | 0.00% | 0.3423
 | | Ridge Regression | 0.1626 ± 0.0021 | 0.8154 ± 0.0229 | 0 | 0.00% | 0.1951
 | | Canal-Adaptive Elastic Net | 0.1621 ± 0.0031 | 0.8122 ± 0.0260 | 694 ± 32 | 13.89% | 0.2663
 | 0.1 | Lasso | 0.1955 ± 0.0101 | 1.1882 ± 0.1182 | 0 | 0.00% | 0.1702
 | | Elastic Net | 0.1885 ± 0.0106 | 1.1054 ± 0.1214 | 0 | 0.00% | 0.3072
 | | Ridge Regression | 0.1697 ± 0.0047 | 0.8966 ± 0.0509 | 0 | 0.00% | 0.186
 | | Canal-Adaptive Elastic Net | 0.1693 ± 0.0059 | 0.8896 ± 0.0620 | 903 ± 25 | 18.07% | 0.2528
 | 0.2 | Lasso | 0.2073 ± 0.0103 | 1.3278 ± 0.1382 | 0 | 0.00% | 0.1706
 | | Elastic Net | 0.2036 ± 0.0091 | 1.2797 ± 0.1239 | 0 | 0.00% | 0.2939
 | | Ridge Regression | 0.1841 ± 0.0063 | 1.0430 ± 0.0704 | 0 | 0.00% | 0.1736
 | | Canal-Adaptive Elastic Net | 0.1809 ± 0.0045 | 1.0084 ± 0.0548 | 1118 ± 18 | 22.37% | 0.2442
 | 0.3 | Lasso | 0.2151 ± 0.0109 | 1.4320 ± 0.1292 | 0 | 0.00% | 0.1685
 | | Elastic Net | 0.2099 ± 0.0046 | 1.3597 ± 0.0636 | 0 | 0.00% | 0.3036
 | | Ridge Regression | 0.1941 ± 0.0107 | 1.1614 ± 0.1392 | 0 | 0.00% | 0.1655
 | | Canal-Adaptive Elastic Net | 0.1884 ± 0.0061 | 1.0968 ± 0.0707 | 1325 ± 29 | 26.52% | 0.2329
10,000 | 0 | Lasso | 0.1354 ± 0.0010 | 0.8009 ± 0.0159 | 0 | 0.00% | 0.466
 | | Elastic Net | 0.1355 ± 0.0011 | 0.8010 ± 0.0127 | 0 | 0.00% | 0.7631
 | | Ridge Regression | 0.1358 ± 0.0017 | 0.8044 ± 0.0159 | 0 | 0.00% | 0.7302
 | | Canal-Adaptive Elastic Net | 0.1358 ± 0.0011 | 0.8054 ± 0.0138 | 1353 ± 50 | 13.54% | 0.6464
 | 0.1 | Lasso | 0.1669 ± 0.0110 | 1.2193 ± 0.1625 | 0 | 0.00% | 0.4322
 | | Elastic Net | 0.1581 ± 0.0106 | 1.0950 ± 0.1466 | 0 | 0.00% | 0.819
 | | Ridge Regression | 0.1449 ± 0.0033 | 0.9172 ± 0.0476 | 0 | 0.00% | 0.4735
 | | Canal-Adaptive Elastic Net | 0.1434 ± 0.0035 | 0.8981 ± 0.0466 | 1889 ± 44 | 18.90% | 0.6209
 | 0.2 | Lasso | 0.1758 ± 0.0048 | 1.3500 ± 0.0746 | 0 | 0.00% | 0.4313
 | | Elastic Net | 0.1737 ± 0.0075 | 1.3187 ± 0.1231 | 0 | 0.00% | 0.778
 | | Ridge Regression | 0.1560 ± 0.0087 | 1.0660 ± 0.1086 | 0 | 0.00% | 0.4408
 | | Canal-Adaptive Elastic Net | 0.1517 ± 0.0069 | 1.0067 ± 0.0921 | 2300 ± 34 | 23.00% | 0.587
 | 0.3 | Lasso | 0.1825 ± 0.0060 | 1.4521 ± 0.0948 | 0 | 0.00% | 0.5801
 | | Elastic Net | 0.1785 ± 0.0053 | 1.3895 ± 0.0854 | 0 | 0.00% | 0.7243
 | | Ridge Regression | 0.1656 ± 0.0064 | 1.1980 ± 0.0919 | 0 | 0.00% | 0.4343
 | | Canal-Adaptive Elastic Net | 0.1603 ± 0.0045 | 1.1198 ± 0.0599 | 2661 ± 51 | 26.61% | 0.5867
Table 4. Simulation results for noisy response variable y.
n | σ | Method | RMSE | MAE | Discarded Samples | Discarded Rate | Time (s)
5000 | 0 | Lasso | 0.1618 ± 0.0018 | 0.8085 ± 0.0170 | 0 | 0.00% | 0.1787
 | | Elastic Net | 0.1627 ± 0.0014 | 0.8165 ± 0.0153 | 0 | 0.00% | 0.3423
 | | Ridge Regression | 0.1626 ± 0.0021 | 0.8154 ± 0.0229 | 0 | 0.00% | 0.1951
 | | Canal-Adaptive Elastic Net | 0.1621 ± 0.0031 | 0.8122 ± 0.0260 | 694 ± 32 | 13.89% | 0.2663
 | 0.1 | Lasso | 0.6225 ± 0.0767 | 12.3388 ± 3.0485 | 0 | 0.00% | 0.1875
 | | Elastic Net | 0.4406 ± 0.0607 | 6.7411 ± 1.8332 | 0 | 0.00% | 0.3103
 | | Ridge Regression | 0.2215 ± 0.0532 | 1.5738 ± 0.7018 | 0 | 0.00% | 0.1983
 | | Canal-Adaptive Elastic Net | 0.1641 ± 0.0031 | 0.8260 ± 0.0350 | 1126 ± 21 | 22.54% | 0.2627
 | 0.2 | Lasso | 0.8503 ± 0.0511 | 23.1978 ± 2.9234 | 0 | 0.00% | 0.1789
 | | Elastic Net | 0.6047 ± 0.0648 | 13.0583 ± 3.1128 | 0 | 0.00% | 0.3212
 | | Ridge Regression | 0.2678 ± 0.0511 | 2.2870 ± 0.8511 | 0 | 0.00% | 0.1956
 | | Canal-Adaptive Elastic Net | 0.1643 ± 0.0040 | 0.8354 ± 0.0382 | 1539 ± 25 | 30.79% | 0.2648
 | 0.3 | Lasso | 0.9959 ± 0.0761 | 31.8982 ± 5.2155 | 0 | 0.00% | 0.1769
 | | Elastic Net | 0.6990 ± 0.0836 | 17.7096 ± 4.2660 | 0 | 0.00% | 0.2947
 | | Ridge Regression | 0.2583 ± 0.0447 | 2.1066 ± 0.7001 | 0 | 0.00% | 0.2001
 | | Canal-Adaptive Elastic Net | 0.1668 ± 0.0030 | 0.8548 ± 0.0363 | 1973 ± 29 | 39.47% | 0.2608
10,000 | 0 | Lasso | 0.1354 ± 0.0010 | 0.8009 ± 0.0159 | 0 | 0.00% | 0.466
 | | Elastic Net | 0.1355 ± 0.0011 | 0.8010 ± 0.0127 | 0 | 0.00% | 0.7631
 | | Ridge Regression | 0.1358 ± 0.0017 | 0.8044 ± 0.0159 | 0 | 0.00% | 0.7302
 | | Canal-Adaptive Elastic Net | 0.1358 ± 0.0011 | 0.8054 ± 0.0138 | 1353 ± 50 | 13.54% | 0.6464
 | 0.1 | Lasso | 0.5265 ± 0.0631 | 12.4175 ± 2.8731 | 0 | 0.00% | 0.4823
 | | Elastic Net | 0.3636 ± 0.0573 | 6.3646 ± 2.0023 | 0 | 0.00% | 0.7922
 | | Ridge Regression | 0.1549 ± 0.0172 | 1.0593 ± 0.2253 | 0 | 0.00% | 0.5049
 | | Canal-Adaptive Elastic Net | 0.1360 ± 0.0019 | 0.8052 ± 0.0173 | 2274 ± 29 | 22.74% | 0.641
 | 0.2 | Lasso | 0.6783 ± 0.0658 | 20.8429 ± 4.3086 | 0 | 0.00% | 0.4626
 | | Elastic Net | 0.4885 ± 0.0612 | 11.9550 ± 3.0504 | 0 | 0.00% | 0.7855
 | | Ridge Regression | 0.1638 ± 0.0224 | 1.1977 ± 0.3214 | 0 | 0.00% | 0.4963
 | | Canal-Adaptive Elastic Net | 0.1370 ± 0.0017 | 0.8204 ± 0.0242 | 3093 ± 29 | 30.94% | 0.6472
 | 0.3 | Lasso | 0.8442 ± 0.0727 | 32.6705 ± 5.6883 | 0 | 0.00% | 0.4569
 | | Elastic Net | 0.6067 ± 0.0485 | 18.8505 ± 2.9166 | 0 | 0.00% | 0.7683
 | | Ridge Regression | 0.1840 ± 0.0378 | 1.5151 ± 0.5767 | 0 | 0.00% | 0.4911
 | | Canal-Adaptive Elastic Net | 0.1402 ± 0.0025 | 0.8608 ± 0.0298 | 3967 ± 23 | 39.67% | 0.6422
Table 5. Median of non-zero coefficients in the presence of noisy data in the explanatory variable x.
n | Method | σ = 0 | σ = 0.1 | σ = 0.2 | σ = 0.3
5000 | Lasso | 6 | 6 | 7 | 8
 | Elastic Net | 6 | 8 | 11 | 12
 | Canal-Adaptive Elastic Net | 6 | 7 | 9 | 10
10,000 | Lasso | 6 | 6 | 6 | 7
 | Elastic Net | 7 | 8 | 9 | 11
 | Canal-Adaptive Elastic Net | 7 | 7 | 8 | 9
Table 6. Median of non-zero coefficients in the presence of noisy data in the response variable y.
n | Method | σ = 0 | σ = 0.1 | σ = 0.2 | σ = 0.3
5000 | Lasso | 6 | 8 | 9 | 10
 | Elastic Net | 8 | 10 | 12 | 14
 | Canal-Adaptive Elastic Net | 7 | 10 | 11 | 11
10,000 | Lasso | 6 | 8 | 9 | 9
 | Elastic Net | 7 | 10 | 10 | 14
 | Canal-Adaptive Elastic Net | 7 | 9 | 10 | 11
Table 7. Details of the benchmark datasets.
Dataset | #Samples | #Features | #Train | #Test
Kin | 3000 × 3 | 8 | 2100 × 3 | 900 × 3
Abalone | 4177 × 3 | 7 | 2924 × 3 | 1253 × 3
Letters | 5000 × 3 | 15 | 3500 × 3 | 1500 × 3
Pendigits | 7129 × 3 | 14 | 4990 × 3 | 2139 × 3
Table 8. Parameter settings for the four benchmark datasets.
Dataset | Kin | Abalone | Letters | Pendigits
ζ | 0.1 | 0.1 | 0.1 | 0.1
σ | 1.9 | 2.0 | 1.5 | 1.9
Table 9. Experimental results on the dataset “Kin” and dataset “Abalone”.
Dataset | σ | Method | RMSE | MAE | Discarded Samples | Discarded Rate | Time (s)
Kin | 0 | Lasso | 0.0683 ± 0.0008 | 0.1951 ± 0.0052 | 0 | 0 | 0.2258
 | | Elastic Net | 0.0684 ± 0.0007 | 0.1946 ± 0.0045 | 0 | 0 | 0.2821
 | | Ridge Regression | 0.0696 ± 0.0019 | 0.2016 ± 0.0111 | 0 | 0 | 0.2659
 | | Canal-Adaptive Elastic Net | 0.0681 ± 0.0011 | 0.1677 ± 0.0051 | 1982 ± 126 | 22.02% | 0.3297
 | 0.1 | Lasso | 0.1074 ± 0.0265 | 0.4981 ± 0.2223 | 0 | 0 | 0.2222
 | | Elastic Net | 0.0911 ± 0.0221 | 0.3552 ± 0.1579 | 0 | 0 | 0.2891
 | | Ridge Regression | 0.0683 ± 0.0015 | 0.1951 ± 0.0081 | 0 | 0 | 0.2631
 | | Canal-Adaptive Elastic Net | 0.0662 ± 0.0014 | 0.1914 ± 0.0074 | 2466 ± 153 | 27.40% | 0.3321
 | 0.2 | Lasso | 0.1365 ± 0.0230 | 0.8678 ± 0.3194 | 0 | 0 | 0.2184
 | | Elastic Net | 0.1036 ± 0.0156 | 0.4722 ± 0.1389 | 0 | 0 | 0.2877
 | | Ridge Regression | 0.0692 ± 0.0023 | 0.2502 ± 0.0124 | 0 | 0 | 0.2675
 | | Canal-Adaptive Elastic Net | 0.0673 ± 0.0026 | 0.1972 ± 0.0135 | 2620 ± 110 | 29.11% | 0.3300
 | 0.3 | Lasso | 0.1695 ± 0.0170 | 1.3323 ± 0.2765 | 0 | 0 | 0.2232
 | | Elastic Net | 0.1242 ± 0.0130 | 0.6804 ± 0.1548 | 0 | 0 | 0.2809
 | | Ridge Regression | 0.0746 ± 0.0029 | 0.2322 ± 0.0140 | 0 | 0 | 0.2622
 | | Canal-Adaptive Elastic Net | 0.0693 ± 0.0050 | 0.2000 ± 0.0301 | 2931 ± 420 | 32.57% | 0.3317
Abalone | 0 | Lasso | 0.1898 ± 0.0032 | 1.6091 ± 0.0466 | 0 | 0 | 0.3906
 | | Elastic Net | 0.1932 ± 0.0021 | 1.6553 ± 0.0492 | 0 | 0 | 0.4951
 | | Ridge Regression | 0.2015 ± 0.0053 | 1.6560 ± 0.0513 | 0 | 0 | 0.4873
 | | Canal-Adaptive Elastic Net | 0.1939 ± 0.0024 | 1.7259 ± 0.0415 | 758 ± 131 | 6.00% | 0.5314
 | 0.1 | Lasso | 0.4633 ± 0.06548 | 11.1963 ± 2.6828 | 0 | 0 | 0.3707
 | | Elastic Net | 0.3807 ± 0.0552 | 8.4261 ± 2.4375 | 0 | 0 | 0.4934
 | | Ridge Regression | 0.2201 ± 0.0034 | 2.0897 ± 0.0523 | 0 | 0 | 0.4642
 | | Canal-Adaptive Elastic Net | 0.2010 ± 0.0024 | 1.8987 ± 0.0867 | 1305 ± 98 | 10.40% | 0.5315
 | 0.2 | Lasso | 0.6084 ± 0.0468 | 19.1165 ± 2.9824 | 0 | 0 | 0.3763
 | | Elastic Net | 0.5164 ± 0.0808 | 15.3841 ± 4.4489 | 0 | 0 | 0.4929
 | | Ridge Regression | 0.2672 ± 0.0041 | 3.5966 ± 0.0945 | 0 | 0 | 0.4571
 | | Canal-Adaptive Elastic Net | 0.2248 ± 0.0037 | 2.3715 ± 0.0963 | 6161 ± 562 | 28.81% | 0.5253
 | 0.3 | Lasso | 0.7372 ± 0.0543 | 28.0564 ± 4.3451 | 0 | 0 | 0.3821
 | | Elastic Net | 0.6355 ± 0.0751 | 23.1508 ± 6.1709 | 0 | 0 | 0.4924
 | | Ridge Regression | 0.3372 ± 0.0070 | 6.1546 ± 0.1947 | 0 | 0 | 0.4484
 | | Canal-Adaptive Elastic Net | 0.2646 ± 0.0056 | 3.1383 ± 0.1093 | 4351 ± 860 | 34.70% | 0.5209
Table 10. Experimental results on the dataset “Letters” and dataset “Pendigits”.
Dataset | σ | Method | RMSE | MAE | Discarded Samples | Discarded Rate | Time (s)
Letters | 0 | Lasso | 0.3463 ± 0.0028 | 6.5160 ± 0.1389 | 0 | 0 | 0.5545
 | | Elastic Net | 0.3478 ± 0.0035 | 6.3841 ± 0.0890 | 0 | 0 | 0.7206
 | | Ridge Regression | 0.3503 ± 0.0024 | 6.3821 ± 0.1006 | 0 | 0 | 0.6559
 | | Canal-Adaptive Elastic Net | 0.3507 ± 0.0030 | 6.3834 ± 0.0802 | 2558 ± 211 | 17.05% | 0.7862
 | 0.1 | Lasso | 0.4708 ± 0.0556 | 12.1065 ± 2.8283 | 0 | 0 | 0.5841
 | | Elastic Net | 0.3905 ± 0.0435 | 8.2491 ± 1.7055 | 0 | 0 | 0.7142
 | | Ridge Regression | 0.3219 ± 0.0069 | 5.6313 ± 0.2547 | 0 | 0 | 0.6690
 | | Canal-Adaptive Elastic Net | 0.3162 ± 0.0023 | 5.4247 ± 0.0973 | 3385 ± 162 | 22.57% | 0.7994
 | 0.2 | Lasso | 0.5841 ± 0.0665 | 19.5762 ± 4.6320 | 0 | 0 | 0.5656
 | | Elastic Net | 0.4731 ± 0.0614 | 12.7691 ± 3.1076 | 0 | 0 | 0.7180
 | | Ridge Regression | 0.3345 ± 0.0097 | 6.0297 ± 0.2622 | 0 | 0 | 0.6621
 | | Canal-Adaptive Elastic Net | 0.3296 ± 0.0029 | 5.8493 ± 0.0884 | 4245 ± 143 | 28.30% | 0.8144
 | 0.3 | Lasso | 0.7574 ± 0.0666 | 33.8068 ± 6.3160 | 0 | 0 | 0.5730
 | | Elastic Net | 0.5822 ± 0.0780 | 20.2497 ± 5.1140 | 0 | 0 | 0.7283
 | | Ridge Regression | 0.3658 ± 0.0151 | 7.1164 ± 0.5715 | 0 | 0 | 0.6688
 | | Canal-Adaptive Elastic Net | 0.3453 ± 0.0052 | 6.3131 ± 0.1868 | 4950 ± 116 | 33.00% | 0.7812
Pendigits | 0 | Lasso | 0.1806 ± 0.0014 | 2.0619 ± 0.0623 | 0 | 0 | 1.0078
 | | Elastic Net | 0.1823 ± 0.0017 | 1.9752 ± 0.0340 | 0 | 0 | 1.3111
 | | Ridge Regression | 0.1839 ± 0.0023 | 1.9378 ± 0.0369 | 0 | 0 | 1.2064
 | | Canal-Adaptive Elastic Net | 0.1843 ± 0.0014 | 1.9436 ± 0.0239 | 4271 ± 445 | 19.97% | 1.4879
 | 0.1 | Lasso | 0.2569 ± 0.0356 | 4.3298 ± 0.9613 | 0 | 0 | 0.976
 | | Elastic Net | 0.2041 ± 0.0219 | 2.6950 ± 0.5324 | 0 | 0 | 1.2822
 | | Ridge Regression | 0.1890 ± 0.0021 | 2.0031 ± 0.0424 | 0 | 0 | 1.2495
 | | Canal-Adaptive Elastic Net | 0.1809 ± 0.0018 | 1.8885 ± 0.0431 | 5143 ± 356 | 24.05% | 1.4029
 | 0.2 | Lasso | 0.3089 ± 0.0425 | 6.2770 ± 1.3394 | 0 | 0 | 0.9855
 | | Elastic Net | 0.2477 ± 0.0366 | 3.9761 ± 1.0450 | 0 | 0 | 1.2542
 | | Ridge Regression | 0.2085 ± 0.0018 | 2.8034 ± 0.0480 | 0 | 0 | 1.2359
 | | Canal-Adaptive Elastic Net | 0.1813 ± 0.0014 | 1.9111 ± 0.0327 | 6161 ± 562 | 28.81% | 1.3915
 | 0.3 | Lasso | 0.3736 ± 0.0358 | 9.5432 ± 2.4227 | 0 | 0 | 0.9999
 | | Elastic Net | 0.2625 ± 0.0244 | 4.4683 ± 0.7597 | 0 | 0 | 1.2752
 | | Ridge Regression | 0.2320 ± 0.0024 | 3.5514 ± 0.0530 | 0 | 0 | 1.2178
 | | Canal-Adaptive Elastic Net | 0.1825 ± 0.0023 | 1.9490 ± 0.0373 | 7803 ± 429 | 36.48% | 1.4096