1. Introduction
The selection of important explanatory variables is one of the central tasks in statistics. For this reason, significant progress has been made on penalized regression methods, which perform parameter estimation and variable selection simultaneously. The Lasso [1] and its derivatives, such as the fused Lasso [2], the adaptive Lasso [3], and the elastic net [4], select variables and estimate model coefficients at the same time, and are therefore favored by many researchers. Because these methods are comparatively simple to solve, they also play an important role in practical applications. The ℓ0 penalty can also be used for variable selection, but its non-convexity makes the solution difficult to obtain, so various algorithms have been proposed and applied, for example, the Smoothed L0 (SL0) algorithm [5]. Tibshirani et al. [6] proposed the generalized Lasso; by designing the penalty matrix appropriately, the generalized Lasso can be reduced to the Lasso or the adaptive Lasso, and they also proved its properties and designed a solution algorithm. However, both of the methods mentioned above are based on least squares, which is not robust. In statistics, outliers are often taken to be observations that deviate from the mean by more than two standard deviations. When a sample contains such abnormal observations, the coefficient estimates obtained by least squares can deviate greatly from the truth. In a regression setting, the robustness of an estimator is closely tied to the choice of loss function, so we consider replacing least squares with a robust loss. Penalized robust regression, which combines different loss functions with different penalty functions, has attracted significant attention and seen considerable development, for example, penalized regression based on the Huber loss [7], Least Absolute Deviation (LAD)-Lasso [8], and penalized quantile regression [9]. Although these existing penalized regressions can yield robust parameter estimators, their robustness (such as the breakdown point and influence function) has not been well characterized. To this end, a variable selection procedure based on the exponential squared loss was proposed [10] and shown to possess consistency and the oracle property; it also has better robustness and numerical performance than the penalized regression methods mentioned above. Therefore, in this paper, the least squares component of the generalized Lasso is replaced by the exponential squared loss when constructing the objective model within a penalized likelihood framework [11], which yields the objective function used in the remainder of the paper.
In practical applications, there are many cases where some information about the true model can be obtained in advance; this is called prior information. The traditional way of using prior knowledge is the Bayesian approach, but we can also exploit prior information by placing constraints on the parameters. For example, in the study by Fan, Zhang, and Yu [12], the sum of all portfolio weights is 1, which is a linear equality constraint; in this case, the problem of choosing the portfolio with the maximum return becomes a linear regression problem with a Lasso penalty and a linear constraint. Shape restrictions in non-parametric regression [13] require adding linear inequality constraints on the coefficients. To solve the problems arising in these application scenarios, corresponding models have been established. In addition, variable selection methods under linear constraints based on least squares [14] and on the robust quantile loss [15] have been developed, together with solution algorithms and numerical simulation studies. However, these authors neither conducted simulations with outliers nor specifically characterized the robustness of their models. Moreover, variable selection methods based on the quantile loss may lose efficiency when the data are normally distributed [16]. In this paper, for convenience of study, we incorporate the prior information into the variable selection model in the form of linear constraints, so as to build an optimization model under linear constraints.
Based on the discussion above, we use the exponential squared loss function and the generalized Lasso penalty to construct the objective model, and then add the linear constraints derived from the prior information, which yields the following problem:
where the response vector, the design matrix, and the parameter vector enter the loss in the usual way. The penalty matrix D is chosen according to the real application; different choices of D yield different forms of the penalty, including the Lasso, adaptive Lasso, and fused Lasso. The matrices and vectors appearing in the linear constraints can be determined from the prior information, λ and γ are tuning parameters, and ‖·‖₁ denotes the ℓ₁-norm of a vector.
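To make these ingredients concrete, here is a minimal sketch of how prior information of the kind discussed above (a sum-to-one portfolio constraint and a nonnegativity shape restriction) can be written as linear constraint matrices, together with a fused-Lasso-type penalty matrix D. The names A_eq, b_eq, A_ineq, and b_ineq are illustrative and are not the paper's notation.

```python
import numpy as np

p = 5                                              # number of coefficients (illustrative)
# Equality constraint from the portfolio example: weights sum to one, i.e., 1' beta = 1
A_eq, b_eq = np.ones((1, p)), np.array([1.0])
# Inequality constraints from a shape restriction: beta_j >= 0, written as -beta_j <= 0
A_ineq, b_ineq = -np.eye(p), np.zeros(p)
# A fused-Lasso-type penalty matrix D: first differences of adjacent coefficients
D = np.eye(p)[:-1] - np.eye(p, k=1)[:-1]
print(D @ np.array([1.0, 1.0, 2.0, 2.0, 2.0]))     # zero wherever neighbours are equal
```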
In fact, the above model is an optimization problem with a non-convex objective under linear constraints, and it is often difficult to solve such a problem directly. Therefore, we transform the model into a more tractable form before attempting to solve it. X. Wang et al. [10] showed that the exponential squared loss can be approximated via Taylor expansion, transformed into a quadratic model, and then solved by the Batch Size Gradient Descent (BSGD) method. Based on this, we first transform the model into a quadratic programming problem with linear constraints, then derive its dual model, and finally solve the dual problem using the coordinate descent method. We work with the dual problem because coordinate descent is significantly more efficient when applied to the dual than to the original quadratic programming problem [14]. The main contributions of this paper lie in the following aspects:
- (i)
In terms of computing, an effective algorithm is designed in this paper. The original non-convex model is first transformed, then the tuning parameters λ and γ are selected using established criteria, and finally the coefficients are estimated by the designed algorithm.
- (ii)
In terms of theory, building on previous work, a formula for calculating the degrees of freedom of the model is derived to measure model complexity.
- (iii)
In terms of robustness, three sets of simulated data with different types of outliers are designed in this paper. The results obtained by the model on these three data sets are in line with expectations, which indicates that the model can resist noise points and outliers.
This paper is organized as follows. In Section 2, we elaborate on the construction method of our model, derive its dual form, and calculate the degrees of freedom. In Section 3, we present the parameter selection method and establish the solution algorithm of the model. In Section 4, we design two sets of experiments: numerical simulation and real-data experimentation. Finally, we summarize and discuss this paper.
2. Methodology
This section covers several topics related to the proposed method. Firstly, we discuss some basic properties of the exponential squared loss function. Then, a robust variable selection method with linear constraints is constructed on the basis of this loss function. Next, the problem is transformed into a quadratic programming formulation with linear constraints. Afterwards, the dual form of the transformed quadratic programming problem is derived, along with the Karush–Kuhn–Tucker (KKT) conditions of its dual solution, which are the optimality criteria for constrained nonlinear programming problems. Finally, the degrees of freedom related to model complexity are discussed.
2.1. Exponential Squared Loss Function
In the design of a variable selection method, the choice of loss function often determines the robustness. The exponential squared loss function enhances robustness against outliers by imposing an exponential decay on the gradient magnitude for large residuals. Unlike the squared error loss, whose gradient grows linearly with the residual, the exponential squared loss suppresses the influence of large errors through its multiplicative exponential term. This bounded gradient ensures that outliers contribute minimally to parameter updates, while smaller residuals retain near-quadratic behavior for efficient learning. The exponential squared loss has been shown to perform well as a loss function in AdaBoost for classification problems [17]. Specifically, AdaBoost (Adaptive Boosting), a well-known ensemble learning algorithm for classification that iteratively adjusts sample weights and combines weak classifiers through weighted majority voting, demonstrated enhanced robustness against label noise when employing this loss function. In the regression setting, X. Wang et al. [10] used the exponential squared loss to design a noise-resistant variable selection method, which also performed well. In energy management systems, the exponential squared loss function can be used to obtain robust estimates of signal states in the presence of gross errors [18]. In this paper, we also use this loss function to construct a robust model with linear constraints.
The exponential squared loss function has the following form:
φ_γ(t) = 1 − exp(−t²/γ).  (4)
In (4), t denotes a regression residual, and γ > 0 is a tuning parameter whose value determines the robustness of the coefficient estimates. When γ is large (usually greater than 10), φ_γ(t) is approximately equal to t²/γ, so in the limit the estimator closely resembles the least squares estimator. When γ is small, observations with large absolute residuals t produce losses close to the upper bound of φ_γ, which diminishes their impact on the parameter estimation. Therefore, selecting a smaller γ limits the influence of an outlier on the estimates, at the cost of decreased responsiveness of the estimator to variations in the data. We use the following two figures to explain intuitively why the exponential squared loss function has a certain degree of robustness.
Firstly, we consider γ as a small fixed value. When there are noise points or abnormal data in the sample, the absolute value of the residual t tends to be large. From Figure 1a,b, we can see that the value of the exponential squared loss function gradually approaches 1 as the absolute value of the residual t increases, so such observations have only a small impact on the estimation of the coefficients. In contrast, as the absolute value of the residual t increases, the least squares loss grows without bound, which leads to an inaccurate estimation of the coefficients.
When the value of γ is large, t²/γ is small, and by the substitution of equivalent infinitesimals (1 − exp(−x) ~ x as x → 0) we obtain φ_γ(t) ≈ t²/γ. This approximation implies that the exponential squared loss behaves like a rescaled least squares loss in this regime.
In conclusion, the selection of the parameter γ in the exponential squared loss function should be guided by the statistical properties of the data and the desired robustness–efficiency trade-off.
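As a brief numerical illustration of the behaviour described above, the following sketch evaluates the loss and its derivative for several residuals, assuming the form φ_γ(t) = 1 − exp(−t²/γ); the bounded derivative that vanishes for large |t| is what limits the influence of outliers, while a large γ recovers a rescaled squared error.

```python
import numpy as np

def exp_squared_loss(t, gamma):
    """Assumed form of the exponential squared loss: 1 - exp(-t^2 / gamma)."""
    return 1.0 - np.exp(-t**2 / gamma)

def exp_squared_grad(t, gamma):
    """Derivative with respect to the residual t; bounded and ~0 for large |t|."""
    return (2.0 * t / gamma) * np.exp(-t**2 / gamma)

residuals = np.array([0.1, 0.5, 1.0, 3.0, 10.0])    # the last value mimics an outlier
for gamma in (0.5, 2.0, 20.0):
    print(f"gamma={gamma:5.1f}",
          "loss:", np.round(exp_squared_loss(residuals, gamma), 3),
          "grad:", np.round(exp_squared_grad(residuals, gamma), 3))
# Small gamma: the loss saturates near 1 and the outlier's gradient is ~0.
# Large gamma: the loss is close to t^2 / gamma, i.e., a rescaled squared error.
```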
2.2. Penalized Exponential Squared Loss with Linear Constraints
The idea of model construction is as follows: Firstly, we design a general unconstrained variable selection method based on exponential squared loss and generalized Lasso penalty. Then, prior information is incorporated into the variable selection method as constraints. Finally, variable selection is reformulated as a constrained optimization problem. Notably, we mainly consider the case where prior information exists as linear constraints.
The model we built is as follows:
where the dimensions of the response vector, design matrix, and coefficient vector are as before. D is a matrix chosen based on the application, the matrices and vectors in the linear constraints can be determined from the prior information, and ‖·‖₁ represents the ℓ₁-norm of a vector. Since a constant term has no effect on the solution of an optimization model, we remove the constant term in Equation (5), and the model takes the following form:
Equation (8) contains a non-convex term. As a result, the computational complexity of finding a numerical solution is high, making the programming process challenging. For this reason, we apply a transformation to Equation (8) to turn it into a form that can be processed easily. The basic idea behind this transformation is Taylor expansion; this strategy has also been employed in prior research [7,10,18]. Specifically, we use the Taylor formula to expand the exponential squared function in Equation (8), and by discarding terms above the second order we convert it into a quadratic form.
Let
Assume that an initial estimate of the coefficients of the explanatory variables is available; it can be obtained in various ways, for example via an MM-estimator (a robust regression estimator computed in multiple M-estimation stages) or a least squares estimator. Because this initial estimate is obtained by optimizing the objective with only the loss function and no penalty term, the first-order (gradient) term of the expansion vanishes at it. We can then use Taylor's formula to expand (11) at this initial estimate to get
Substituting Equation (12) into (8), the model can be transformed into the following form:
where Q is the Hessian matrix of the loss at the initial estimate. Formulas (13)–(15) constitute a quadratic programming problem with linear constraints, which is easier to solve than the original model. Next, we remove the constant term from Equation (13), since it does not affect the optimization. At the same time, we also consider how to handle the Hessian matrix, so as to further simplify the model.
The Hessian matrix is denoted by Q. According to the previous statement, the following holds:
Q can be calculated using the following formula:
in which one factor is a Jacobian matrix and the other is a diagonal matrix whose diagonal entries depend on the residuals. Existing research [18] shows that for the model defined by Equations (13)–(15) to attain its maximum, the matrix Q must be negative definite. Let
Therefore, the definiteness of Q is determined by that of this diagonal matrix. By standard linear algebra, if all of its diagonal elements are positive, then the diagonal matrix is positive definite, which makes Q negative definite. However, when there are outliers or contaminated data, some residuals become too large and the corresponding diagonal elements are negative. Therefore, we relax the condition on the diagonal elements as follows:
The bound in (16) is a small negative number, such as −0.01, which is selected according to the specific situation of the problem. If (16) holds, we regard the second-order condition of the optimization problem as satisfied.
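The sketch below illustrates the kind of quadratic approximation used above, assuming the loss form 1 − exp(−t²/γ) with standard notation (design matrix X, response y, initial estimate beta0). The diagonal-weight factorization of the Hessian and the relaxed condition on the diagonal entries are only schematic stand-ins for Equations (13)–(16).

```python
import numpy as np

def quad_approx(X, y, beta0, gamma, tol=-0.01):
    """Second-order ingredients of the exponential squared objective at beta0.

    Assumes we maximize sum_i exp(-r_i^2 / gamma) with residuals r_i = y_i - x_i' beta.
    Returns the diagonal weights, the Hessian Q = -X' diag(weights) X, and a flag for
    the relaxed second-order condition (all weights above a small negative bound)."""
    r = y - X @ beta0
    w = np.exp(-r**2 / gamma)
    # Weights are positive for small residuals and negative for gross outliers.
    weights = (2.0 / gamma) * w * (1.0 - 2.0 * r**2 / gamma)
    Q = -X.T @ (weights[:, None] * X)
    return weights, Q, bool(np.all(weights >= tol))

# Toy usage with one injected outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, 0.0, -2.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)
y[0] += 10.0
weights, Q, ok = quad_approx(X, y, beta_true, gamma=1.0)
print("min weight:", weights.min(), "relaxed condition holds:", ok)
```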
We now rearrange Equation (13): since it serves as the optimization objective function, its constant term does not affect the solution. Therefore, we remove the constant term from (13) and rewrite it as follows:
Substituting the above expression for the Hessian matrix into (17), we obtain
Finally, we get
where
The above is equivalent to
In this way, (20)–(22) constitutes the model we solve in the remainder of the paper, and it is much easier to handle than the previous model (13)–(15).
2.3. The Dual Problem and KKT Conditions
We adopt a methodology similar to that of Hu et al. [14] for processing the model (20)–(22). However, there is a difference between (20) and their model, which lies in the values of the quantities S and r. In their study, S and r are
which are obviously different from (18) and (19).
The aim of our research is to analyze the model (20)–(22). First, we derive the dual form of the primal problem. Then, we analyze the properties of the dual solution through its KKT conditions. Next, we present a proposition that establishes the relationship between the dual problem and the primal model. Finally, Section 3 develops a coordinate descent algorithm for the dual model.
We introduce a new variable z into (20). The model can then be transformed into the following form:
The Lagrangian of the above model is as follows:
where the three vectors appearing in it are Lagrange multipliers, and the multipliers associated with the inequality constraints are non-negative. Therefore, the resulting Lagrangian dual function is as follows:
Boyd and Vandenberghe [19] have discussed duality in convex optimization in detail. Since the matrix S is positive definite, the objective in Equation (23) is convex. Moreover, all constraints are linear, so the model consisting of (23)–(26) is a convex optimization problem, strong duality holds, and minimizing the primal model is equivalent to maximizing the dual problem.
We can divide L into the sum of two functions, one involving the original parameter vector and the other involving only z:
Minimizing L is equivalent to minimizing the two parts simultaneously. When the first part attains its minimum, the corresponding value of the parameter vector is
The result of minimizing the second part is
where ‖u‖∞ denotes the ∞-norm of u. The dual model of the original problem is then
where
We transform the original problem into its dual form because coordinate descent is more efficient on the dual than on the primal problem. The KKT conditions for the dual model are given below:
where the newly introduced quantities are defined from the dual variables and the data. When either of them is known, the KKT conditions above can be simplified. First, we define two boundary sets for the dual problem; both sets are related to u. By substituting Equation (7) into the KKT conditions above, we obtain
In fact, the primal model also has two similar boundary sets, which have been discussed in earlier studies [14], so we do not go into detail here.
Here, we provide a proposition that describes the uniqueness of the solution.
Proposition 1. If S is positive definite, then the minimizer of problem (20)–(22) and the corresponding fitted value are unique.
The proof of Proposition 1 is as follows.
Consider the model consisting of (20)–(22); the Hessian matrix of the objective function (20) is
Since S is positive definite, (20) is a strictly convex function.
Assume that two distinct vectors are both minimizers of the model consisting of (20) and (21); then we have
At the same time, we know
Let their midpoint be considered; it is easy to see that this midpoint satisfies all of the linear constraints. Therefore, we have
where equality holds only if the two minimizers coincide, which contradicts the assumption that they are distinct. Therefore, the minimizer is unique. In the meantime, since the fitted value is determined by the minimizer, it is also unique.
2.4. Degrees of Freedom
The degrees of freedom provide a measure of the complexity of a model. The same problem may admit multiple modeling approaches, and their results cannot be compared directly because the number of parameters differs across models. However, the results of different models can be compared at the same degrees of freedom, which makes the degrees of freedom useful for model selection and assessment.
When studying the theory of unbiased risk estimation, Stein [20] proposed a method to calculate the degrees of freedom. Later, building on Stein's work, Efron [21] studied a new estimation method, which reduces to Stein's formula in some cases. We assume that the components of Y = (Y_1, …, Y_n)^T are independent of each other. Here, the superscript "T" denotes the transpose, and Y ~ N(μ, σ²I) denotes that Y has a normal distribution with mean μ and covariance matrix σ²I, where I is the identity matrix. Let the vector of fitted values be g(Y), where g is a statistical model (a rule mapping the observations to fitted values). The degrees of freedom of the model can be calculated by the following formula:
df(g) = (1/σ²) ∑_{i=1}^{n} Cov(g_i(Y), Y_i).
The above formula can be interpreted as follows: as model complexity increases, the fitted values track the observed data more closely, so the correlation between the observations and the fitted values becomes stronger, which ultimately leads to higher degrees of freedom. When Y is normally distributed, Stein's identity Cov(g_i(Y), Y_i) = σ² E[∂g_i(Y)/∂Y_i] holds. If g is a continuously differentiable function of Y, then we have Stein's estimate of the degrees of freedom:
df(g) = E[∑_{i=1}^{n} ∂g_i(Y)/∂Y_i].
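As a small illustration of the covariance definition of the degrees of freedom (not the paper's own computation), the following Monte Carlo sketch estimates df for ordinary least squares, for which the true value equals the number of predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 60, 4, 1.0
X = rng.normal(size=(n, p))
mu = X @ rng.normal(size=p)                      # fixed mean vector

def ols_fit(y):
    """Fitted values g(y) of ordinary least squares; its df should equal p."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ beta

# df(g) = (1/sigma^2) * sum_i Cov(g_i(Y), Y_i), estimated over replicated noise draws.
reps = 2000
Y = mu + sigma * rng.normal(size=(reps, n))
fits = np.array([ols_fit(y) for y in Y])
cov_terms = ((fits - fits.mean(0)) * (Y - Y.mean(0))).mean(0)
print("Monte Carlo df:", cov_terms.sum() / sigma**2, "  exact:", p)
```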
which is Stein’s estimate. In order to apply the above theory, we need to prove that
is a continuous and almost differentiable function of
Y. In fact, according to Hu et al. [
14], we can prove that given
,
is continuous and
satisfies a uniformly Lipschitz condition. We give three lemmas in the following text. Based on these and Hu et al. [
14], we derive the formula for calculating degrees of freedom for the dual problem.
Lemma 1. For any fixed λ, there exists a set of measure zero such that, for every Y outside this set, there is a neighbourhood of Y on which the two boundary sets are constant.
Lemma 2. For any fixed λ, the fitted value is a continuous function of Y.
Lemma 3. For any fixed λ, the fitted value is a uniformly Lipschitz function of Y.
Then, we can obtain the following theorem.
Theorem 1. Assume that Y is a random vector whose components are independent, each following a normal distribution. For any matrices D, C, E and any scalar λ, we can obtain the formula for the degrees of freedom. Here, one of the reduced matrices denotes D with the rows or columns associated with u removed, and the other is defined analogously.
3. Choice of Tuning Parameters and Algorithm Design
In this section, the coordinate descent method is used to solve the dual model. However, besides the coefficients, the model also involves the tuning parameters γ and λ. Therefore, we also need to introduce corresponding methods for the selection of γ and λ.
3.1. Selection of γ
The selection of γ determines the robustness of the coefficient estimates, or in other words, their sensitivity to outliers. Following X. Wang et al. [10], we obtain a data-driven method for selecting γ from the sample, which bypasses prior identification and processing of outliers. At the same time, the value of γ obtained by this method yields high efficiency and robustness. This approach can be broken down into the following two steps:
1. Determining the set of outliers in the sample
First, we compute the residuals based on an initial robust estimate of the coefficients. Next, we obtain a robust estimate of the scale of these residuals. Finally, we determine the set of outliers as the observations whose residuals are large relative to this scale.
2. Updating γ
Updating γ essentially amounts to minimizing the determinant det(·) of an estimated covariance matrix of the estimator over a set of candidate values constructed using the outlier set from step 1. In this way, we can obtain an estimate of γ. A small sketch illustrating these two steps follows.
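The sketch below uses a MAD-based cutoff in step 1 and a placeholder covariance function in step 2; both are assumptions standing in for the paper's exact formulas.

```python
import numpy as np

def flag_outliers(residuals, c=2.5):
    """Step 1 (illustrative): flag observations whose residuals are large relative
    to a robust scale estimate (MAD); the cutoff constant c is an assumption."""
    med = np.median(residuals)
    scale = 1.4826 * np.median(np.abs(residuals - med))
    return np.where(np.abs(residuals - med) > c * scale)[0]

def select_gamma(grid, residuals, cov_estimate):
    """Step 2 (illustrative): pick gamma minimizing the determinant of an estimated
    covariance matrix; cov_estimate(gamma, residuals) stands in for the paper's formula."""
    dets = [np.linalg.det(cov_estimate(g, residuals)) for g in grid]
    return grid[int(np.argmin(dets))]

print(flag_outliers(np.array([0.2, -0.1, 0.3, 8.0, -0.2])))   # -> [3]
```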
3.2. Updating the Coefficient Estimates
Hu et al. [14] proposed the coordinate descent method for the dual of a quadratic programming problem under linear constraints. Building on their work, we first define a few new matrices as follows:
Before giving the coordinate descent algorithm for the dual problem, we state the formula that recovers the primal solution from the dual solution:
The four steps of the coordinate descent method are given below:
1. The initial value of the coefficient vector can be given by a robust estimate such as the MM-estimator. At the same time, we set the initial values of u and the other dual variable to zero vectors.
2. Update every component of u. For the kth component, holding all other components at their latest values, we first compute the coordinate-wise minimizer of the dual objective, then project it onto its feasible interval, and finally record the result as the new kth component of u.
3. Update every component of the second dual variable in the same coordinate-wise manner, using the latest values of u and of its own remaining components.
4. Calculate the value of P using Equation (27) to check for convergence. If the values of P in (27) obtained in two consecutive iterations satisfy the convergence condition, we stop the iteration and output the coefficient estimate; otherwise, we repeat steps 2–3 until convergence. A generic sketch of such a coordinate-wise update follows these steps.
3.3. Choice of λ
First, we determine an interval for selecting λ. Then, we choose an appropriate step length and obtain a series of λ values from the interval, which we call the set R. For each λ in R, we calculate the degrees of freedom and the value of the selection criterion using the following formula:
We expect the criterion values to form a U-shaped curve, and the λ at the lowest point of the curve is taken as the optimal value. The reason for choosing this criterion is that it requires less computation than selecting the optimal value by cross-validation.
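The grid search over λ can be organized as in the sketch below; fit_model and criterion are placeholders for the coordinate descent solver of Section 3.2 and the degrees-of-freedom-based criterion described above.

```python
import numpy as np

def choose_lambda(lams, fit_model, criterion):
    """Fit the model for every candidate lambda, evaluate the selection criterion,
    and return the lambda at the bottom of the (typically U-shaped) curve.
    fit_model(lam) and criterion(fit) are placeholders, not the paper's exact code."""
    scores = np.array([criterion(fit_model(lam)) for lam in lams])
    return lams[int(np.argmin(scores))], scores
```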
In the next section, we will discuss numerical simulations and real-data experiments using the algorithm given in this section.
5. Discussion
The main purpose of this paper is to design a robust variable selection method that incorporates prior information. To ensure robustness, that is, to resist the influence of abnormal or noisy observations, we chose the exponential squared loss function as the loss under a penalized likelihood framework. At the same time, to obtain a more realistic model, we added prior information to the variable selection method through linear constraints, which resulted in an optimization problem with linear constraints. Since the objective function is non-convex and therefore hard to handle directly, we employed a Taylor expansion to transform the model into a more tractable quadratic programming problem. Then, in order to apply the coordinate descent method more easily, we first derived the dual form of the quadratic programming problem. Finally, the coefficient estimates were obtained by applying the coordinate descent method to the dual problem. Note that several parameter selections were involved in the process, but we do not repeat them here.
After the algorithm was obtained, numerical simulations and real-data experiments were carried out. The experimental results show that, without outliers, the performance of our method is not significantly different from that of LSGL, although the latter is easier to operate. When outliers are present, our method significantly outperforms LSGL; in the third outlier scenario, LSGL is no longer effective, while our method maintains satisfactory performance. Our method handles outliers in the predictors, outliers in the response, and scenarios involving both. These results suggest that our method is indeed robust. Furthermore, as established in the work of X. Wang et al. [10], the breakdown point (BP) is fundamentally determined by the breakdown point of the initial estimate and the tuning parameter γ. The results of the real-data experiment show that the coefficients estimated by our method satisfy the linear constraints specified in advance, and that a sparse model is obtained by shrinking coefficients. In other words, both the numerical simulations and the real-data experiments show that our method meets the stated expectations.
So far, there has been little research on the oracle properties of variable selection models under linear constraints. In future research, in addition to the model proposed in this paper, we intend to prove in a unified way the oracle properties of variable selection models under linear constraints, for example, those based on the least squares loss and those based on the quantile loss.