1. Introduction
In clinical research, biological studies, and econometrics, many data sets are dynamic and change over time. To model and analyze such dynamic relationships between variables in time-series data, the Autoregressive Distributed Lag (ADL, see [1]) model is a powerful tool that has gained significant traction for its ability to capture both short-term and long-term dependencies and to facilitate the exploration of complex temporal dynamics [2]. The ADL model incorporates both autoregressive terms (lags of the dependent variable) and distributed lag terms (lags of the independent variables), allowing researchers to examine how past values of the explanatory variables influence the current value of the dependent variable. This makes it particularly useful in contexts where past information plays a crucial role in shaping future outcomes.
In many situations, the time-varying dependent variable of interest not only depends on its previous state but is also correlated with time-varying predictors. For such situations, the sequential modeling approach of [3] can be adopted. This approach models the dependent variable using its last state and the predictors at sequential time points. Designed to handle sequential information, sequential linear models are well-suited for absorbing the time-series process and predicting the dependent variable through time-varying lagged outcomes and predictors. For example, because interventions involving recovery times or chronic conditions entail outcome measures at various intermediate follow-up points, ref. [3] proposed sequentially predicting outcomes at intermediate time points, which are then used as predictors in the model for the subsequent time point.
An ordinary least squares (OLS) regression can also model dynamic data by incorporating past values of the lagged dependent variable to demonstrate the influence of dynamic variables on the prediction process. However, ref. [2] pointed out that correlation effects arise because the lagged dependent variable causes the coefficients of the independent variables to be biased toward lower values. To deal with this problem in sequential time-series data, ref. [2] introduced the lagged dependent variable in OLS regression, where the lagged dependent variable coefficient indicates the timing effect relative to the independent variable. Ref. [4] combined estimated residuals with the SCAD penalty and Yule–Walker equations to determine the order of autoregressive errors and estimate the autoregressive parameters. Ref. [5] proposed a new class of hierarchical lag structures (HLag), which embeds lag selection into a convex regularizer and improves forecasting performance.
Motivated by the ADL and the above related work, we employ a sequential linear modeling approach and utilize sequential linear models to describe the dependent variable by incorporating the lagged dependent variable and time-varying independent variables. Like the ADL model, sequential linear modeling faces challenges in estimation due to overfitting and multicollinearity, particularly when the number of potential explanatory variables is large. To address these challenges, recent advances in penalized regression techniques, such as the Lasso [6] and its variants, have gained popularity for variable selection and parameter estimation in dynamic models such as the ADL. These methods impose penalties on the regression coefficients, allowing for sparse models that are both interpretable and predictive.
We propose using the SCAD-penalized method to estimate and select variables in sequential linear modeling. The SCAD method, introduced by [7], is known for its ability to handle large-scale variable selection problems while providing less bias in coefficient estimation compared to the Lasso. The SCAD-penalized method has become a preferred choice in high-dimensional regression models due to its ability to perform variable selection while preserving model accuracy. We aim to evaluate the performance of sequential linear modeling estimated with the SCAD penalty and to compare it with the Lasso and adaptive Lasso [8,9], which have been widely applied for their effectiveness in variable selection and regularization. Our study examines the strengths and weaknesses of these penalized regression approaches by comparing their ability in estimation and variable selection. Specifically, we assess whether the SCAD-penalized method outperforms the Lasso and adaptive Lasso in terms of predictive accuracy and coefficient estimation. By doing so, we aim to contribute to the literature on penalized regression techniques and their applications to dynamic modeling, providing insights into the choice of regularization methods for dynamic and time-series data.
Although SCAD has been widely used since [7], its application in time-series and dynamic settings remains narrower than in purely cross-sectional regression. Ref. [4] studied SCAD in varying-coefficient models with autoregressive errors. Ref. [10] established SCAD consistency in partially linear models. Ref. [11] proposed data-driven tuning-parameter selection for SCAD via BIC-type criteria. Ref. [12] analyzed SCAD-type penalization in autoregressive models to distinguish stationary from nonstationary regimes. Ref. [13] derived oracle inequalities for adaptive penalized methods in high-dimensional correlated panel data. Ref. [14] used SCAD to perform change-point detection in dynamic panel models. The vector-autoregressive literature has also seen related hierarchical-lag regularization [5].
Among these works, the dynamic panel models of [13,14] are the broader class most closely related to sequential linear models. In both, however, the lag dimension is fixed, and the model is not refit at each time point. For this reason, the sequential linear model of [3] is not a special case of either framework. Its defining features are that (i) the dimension of the lagged-outcome coefficient vector grows with the time index t, and (ii) the model is refit at each time point with an updated history of dependent-variable measurements. These two features place the formulation outside the scope of existing SCAD theory for dynamic models. Extending SCAD theory to this setting is the gap addressed in the present paper. In particular, at each time point t, we establish the oracle property for the SCAD-penalized estimator in the high-dimensional regime where the nominal lag dimension may exceed the cross-sectional sample size.
The incremental contributions of this paper can be summarized as follows:
- We extend the SCAD-penalized framework to sequential linear models in which the lagged-outcome coefficient vector has dimension growing with the time index, a setting not covered by existing SCAD theory for dynamic panel or autoregressive models.
- We establish the oracle property for the proposed estimator (Theorem 1), including selection consistency and asymptotic normality on the active set, and we derive a corollary on lag-order selection consistency (Corollary 1). Our argument adapts the high-dimensional SCAD theory of [14,15] to the sequential-lag setting of model (2).
- We demonstrate, through simulations across low-, medium-, and high-dimensional settings and through two real longitudinal applications, that the proposed estimator outperforms OLS, Lasso, and adaptive Lasso on mean absolute prediction error, relative risk, and empirical coverage and that it recovers the correct effective lag order.
The remainder of the paper is organized as follows.
Section 2 introduces the sequential linear models and the regularity conditions.
Section 3 develops the SCAD-penalized method, states the oracle property theorem with a proof, gives the lag-order selection corollary, describes the coordinate-descent algorithm and tuning-parameter selection, and discusses algorithmic and statistical convergence.
Section 4 presents the simulation study.
Section 5 contains two applications.
Section 6 concludes and discusses.
2. Sequential Linear Models
We consider a sample of $n$ subjects with repeated outcomes $y_{it}$, $i = 1, \ldots, n$, observed over a sequence of time points $t = 1, \ldots, T$.
At each time point $t$, we have a model that sequentially predicts the outcomes based on the predictors $x_{it}$ with a parameter/coefficient vector $\beta_t \in \mathbb{R}^p$, where $p$ is the number of predictors, and a parameter/coefficient vector for the lagged outcomes whose length varies with $t$, denoted by $\gamma_t = (\gamma_{t,1}, \ldots, \gamma_{t,t-1})^\top$.
We can also have subsequent outcomes at any time point $t$, denoted by $Y_t = (y_{1t}, \ldots, y_{nt})^\top$. As the time index $t$ increases by one unit at each step, the outcomes can be predicted based on the covariates or predictors $X_t$ and the prior predicted outcomes $\hat{Y}_1, \ldots, \hat{Y}_{t-1}$. The distribution of the predicted outcomes is taken to be normal. We repeat the previous models until the terminal time point $T$.
Before proceeding with the sequential modeling, we can visualize the setting through an example. Measurements on patients can be collected at different time points, and these measurements include outcomes such as a measure detecting an illness together with predictors that are all time-varying, such as the many factors that affect this illness. We can then naturally assume that the outcome at the current time point is predicted by the predictors at that time point, together with the state of previous outcomes.
The sequential linear models are as follows.
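At the first time point $t = 1$, the linear model is
$$Y_1 = X_1\beta_1 + \varepsilon_1,$$
where the error term is normally distributed as $\varepsilon_1 \sim N(0, \sigma^2 I_n)$.
At the second time point $t = 2$, the subsequent linear model is
$$Y_2 = X_2\beta_2 + \gamma_{2,1} Y_1 + \varepsilon_2,$$
where $\varepsilon_2 \sim N(0, \sigma^2 I_n)$, and the outcome vector is $Y_1 = (y_{11}, \ldots, y_{n1})^\top$.
At the third time point $t = 3$, the subsequent linear model is
$$Y_3 = X_3\beta_3 + \gamma_{3,1} Y_2 + \gamma_{3,2} Y_1 + \varepsilon_3,$$
where $\varepsilon_3 \sim N(0, \sigma^2 I_n)$, and $Y_2 = (y_{12}, \ldots, y_{n2})^\top$.
For any time point $t$, the sequential linear models can be written as
$$Y_t = X_t\beta_t + \sum_{s=1}^{t-1} \gamma_{t,s} Y_{t-s} + \varepsilon_t, \tag{1}$$
where $\varepsilon_t \sim N(0, \sigma^2 I_n)$ and $\gamma_t = (\gamma_{t,1}, \ldots, \gamma_{t,t-1})^\top$ is the lag coefficient vector. This recursion is applied until the terminal time point $T$.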
We remark that sequential linear models involve lagged dependent variables across time points, so the key assumptions need to be verified before fitting the model and predicting. A conceptual description is given below. Precise mathematical statements follow after model (2) is introduced.
- The relationship between the dependent variable and the independent variables is linear.
- The error terms at each time point are independent and follow a normal distribution with constant variance.
- The independent variables do not exhibit perfect collinearity at each time point.
- The independent variables are uncorrelated with the error term.
- A fixed effect for each observation may be incorporated in the sequential linear model.
After model (2) is introduced, these key assumptions are phrased in mathematical form for use in the proofs. The sequential model in Equation (1) is a combination of linear predictors and autoregressive time-series regressions. In Section 3, we propose penalized methods to conduct estimation. Such methods enjoy a sparsity property, meaning that a suitable penalty term can shrink some parameter estimates exactly to zero. Relying on this sparsity property, variable selection can be implemented simultaneously with estimation. Estimating the predictor coefficient vector $\beta_t$ and the lag coefficient vector $\gamma_t$ in model (1) targets selection of the predictors/features within $X_t$ and the number of lags for the autoregressive part within $\gamma_t$. Regarding the number of lags, for example, if the estimation finds that only $\gamma_{t,1}$ and $\gamma_{t,2}$ are significant (i.e., do not shrink to zero), we conclude that the response variable at the current time point can be predicted by those for the last two time points, so the lag order equals 2. As time proceeds, the response variables typically become increasingly uncorrelated with distant past values, and the corresponding lag coefficients shrink to zero. It is therefore reasonable to expect that the response variable at the current time point depends only on recent time points, so the effective lag order is not a large integer. Our purpose is therefore to propose a well-functioning penalized method that selects the predictors in $X_t$ and the lag order in $\gamma_t$.
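Model (1) serves as a combined linear regression and autoregressive time-series specification, with the lag order determined by the proposed method. For a succinct form, we arrange all predictors at time $t$ into an $n \times p$ matrix
$$X_t = (x_{1t}, \ldots, x_{nt})^\top,$$
where $n$ is the sample size and $p$ is the number of predictors.
The response/outcome variables observed before time $t$ are represented by
$$\mathbf{Y}_{t-1} = (Y_{t-1}, Y_{t-2}, \ldots, Y_1),$$
which is an $n \times (t-1)$ matrix.
To represent model (1) for the outcome at time point $t$, we use the outcomes $\mathbf{Y}_{t-1}$ up to time point $t-1$ as predictors, and combine all predictors into one matrix
$$Z_t = (X_t, \mathbf{Y}_{t-1}).$$
For any time point $t$, the sequential linear models in Equation (1) can be rewritten succinctly as
$$Y_t = X_t\beta_t + \mathbf{Y}_{t-1}\gamma_t + \varepsilon_t, \tag{2}$$
where $X_t$ is the predictor matrix at time $t$, $Y_t$ is the response/outcome vector at time $t$, and $\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{nt})^\top$ is the error-term vector at time $t$. We assume that $\varepsilon_{1t}, \ldots, \varepsilon_{nt}$ are independent and identically distributed (i.i.d.) with variance $\sigma^2$. Again, model (2) is repeated until $t = T$. At the initial time point $t = 1$, the model does not involve the term $\mathbf{Y}_{t-1}\gamma_t$, which enters at all later time points $t \geq 2$.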
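The construction of the combined design matrix can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions, not the authors' implementation: the function name `build_design`, the list-of-lists data layout, and the column ordering (most recent outcome first, so that the column after the $p$ predictors holds the lag-1 outcome) are all assumptions made here.

```python
def build_design(X_t, Y_hist):
    """Return the combined design whose i-th row is (x_it, y_{i,t-1}, ..., y_{i,1}).

    X_t    : list of n rows, each a list of p predictor values at time t
    Y_hist : list of earlier outcome vectors [Y_1, ..., Y_{t-1}], each of length n
             (empty at the first time point)
    """
    n = len(X_t)
    for Y in Y_hist:
        if len(Y) != n:
            raise ValueError("every outcome vector must have n entries")
    # Most recent outcome first, so column p + s holds the lag-s outcome Y_{t-s}.
    lagged = list(reversed(Y_hist))
    return [list(X_t[i]) + [Y[i] for Y in lagged] for i in range(n)]


# Toy data (hypothetical): n = 3 subjects, p = 2 predictors, design for t = 3.
X3 = [[1.0, 0.5], [0.2, 1.5], [0.7, 0.1]]
Y1 = [10.0, 11.0, 12.0]
Y2 = [20.0, 21.0, 22.0]
Z3 = build_design(X3, [Y1, Y2])
print(Z3[0])  # [1.0, 0.5, 20.0, 10.0]
```

At the first time point the history is empty, so the design reduces to the predictor matrix alone, and each later time point adds one column.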
Specifically, in the sequential linear model (2), we partition the design matrix $Z_t$ (the predictors together with the lagged outcomes) as $Z_t = (Z_{t,1}, Z_{t,2})$, where $Z_{t,1}$ collects the columns corresponding to nonzero regression coefficients and $Z_{t,2}$ collects the columns corresponding to zero regression coefficients. We also write $Z_t = (z_{t,1}, \ldots, z_{t,p+t-1})$, where $z_{t,j}$ is the column corresponding to the $j$th parameter at time $t$. We also impose model assumptions for weak (or no) autocorrelation and absence of heteroscedasticity. Let $s$ be a positive integer labeling the lags. Without loss of generality, and similar to those in [14,16], the assumptions are as follows.
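- A1. $E(\varepsilon_t \mid Z_t) = 0$, where $Z_t$ collects the predictors and the lagged outcomes at time $t$.
- A2. $Y_t$ and $Y_{t+s}$ become asymptotically uncorrelated as $s \to \infty$, i.e., $\mathrm{Cov}(y_{it}, y_{i,t+s}) \to 0$.
- A3. The process is weakly (covariance) stationary. That is, the mean vector is constant in $t$, the variances are finite and constant in $t$ for all $j$, and the autocovariance depends only on the lag $s$ and not on $t$.
- A4. The active-set Gram matrix is almost surely positive definite for all $n$ sufficiently large and each fixed $t$.
- A5. $\varepsilon_t$ consists of $n$ i.i.d. random variables whose distribution does not vary with $t$, with mean zero, finite variance $\sigma^2$, and finite fourth moment.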
We note that A1 assumes that the error term is mean-independent of the predictors and lags. A2 encodes asymptotic uncorrelatedness, weakening linear dependence over time rather than requiring full independence. A2 does not imply full statistical independence unless additional distributional assumptions are imposed. A3 is weak stationarity, which is a mild and commonly used assumption for linear time-series models. Any strictly stationary process with finite second moments automatically satisfies A3, so the strictly stationary construction explored in Remark 1 below suffices for this compatibility. A4 is the identifiability condition for the active-set parameters. It guarantees that the active-set Gram matrix is invertible, so the regression on the active set is well-posed at each $n$. In the high-dimensional regime $p + t - 1 > n$, the full Gram matrix $Z_t^\top Z_t$ is necessarily rank-deficient, so identifiability cannot be imposed on the full parameter vector. When we develop the SCAD-penalized method in Section 3.2, we impose a further assumption that strengthens A4 to a uniform lower bound on the minimum eigenvalue (A7). A4 and A5 together guarantee that the model is well-posed and its active-set parameters are estimable.
For the sequential models in Equations (1) and (2), we aim to conduct variable selection and estimation in order to identify the most appropriate model to describe the time-varying data, to predict the outcomes, and to perform further inferences.
After we estimate the predictor coefficients $\beta_t$ and the lag coefficients $\gamma_t$ by $\hat{\beta}_t$ and $\hat{\gamma}_t$, respectively, the prediction of the dependent variable for the next time point is
$$\hat{Y}_t = X_t\hat{\beta}_t + \sum_{s=1}^{t-1} \hat{\gamma}_{t,s}\,\tilde{Y}_{t-s}, \tag{3}$$
where $\tilde{Y}_{t-s}$ denotes the lagged outcome at time $t-s$. Equation (3) is the prediction formula. Each time point’s predicted output is a function of the predictors at the current point and of the lagged response at previous time points, which is taken as the true $Y_{t-s}$ when available and as the predicted $\hat{Y}_{t-s}$ otherwise. In stable time-series settings, the entries of the lag coefficient vector $\gamma_t$ typically decrease in magnitude as the lag index increases [17], which further motivates the lag-order selection carried out by the penalized method in Section 3.
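The plug-in rule in the prediction formula (use the observed lagged outcome when it exists, otherwise fall back on the earlier prediction) can be sketched as follows. The function name `predict_outcome`, the dictionary layout keyed by time point, and the toy numbers are assumptions made for illustration only.

```python
def predict_outcome(x_row, beta_hat, gamma_hat, observed, predicted):
    """Predict one subject's outcome at time t = len(gamma_hat) + 1.

    x_row     : predictor values at time t (length p)
    beta_hat  : estimated predictor coefficients (length p)
    gamma_hat : estimated lag coefficients; entry s - 1 multiplies the outcome at t - s
    observed  : dict mapping a time point to the observed outcome, where available
    predicted : dict mapping a time point to an earlier predicted outcome
    """
    t = len(gamma_hat) + 1
    value = sum(b * x for b, x in zip(beta_hat, x_row))
    for s, g in enumerate(gamma_hat, start=1):
        past = t - s
        # Use the true outcome when it was measured, else the earlier prediction.
        lagged = observed[past] if past in observed else predicted[past]
        value += g * lagged
    return value


# Toy example (hypothetical numbers): t = 3, outcome at time 2 is unobserved.
y_hat = predict_outcome(
    x_row=[1.0, 2.0],
    beta_hat=[0.5, -1.0],
    gamma_hat=[0.5, 0.25],
    observed={1: 2.0},
    predicted={2: 3.0},
)
print(y_hat)  # 0.5
```

Here the linear predictor contributes $-1.5$, the lag-1 term uses the predicted value $3.0$, and the lag-2 term uses the observed value $2.0$.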
3. SCAD-Penalized Method for Estimation and Variable Selection
Having introduced sequential linear models (1) and (2) in the previous section, we now develop parameter estimation and variable selection for the sequential modeling. To facilitate prediction with the most appropriate model, we adopt penalized methods to improve estimation and variable selection and to mitigate overfitting. One typical penalized method is the Lasso [6], which introduces an $\ell_1$ penalty term. The Lasso estimator possesses consistency and sparsity, which makes the model more interpretable and efficient in predicting and handling dependencies in sequential modeling. Beginning with the Lasso, this section then introduces the adaptive Lasso and proposes the SCAD-penalized method in sequential linear models for estimation and variable selection.
3.1. Presentation of SCAD-Penalized Method for Estimation and Variable Selection
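In what follows, we present the methods used for estimation and variable selection in this study. The basic idea is to penalize Equation (2) to obtain the estimators. Recall that $Z_t$ denotes the design matrix containing all predictors $X_t$ and previous outcomes $\mathbf{Y}_{t-1} = (Y_{t-1}, \ldots, Y_1)$, i.e., $Z_t = (X_t, \mathbf{Y}_{t-1})$. Let $z_{it}^\top$ denote the $i$th row vector of $Z_t$, for $i = 1, \ldots, n$. For the regression in model (2), let $\theta_t$ be the parameter vector combining $\beta_t$ and $\gamma_t$, denoted by $\theta_t = (\beta_t^\top, \gamma_t^\top)^\top$. The dimension of $\beta_t$ is $p$ and the dimension of $\gamma_t$ is $t-1$, so the length of $\theta_t$ is $p+t-1$. We denote each parameter element of $\theta_t$ by $\theta_{t,j}$, for $j = 1, \ldots, p+t-1$.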
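To estimate the parameter vector $\theta_t$, we start with the Lasso method. According to model (2), the Lasso estimator is obtained by minimizing the penalized squared-error loss:
$$\hat{\theta}_t^{\mathrm{Lasso}} = \arg\min_{\theta_t} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_{it} - z_{it}^\top \theta_t \right)^2 + \lambda \sum_{j=1}^{p+t-1} |\theta_{t,j}| \right\}, \tag{4}$$
where $z_{it}^\top$ is the $i$th row of the design matrix, $\theta_{t,j}$ is the $j$th parameter element at time point $t$, $j = 1, \ldots, p+t-1$, and $\lambda \geq 0$ is the tuning parameter.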
The Lasso method was initiated by [6]. As established in the literature, the Lasso is a regularization method that simultaneously performs parameter estimation and variable selection, and its estimators are consistent and sparse. Specifically, the $\ell_1$ penalty $\lambda \sum_{j} |\theta_{t,j}|$ shrinks the coefficients of non-significant variables to zero, giving the Lasso the sparsity property on which variable selection relies.
In sequential linear modeling, observed or generated data are modeled at each time point, creating a dependency between successive iterations. A common challenge in this framework is multicollinearity, arising from significant correlations between the predictors. The Lasso regression method helps handle multicollinearity by automatically selecting relevant predictors and potentially excluding correlated ones. It also addresses temporal dependencies by considering the sequential ordering of observations and selecting lagged features that contribute to predicting future values.
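When different weights are assigned to different parameters in $\theta_t$, the procedure is called the adaptive Lasso [8], which is essentially a weighted $\ell_1$ penalization method. Under certain regularity conditions, the estimators of $\theta_t$ are
$$\hat{\theta}_t^{\mathrm{AL}} = \arg\min_{\theta_t} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_{it} - z_{it}^\top \theta_t \right)^2 + \lambda \sum_{j=1}^{p+t-1} w_j |\theta_{t,j}| \right\}, \tag{5}$$
where $w_j$ is the weight for parameter element $\theta_{t,j}$ (e.g., the reciprocal of a preliminary estimate), and $\lambda$ is the tuning parameter.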
The Lasso and adaptive Lasso in Equations (4) and (5) are employed for comparison against the SCAD-penalized method introduced below.
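We now apply the penalized method with the SCAD penalty [7] to estimate the parameters in sequential linear models (1) and (2). The estimator is defined as the minimizer of the sum of the squared error and the SCAD penalty:
$$\hat{\theta}_t = \arg\min_{\theta_t} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_{it} - z_{it}^\top \theta_t \right)^2 + \sum_{j=1}^{p+t-1} p_\lambda(|\theta_{t,j}|) \right\}, \tag{6}$$
where $|\cdot|$ denotes the absolute value, and the penalty function $p_\lambda(\cdot)$, following [7,18], is explicitly written as
$$p_\lambda(|\theta|) = \begin{cases} \lambda |\theta|, & |\theta| \leq \lambda, \\ \dfrac{2a\lambda|\theta| - \theta^2 - \lambda^2}{2(a-1)}, & \lambda < |\theta| \leq a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\theta| > a\lambda, \end{cases} \tag{7}$$
where $a > 2$ is an additional tuning parameter to $\lambda$ that controls the smoothness of the penalty, and $\lambda$ controls the strength of the penalty. Equation (7) is a quadratic spline with knots at $\lambda$ and $a\lambda$. It is continuous, with the first derivative of
$$p'_\lambda(\theta) = \lambda \left\{ I(\theta \leq \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda) \right\}, \quad \theta > 0, \tag{8}$$
where $a > 2$ and $(x)_+ = \max(x, 0)$. Expression (8) shows that the SCAD penalty is continuously differentiable on $(-\infty, 0) \cup (0, \infty)$ except at the origin, where it is singular. The derivatives vanish outside the interval $[-a\lambda, a\lambda]$.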
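The three regimes of the SCAD penalty and its derivative are easy to check numerically. The sketch below uses the standard SCAD form of Fan and Li with the common default a = 3.7; the function names are hypothetical, and the derivative is taken with respect to the absolute value of the coefficient.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty evaluated at |theta| (requires lam > 0 and a > 2)."""
    if lam <= 0 or a <= 2:
        raise ValueError("need lam > 0 and a > 2")
    x = abs(theta)
    if x <= lam:                      # Lasso-like linear part
        return lam * x
    if x <= a * lam:                  # quadratic transition between the knots
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2    # constant: no further penalization


def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |theta|, for |theta| > 0."""
    x = abs(theta)
    if x <= lam:
        return lam
    return max(a * lam - x, 0.0) / (a - 1)


lam = 1.0
for x in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(x, scad_penalty(x, lam), scad_derivative(x, lam))
```

The printed values show the penalty growing linearly up to the first knot, bending between the knots, and staying flat beyond the second knot, where the derivative is zero and large coefficients are no longer shrunk.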
3.2. Oracle Property
We now explore the oracle property of the SCAD-penalized estimator in the sequential linear model. For clarity, we write . Let be the true active set at time t, with cardinality . Let be the index set of the significant SCAD estimators, with cardinality . We organize the true parameter vector as , where collects the nonzero coefficients and collects the zero coefficients. Similarly, we organize as , and set .
Following [14], to establish the oracle property within a high-dimensional and sparse framework, we impose the following supplementary regularity conditions from [19], in conjunction with A1–A5.
- A6.
There exists a positive constant
such that,
for all
and for all
n, at time
t.
- A7.
There exists a positive constant
such that
for all
with
.
- A8.
for some .
- A9.
There exist positive constants
and
such that
and
A6 bounds the squared column norms of
and is trivial when the predictors and lagged outcomes are normalized. A7 is the restricted eigenvalue condition and ensures that the active-set Gram matrix
is strictly positive definite uniformly in
n, so that the predictors and the lags of
are sufficiently spread out for correct variable selection even when the number of predictors and lags is much larger than
n. A8 controls the divergence rate of the effective dimension. It ensures that the number of significant predictors
K grows at a controlled polynomial rate relative to
n and prevents overfitting. A9 mandates a gap of order
between the smallest nonzero coefficient and zero, ensuring that the signal is not overwhelmed by the stochastic error. See [
14,
15] for further background.
Before turning to the oracle theorem, we verify that A1–A9 are jointly satisfiable.
Remark 1 (Compatibility of Assumptions A1–A9)
. Assumptions A1–A9 are mutually consistent. A concrete construction satisfying all of them is as follows. Let be a doubly-indexed sequence of i.i.d. vectors, independent across i and t. For simplicity, fix a maximum lag and time-invariant lag coefficients , with , satisfyingDefine the nonzero entries of by for , and for . Let , independent across i and t. Then the process admits a strictly stationary causal solution [17], which is therefore also weakly stationary. A1–A9 all hold for any sparsity index with . Note that an analogous construction applies to any nonzero-lag support with (for instance, sparse supports with non-consecutive lags), provided the corresponding polynomial satisfies the same root condition. Although the parameter vector has nominal dimension that grows with t, A8 requires all but a shrinking fraction of its entries to be zero. The resulting finite effective lag order is precisely what makes the weak stationarity in A3 compatible with the growing nominal dimension of . The simulation study in Section 4, which uses with effective lag order , is an explicit instance of this construction. We now state the oracle property as a formal theorem.
Theorem 1 (Oracle property of the SCAD-penalized estimator).
Assume that model (2) holds, assumptions A1–A9 are satisfied, and the tuning parameter λ satisfies , , and as . Fix . Assume further that there exists a positive-definite matrix such that as . Let denote a local minimizer of the SCAD criterion in (6). Then, as , the following conclusions hold.- (i)
- (ii)
Asymptotic normality on the active set.
Let be a fixed integer and let be a sequence of matrices satisfying as , where is a positive-definite matrix. Thenwhere denotes the symmetric positive-definite square root of .
Proof. The proof proceeds by reduction to Theorem 2 of [20], which establishes the oracle property for SCAD-penalized estimators under a diverging number of parameters. We verify that the hypotheses of that theorem are satisfied by the sequential linear model (2) under A1–A9 and the tuning-parameter conditions of Theorem 1.
The asymptotic regime is fixed $t$ with $n \to \infty$. At each fixed $t$, model (2) is an $n$-sample regression across i.i.d. subjects (by A5), with temporal dependence entering only through the lagged-outcome regressors that are part of the design matrix. This matches the i.i.d. across-subjects setting assumed in [20].
First, model (
2) is the Gaussian linear model
with parameter vector
. This is a special case of the general likelihood setup ([
20], Section 1.2) with log-density
. Under this specialization, the Fisher information reduces to
, and the likelihood-regularity conditions (E), (F), and (G) of [
20] translate into conditions on the design and error distribution of model (
2).
The correspondence between A1–A9 of the present paper and conditions (A)–(H) of [20] is as follows.
Penalty conditions (A)–(D) are satisfied by the SCAD penalty with
under the tuning-parameter rates in Theorem 1. In particular,
and
eventually ([
20], Section 3.1.1).
Condition (E) (i.i.d. observations with common density, identifiable model, score with mean zero at the truth) follows from A5 (i.i.d. subjects with Gaussian errors, giving common support), A4 (active-set Gram matrix positive definite, giving identifiability on the active set), and A1 (conditional mean-zero errors, giving score mean zero under correct specification).
Condition (F) (Fisher information positive definite with bounded eigenvalues and bounded fourth moments of score components) follows from A7 (active-set Gram matrix bounded below), A6 (column norms bounded above), and A5 (finite fourth moment of ).
Condition (G) (third-derivative regularity of the log-density) is trivial for the Gaussian linear model because the log-density is quadratic.
Condition (H) (minimum signal condition ) follows from A9 combined with the tuning-parameter rate , which together give .
The sparsity rate A8 supplies the growth condition on the number of parameters required by [
20], Theorem 2, provided
is sufficiently small relative to the rate constraints imposed there.
The remaining conditions, A2 (asymptotic uncorrelatedness) and A3 (weak stationarity), are descriptive of the sequential linear model framework and support the interpretation of the assumptions.
Under these conditions, [
20], Theorem 2, yields the existence of a local minimizer
of the SCAD criterion (
6) satisfying
together with consistency on the active set at rate
, which combined with A9 gives
for all
and hence conclusion (i) of Theorem 1. The same theorem also yields
for
and
as in the statement of Theorem 1(ii), which is conclusion (ii).
The analogous argument for the finite-dimensional SCAD oracle property appears in [7], Theorems 1 and 2, and has been extended to high-dimensional settings by [15]. □
The selection-consistency conclusion of Theorem 1(i) automatically implies a lag-order selection guarantee, which we record as the following corollary.
Corollary 1 (Lag-order selection consistency)
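. Define the true effective lag order at time $t$ as
$$q_t = \max\{ s \in \{1, \ldots, t-1\} : \gamma_{t,s} \neq 0 \}$$
and its SCAD-based estimator as
$$\hat{q}_t = \max\{ s \in \{1, \ldots, t-1\} : \hat{\gamma}_{t,s} \neq 0 \},$$
with the convention that the maximum of an empty set is zero. Under the conditions of Theorem 1, $P(\hat{q}_t = q_t) \to 1$ as $n \to \infty$.
Proof. The lag coefficient $\gamma_{t,s}$ for $s = 1, \ldots, t-1$ corresponds to coordinate $p+s$ of $\theta_t$, since $\theta_t = (\beta_t^\top, \gamma_t^\top)^\top$ with $\beta_t \in \mathbb{R}^p$. The map $s \mapsto p+s$ is a bijection between lag indices $\{1, \ldots, t-1\}$ and the lag coordinates $\{p+1, \ldots, p+t-1\}$ of $\theta_t$. Under this bijection, $\gamma_{t,s} \neq 0$ if and only if coordinate $p+s$ belongs to the true active set, and analogously for the SCAD-selected counterpart. Hence, the event that the SCAD-selected set coincides with the true active set implies $\hat{q}_t = q_t$, and the probability of this event tends to one by Theorem 1(i). □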
We note that Corollary 1 shows that the SCAD-penalized method performs predictor selection and lag-order selection simultaneously and consistently within a single optimization, without requiring a separate information-criterion search over candidate lag orders. In the sequential setting where the nominal dimension of the lag coefficient vector grows with t, this is a meaningful computational and statistical advantage.
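Reading the effective lag order off a fitted coefficient vector amounts to finding the largest lag index with a nonzero coefficient. A minimal sketch follows, assuming the fitted vector stores the p predictor coefficients first and the lag coefficients afterwards in order of increasing lag; the function name and the optional tolerance `tol` are hypothetical.

```python
def effective_lag_order(theta_hat, p, tol=0.0):
    """Largest lag s whose estimated coefficient is nonzero; 0 if none is.

    theta_hat : fitted coefficients, p predictor entries followed by the lag entries
    p         : number of predictors
    tol       : magnitudes at or below tol are treated as zero (assumed option)
    """
    lag_coefs = theta_hat[p:]
    selected = [s for s, g in enumerate(lag_coefs, start=1) if abs(g) > tol]
    return max(selected, default=0)   # maximum of an empty set is taken as zero


# Three predictors followed by four lag coefficients: only lags 1 and 2 survive.
print(effective_lag_order([0.4, 0.0, 1.2, 0.6, 0.3, 0.0, 0.0], p=3))  # 2
```

Because the definition uses a maximum, a sparse support with gaps (for instance lags 1 and 3 selected, lag 2 dropped) yields lag order 3.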
3.3. Algorithmic Implementation
As a consequence of the sparsity property established in Theorem 1, the SCAD-penalized method selects the true model, yielding a sparse set of solutions and approximately unbiased coefficients for significant predictors and for significant lags of the outcome.
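In the orthonormal special case, the SCAD objective (6) is separable in the coordinates, with closed-form minimizer
$$\hat{\theta}_{t,j} = \begin{cases} \mathrm{sgn}(z_j)\,(|z_j| - \lambda)_+, & |z_j| \leq 2\lambda, \\ \dfrac{(a-1) z_j - \mathrm{sgn}(z_j)\, a\lambda}{a-2}, & 2\lambda < |z_j| \leq a\lambda, \\ z_j, & |z_j| > a\lambda, \end{cases} \tag{9}$$
where $\mathrm{sgn}(\cdot)$ is the sign function and $z_j$ denotes the unpenalized least-squares coefficient for the $j$th coordinate.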
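The closed-form rule for the orthonormal case can be written as a small function. This sketch follows the standard SCAD thresholding rule of Fan and Li with a = 3.7 as default; the function name is hypothetical.

```python
def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding of an unpenalized least-squares coefficient z (a > 2)."""
    sign = (z > 0) - (z < 0)
    if abs(z) <= 2 * lam:
        # Soft-thresholding region: small coefficients are set to zero.
        return sign * max(abs(z) - lam, 0.0)
    if abs(z) <= a * lam:
        # Transition region: shrinkage is progressively relaxed.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large coefficients are left untouched, hence nearly unbiased.
    return z


for z in (0.5, 1.5, 3.0, 5.0, -1.5):
    print(z, scad_threshold(z, lam=1.0))
```

The rule is continuous at both switch points: at |z| = 2λ both branches give λ, and at |z| = aλ both give aλ.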
For the general non-orthonormal design that arises in model (2), Equation (9) is no longer an exact closed-form solution to problem (6). Instead, we compute the estimator $\hat{\theta}_t$ iteratively via coordinate-descent with SCAD thresholding: at each coordinate, the thresholding rule (9) is applied with all other coordinates held fixed (so $z_j$ in (9) is the current partial-residual univariate least-squares update for coordinate
j), and the algorithm iterates until the relative change in the objective falls below a convergence tolerance (we used
in our experiments). This is the solver implemented in the
ncvreg R package 3.16.0, which we used throughout the simulations and applications. The convergence properties of this algorithm, together with the statistical rate of convergence of the resulting estimator, are discussed in
Section 3.5.
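The coordinate-descent scheme can be sketched in plain Python. This is an illustrative sketch, not the ncvreg implementation: it assumes the objective is (1/2n) times the residual sum of squares plus the SCAD penalty, that every column has been rescaled to unit mean square (so the univariate update is exactly the SCAD thresholding rule), and that there is no intercept. The tolerance `1e-8`, the helper names, and the toy data are assumptions.

```python
import random


def scad_threshold(z, lam, a=3.7):
    sign = (z > 0) - (z < 0)
    if abs(z) <= 2 * lam:
        return sign * max(abs(z) - lam, 0.0)
    if abs(z) <= a * lam:
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    return z


def scad_penalty(theta, lam, a=3.7):
    x = abs(theta)
    if x <= lam:
        return lam * x
    if x <= a * lam:
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2


def objective(Z, y, theta, lam, a=3.7):
    """(1 / 2n) * residual sum of squares plus the SCAD penalty."""
    n = len(y)
    rss = sum((y[i] - sum(z * t for z, t in zip(Z[i], theta))) ** 2 for i in range(n))
    return rss / (2 * n) + sum(scad_penalty(t, lam, a) for t in theta)


def scale_columns(Z):
    """Rescale each column to unit mean square (no centering in this sketch)."""
    n, d = len(Z), len(Z[0])
    scale = [(sum(Z[i][j] ** 2 for i in range(n)) / n) ** 0.5 for j in range(d)]
    return [[Z[i][j] / scale[j] for j in range(d)] for i in range(n)]


def scad_coordinate_descent(Z, y, lam, a=3.7, tol=1e-8, max_sweeps=500):
    n, d = len(Z), len(Z[0])
    theta = [0.0] * d
    resid = list(y)  # residual y - Z theta at the starting point theta = 0
    history = [objective(Z, y, theta, lam, a)]
    for _ in range(max_sweeps):
        for j in range(d):
            # Univariate least-squares coefficient of column j on the partial residual.
            z_j = sum(Z[i][j] * resid[i] for i in range(n)) / n + theta[j]
            new = scad_threshold(z_j, lam, a)
            step = new - theta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= step * Z[i][j]
                theta[j] = new
        history.append(objective(Z, y, theta, lam, a))
        # Stop when the relative change in the objective is below the tolerance.
        if abs(history[-2] - history[-1]) <= tol * max(abs(history[-2]), 1e-12):
            break
    return theta, history


# Toy data (hypothetical): 60 subjects, 6 columns, two nonzero coefficients.
rng = random.Random(1)
Z = scale_columns([[rng.gauss(0, 1) for _ in range(6)] for _ in range(60)])
truth = [2.0, 0.0, 0.0, -1.5, 0.0, 0.0]
y = [sum(z * b for z, b in zip(row, truth)) + 0.1 * rng.gauss(0, 1) for row in Z]
theta_hat, history = scad_coordinate_descent(Z, y, lam=0.2)
print([round(v, 3) for v in theta_hat])
```

With unit-mean-square columns and a > 2, each coordinate update exactly minimizes the objective in that coordinate, so the recorded objective values in `history` never increase from one sweep to the next.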
The optimal pair $(a, \lambda)$ can be obtained through a two-dimensional grid search using cross-validation, which is computationally expensive. Fan and Li [7] recommended setting $a = 3.7$ as a robust choice across diverse scenarios and showed that the choice of $a$ does not materially affect overall performance relative to $\lambda$. We adopt $a = 3.7$ throughout and select $\lambda$ by cross-validation.
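With a held fixed, the search reduces to a one-dimensional grid over λ. The sketch below shows a generic K-fold cross-validation over subjects with a pluggable fitting routine. The names `cv_select_lambda`, `kfold_indices`, and `toy_fit` are hypothetical, and the toy fitter (a one-column soft-thresholded least-squares fit) merely stands in for the SCAD solver, which is an assumption made to keep the example short.

```python
import random


def kfold_indices(n, k, seed=0):
    """Split subject indices 0..n-1 into k disjoint folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cv_select_lambda(Z, y, lambdas, fit, k=5, seed=0):
    """Return the lambda with the smallest K-fold mean squared prediction error.

    fit(Z_train, y_train, lam) must return a coefficient list.
    """
    folds = kfold_indices(len(y), k, seed)
    cv_error = []
    for lam in lambdas:
        sq_err, count = 0.0, 0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            theta = fit([Z[i] for i in train], [y[i] for i in train], lam)
            for i in held_out:
                pred = sum(z * t for z, t in zip(Z[i], theta))
                sq_err += (y[i] - pred) ** 2
                count += 1
        cv_error.append(sq_err / count)
    best = min(range(len(lambdas)), key=cv_error.__getitem__)
    return lambdas[best], cv_error


def toy_fit(Z_train, y_train, lam):
    """Stand-in fitter: soft-thresholded least squares for a single column."""
    z = sum(r[0] * v for r, v in zip(Z_train, y_train)) / sum(r[0] ** 2 for r in Z_train)
    sign = (z > 0) - (z < 0)
    return [sign * max(abs(z) - lam, 0.0)]


# Noise-free toy data y = 2x, so no shrinkage is the best choice.
Z = [[float(x)] for x in range(1, 11)]
y = [2.0 * row[0] for row in Z]
best_lam, cv_error = cv_select_lambda(Z, y, [0.0, 0.5, 5.0], toy_fit, k=5)
print(best_lam, cv_error)
```

Folds are formed over subjects rather than over time points, which matches the fixed-t, across-subject regression fitted at each time point.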
In the context of sequential linear models, we employ the SCAD penalty as in Equations (7)–(9) to enhance variable selection and refine coefficient estimation. The sparsity property, combined with the smooth penalization of the SCAD penalty, proves beneficial for handling the sequential and correlated nature of time-series data.
We remark that the SCAD penalty [7] is effective for parameter estimation and variable selection, especially in high-dimensional data, because it promotes sparsity, achieves asymptotic unbiasedness, and satisfies the oracle property. Like the Lasso, the SCAD-penalized method removes variables with small coefficients. Because the SCAD penalty is nonconvex and levels off for large arguments, it penalizes small coefficients heavily while leaving large coefficients nearly unbiased. As the sample size grows, the SCAD-penalized method consistently identifies the true set of non-zero coefficients. By Corollary 1, this in particular identifies the true lag order in the sequential linear modeling and thereby improves predictive accuracy in models (1) and (2).
As shown in the simulation results in the next section, the SCAD penalty provides a better solution to penalized estimation and variable selection than the Lasso and adaptive Lasso. Rather than applying a uniform penalty to the coefficients, SCAD applies a nonconvex penalty that shrinks small coefficients toward zero while preserving the large ones for more accurate estimation. Under assumptions A7 (restricted eigenvalue), A8 (sparsity rate), and A9 (minimum signal), the SCAD-penalized method achieves the oracle property by Theorem 1, even when the number of predictors exceeds the sample size. See also [21] for a broader overview of SCAD-type methods in high dimensions.
The regularization parameter $\lambda$ controls shrinkage of the coefficients, and the coefficient vector $\theta_t$ combines the parameters for the independent variables and the lagged coefficients for the previous outcomes.
Note that sequential linear modeling often involves dependencies between observations at different time points, and the SCAD-penalized method helps mitigate multicollinearity issues by encouraging a parsimonious representation of the underlying patterns. Xie and Huang [10] employed the SCAD penalty to promote sparsity in the partially linear model while using polynomial splines to estimate the nonparametric function. Their work demonstrates that the SCAD penalty not only achieves consistent variable selection but also accurately identifies the correct model structure, enabling efficient parameter estimation under the true model.
3.4. Selection of Tuning Parameter
The process of selecting the regularization tuning parameter in penalized methods is crucial. In sequential linear models, where temporal dependencies are present, careful consideration is needed to balance the penalty on the coefficients and the preservation of important features. The tuning parameter must therefore be selected appropriately.
As discussed in the previous subsection, the SCAD-penalized method estimates the parameters in the sequential linear model by shrinking non-significant parameters to zero and reducing the bias of significant parameter estimates. Therefore, the SCAD-penalized method consistently selects the significant parameters. Fan and Li [7] showed that the SCAD method is applicable in both parametric and high-dimensional nonparametric settings. The convergence rate of the penalized-likelihood estimators depends on the regularization parameter, so an appropriately selected tuning parameter is needed for effective variable selection and accurate parameter estimation.
To facilitate tuning-parameter selection, cross-validation is commonly used. The sequential linear model involves additional complexities, such as the construction of lagged variables with time-effect coefficients, which requires a careful balance between penalizing coefficients and preserving important features. In this context, the SCAD penalty proves effective, which is also why its performance surpasses that of the other methods adopted here, as demonstrated in the simulations of the next section.
For the penalized methods with different penalty terms, we use cross-validation to choose the tuning parameter, which controls the weight of the shrinkage on parameter estimates. In the sequential linear model with the Lasso method, the $L_1$ penalty function is
$$p_\lambda(|\beta_j|) = \lambda\,|\beta_j|,$$
and the Lasso estimator admits the coordinate-wise thresholding rule
$$\hat{\beta}_j = \operatorname{sgn}(z_j)\,\bigl(|z_j| - \lambda\bigr)_+,$$
where $z_j$ denotes the unpenalized least-squares coefficient, $\operatorname{sgn}(\cdot)$ is the sign function, and $(x)_+ = \max(x, 0)$.
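The coordinate-wise thresholding rule described above can be sketched in a few lines. This is only an illustration of soft-thresholding, not the paper's implementation; the helper name `soft_threshold` and the example values are assumptions made for the sketch, with `z` standing for the unpenalized least-squares coefficient and `lam` for the tuning parameter.

```python
def soft_threshold(z, lam):
    """Return sign(z) * max(|z| - lam, 0), the soft-thresholding rule."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if abs(z) <= lam:
        return 0.0  # small coefficients are set exactly to zero
    return z - lam if z > 0 else z + lam  # the rest move toward zero by lam


# Hypothetical unpenalized coefficients, thresholded with lam = 0.5.
shrunk = [soft_threshold(z, 0.5) for z in (-2.0, -0.3, 0.0, 0.4, 1.5)]
```

Coefficients whose magnitude does not exceed `lam` are removed, while every surviving coefficient is shifted by the same amount, which is the source of the Lasso's bias for large coefficients.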
With the adaptive Lasso method, the coordinate-wise penalty is
$$p_\lambda(|\beta_j|) = \lambda\, w_j\, |\beta_j|,$$
where $w_j$ is the weight assigned to the coefficient $\beta_j$. The adaptive Lasso’s penalty on each coefficient is adjusted based on the corresponding Lasso estimate. This variable-specific shrinkage lets the adaptive Lasso retain variables with larger estimated coefficients while penalizing the remaining variables more heavily.
The purpose of using adaptive weights is to down-weight certain variables and up-weight others, allowing for more effective variable selection. Variables with larger Lasso estimates receive smaller weights, while variables with smaller Lasso estimates receive larger weights.
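The weighting scheme described above can be illustrated with a short sketch. It assumes the commonly used inverse-magnitude form, in which each weight is one over a power of the absolute initial estimate; the exponent `gamma`, the small constant `eps` that avoids division by zero, and the function name are assumptions for illustration and are not taken from the paper.

```python
def adaptive_weights(initial_estimates, gamma=1.0, eps=1e-8):
    """Inverse-magnitude weights: large initial estimates get small weights.

    gamma and eps are illustrative choices, not values from the paper.
    """
    return [1.0 / (abs(b) + eps) ** gamma for b in initial_estimates]


# Hypothetical initial Lasso estimates for three coefficients.
lasso_estimates = [2.0, 0.5, 0.0]
weights = adaptive_weights(lasso_estimates)
```

A coefficient that the initial Lasso fit set to zero receives a very large weight, so it is effectively excluded in the second stage, while a large initial estimate is penalized only lightly.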
Before fitting the model with each method, we split the data into training and test sets in the ratio 4:1. The training set is used to estimate the parameters, identify significant variables, and tune regularization parameters. The test set evaluates the model’s predictive performance. Before the final fit, the regularization parameter is tuned to minimize the cross-validated mean squared error.
Cross-validation is a powerful technique to ensure the robustness of the generalization performance and avoid overfitting in sequential linear modeling. By splitting the training data into training and validation subsets in K-fold cross-validation, the procedure helps the SCAD-penalized method select significant variables and prevent over-shrinkage. It is also used in combination with the Lasso and the adaptive Lasso to assess performance and select an optimal regularization tuning parameter.
The prediction error serves as the criterion in cross-validation. With the training set fixed, cross-validation estimates the expected prediction error, and the validation set provides an assessment of the prediction model. In K-fold cross-validation, each fold is held out in turn: the remaining K − 1 folds are used to fit the model, and the held-out fold is used to evaluate performance. The rule for selecting the best tuning parameter is to minimize the estimated prediction error averaged over the K held-out folds.
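The selection rule above can be sketched as follows. To keep the example self-contained, it tunes a single-predictor Lasso without intercept, whose solution is available in closed form; this stand-in model, the random assignment of observations to folds, the grid of candidate values, and all function names are assumptions made for illustration, not the paper's procedure.

```python
import random


def soft_threshold(z, lam):
    if abs(z) <= lam:
        return 0.0
    return z - lam if z > 0 else z + lam


def fit_univariate_lasso(x, y, lam):
    """Minimize (1/2n) * sum((y - x*b)^2) + lam * |b| for one predictor."""
    n = len(x)
    z = sum(xi * yi for xi, yi in zip(x, y)) / n
    v = sum(xi * xi for xi in x) / n
    return soft_threshold(z, lam) / v


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, lambdas, k=10, seed=0):
    """Return the lambda with the smallest K-fold mean squared prediction error."""
    folds = k_fold_indices(len(x), k, seed)
    cv_error = {}
    for lam in lambdas:
        squared_errors = []
        for held_out in folds:
            held = set(held_out)
            x_train = [x[i] for i in range(len(x)) if i not in held]
            y_train = [y[i] for i in range(len(x)) if i not in held]
            b = fit_univariate_lasso(x_train, y_train, lam)  # fit on K - 1 folds
            squared_errors += [(y[i] - b * x[i]) ** 2 for i in held_out]
        cv_error[lam] = sum(squared_errors) / len(squared_errors)
    best = min(cv_error, key=cv_error.get)
    return best, cv_error


# Hypothetical data: one predictor with true coefficient 2 and unit-variance noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
lambdas = [0.0, 0.1, 0.5, 1.0, 3.0]
best_lambda, cv_error = cross_validate(x, y, lambdas, k=10)
```

The same loop applies to any penalized fit: only `fit_univariate_lasso` and the prediction step would change.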
According to Equation (7), the SCAD-penalized method selects variables with coefficients $\beta_j$ as follows.
When $|\beta_j|$ is small, the SCAD penalty behaves like the Lasso and shrinks the parameter to zero.
When $|\beta_j|$ is moderate, the SCAD penalty grows more slowly than the Lasso penalty, so the shrinkage tapers off as the magnitude increases.
When $|\beta_j|$ is large, the SCAD penalty is constant, avoiding bias for large parameters.
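The three regimes can be made concrete with a short sketch of the SCAD penalty. Because Equation (7) is not reproduced in this section, the code assumes the standard piecewise form of Fan and Li with shape parameter `a` (3.7 is their suggested default); the function name and the example values are illustrative.

```python
def scad_penalty(beta, lam, a=3.7):
    """Standard SCAD penalty evaluated at |beta| (assumed form, with a > 2)."""
    t = abs(beta)
    if t <= lam:
        # Small coefficients: identical to the Lasso penalty lam * |beta|.
        return lam * t
    if t <= a * lam:
        # Moderate coefficients: quadratic transition that grows more slowly.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so no further shrinkage is applied.
    return lam * lam * (a + 1) / 2


lam = 1.0
small = scad_penalty(0.5, lam)      # Lasso-like regime
moderate = scad_penalty(2.0, lam)   # transition regime, below the Lasso value 2.0
large = scad_penalty(10.0, lam)     # constant regime
```

The penalty is continuous at both breakpoints, `lam` and `a * lam`, and it never exceeds the Lasso penalty at the same coefficient value.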
3.5. Algorithmic Convergence and Rate of Convergence
Having defined the SCAD-penalized estimator, established its oracle property, described the coordinate-descent algorithm used to compute it, and specified the tuning-parameter selection procedure, we conclude the theoretical development of the method by addressing two further questions. First, does the coordinate-descent algorithm in Section 3.3 reliably produce an estimator to which Theorem 1 applies? Second, at what rate does this estimator approach the true parameter as the sample size grows? Because the SCAD objective (6) is non-convex, global optimization is generally intractable, so neither question is trivial. Both, nevertheless, admit theoretical answers that we summarize here.
Algorithmic convergence. The SCAD-penalized objective in (6) is minimized using the coordinate-descent algorithm of [22], implemented in the R package ncvreg. At each step, the algorithm updates a single coordinate by soft-thresholding a univariate quadratic majorant of the objective. Breheny and Huang [22], in their Proposition 1, show that the sequence of iterates produced by this procedure converges to a coordinate-wise minimum of (6), which is also a local minimum because the directional derivatives of the SCAD penalty exist everywhere. In our implementation, convergence is declared when the relative change in the objective falls below the tolerance
. Moreover, Breheny and Huang [22] establish a convexity diagnostic identifying regions of the parameter space where the SCAD objective is locally convex despite the non-convexity of the penalty; within such a region, the local minimum is the unique minimizer over that region, and the coordinate-descent output coincides with it.
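A stripped-down version of the coordinate-wise update can be sketched as follows. This is not the ncvreg implementation: it assumes no intercept, predictor columns scaled to unit mean square, and the closed-form univariate SCAD solution with shape parameter `a = 3.7`. As a simplification, it stops when the largest coefficient change in a sweep falls below `tol`, rather than using the relative change in the objective; all names and data are hypothetical.

```python
def soft_threshold(z, lam):
    if abs(z) <= lam:
        return 0.0
    return z - lam if z > 0 else z + lam


def scad_threshold(z, lam, a=3.7):
    """Closed-form univariate SCAD solution for a unit-mean-square predictor."""
    if abs(z) <= 2 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        return soft_threshold(z, a * lam / (a - 1)) / (1 - 1 / (a - 1))
    return z  # large signals are left unshrunk


def scad_coordinate_descent(columns, y, lam, a=3.7, tol=1e-8, max_iter=1000):
    """columns[j] is the j-th predictor, scaled so that mean(x_j ** 2) == 1."""
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)  # residuals for the current beta (all zeros initially)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            xj = columns[j]
            # Univariate least-squares solution given the other coefficients.
            z = sum(xj[i] * resid[i] for i in range(n)) / n + beta[j]
            new = scad_threshold(z, lam, a)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * xj[i]
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:  # simplified stopping rule (an assumption)
            break
    return beta


# Hypothetical orthogonal design with one strong and one weak signal.
x1 = [1.0, 1.0, -1.0, -1.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y = [5.0 * u + 0.1 * v for u, v in zip(x1, x2)]
beta_hat = scad_coordinate_descent([x1, x2], y, lam=0.5)
```

In this toy example the strong coefficient is recovered without shrinkage and the weak one is set to zero, which is the qualitative behavior the oracle property describes.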
Selection of the oracle local minimum. The SCAD objective may possess multiple local minima, and the oracle property of Theorem 1 holds for one of them. A relevant question is therefore whether the specific local minimum returned by our coordinate-descent solver is the one that attains the oracle property. Fan, Xue, and Zou [23] provide the sharpest available answer. They show that, under a localizability condition on the oracle estimator and the SCAD penalty, the local linear approximation (LLA) algorithm initialized by a
-consistent estimator, such as the Lasso, converges to the oracle estimator in a finite number of iterations with probability tending to one. Our implementation uses coordinate descent rather than explicit LLA, but ncvreg constructs the regularization path through warm starts, so each SCAD fit is initialized from a sparse, near-Lasso solution obtained at the previous point on the path. This is closely analogous to the
-consistent initialization required by [23], and their theoretical guarantee is informative for the practical behavior of the solver.
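The LLA idea can be illustrated in the simplest possible setting. The sketch below assumes an orthonormal design, so that each weighted-Lasso step reduces to coordinate-wise soft-thresholding, and it uses the standard SCAD derivative with `a = 3.7`; the function names and input values are hypothetical, and the example says nothing about the general conditions in [23].

```python
def soft_threshold(z, lam):
    if abs(z) <= lam:
        return 0.0
    return z - lam if z > 0 else z + lam


def scad_derivative(t, lam, a=3.7):
    """Derivative of the standard SCAD penalty at t = |beta| >= 0."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)


def lla_orthonormal(z, lam, a=3.7, max_steps=10):
    """LLA iterations under an orthonormal design (illustrative assumption).

    z holds the unpenalized least-squares coefficients.
    """
    beta = [soft_threshold(zj, lam) for zj in z]  # Lasso initial estimator
    for step in range(1, max_steps + 1):
        # Linearize the penalty at the current estimate: one weight per coefficient.
        weights = [scad_derivative(abs(b), lam, a) for b in beta]
        # With an orthonormal design, the weighted Lasso has a closed form.
        updated = [soft_threshold(zj, w) for zj, w in zip(z, weights)]
        if updated == beta:
            return beta, step
        beta = updated
    return beta, max_steps


estimate, steps = lla_orthonormal([5.0, 0.1, -3.0], lam=0.5)
```

Starting from the Lasso solution, the first LLA step removes the shrinkage on the two large coefficients, because their penalty derivative is zero, and the second step confirms that a fixed point has been reached.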
Statistical rate of convergence. The oracle property stated in Theorem 1 directly implies a rate of convergence for the estimator on the active set. Specifically, the asymptotic normality statement in Theorem 1(ii) yields
where
is the cardinality of the true active set. This is the classical parametric
rate, the same rate attained by the oracle least-squares estimator that knows the active set in advance. Combined with the selection-consistency conclusion of Theorem 1(i), this shows that the SCAD-penalized sequential linear model estimator matches the oracle estimator both in support recovery and in rate of convergence, as
.
Taken together, these three observations provide the theoretical basis for the numerical performance reported in Section 4. By [22], the algorithm converges to a local minimum; by [23], that local minimum coincides with the oracle estimator when the algorithm is initialized appropriately; and by Theorem 1, the oracle estimator attains the parametric rate.
4. Simulations
4.1. Simulation Settings
To assess the performance of the SCAD-penalized method in the sequential linear model (2), we conduct a simulation study with a fixed sample size and different numbers of predictors, along with a set of continuous outcomes observed at fifty time points. In each simulation setting, we compare the SCAD-penalized method with OLS, Lasso, and adaptive Lasso in traditional linear and sequential linear models. The optimal tuning parameter is selected separately for the Lasso, SCAD-penalized, and adaptive Lasso methods via 10-fold cross-validation.
Consider a dataset with repeated measures of outcomes for through time points . A model is trained to sequentially predict the outcome at each time point, with baseline data matrix and parameters . The number of predictors is set as low (), medium (), and high () dimensions in the simulation study.
At the first time point , the model is where .
For any subsequent time point
t, the sequential linear model with time series is
where
and
. Model (
10) is iterated to generate data until
. At the initial time point
, the model does not involve the term
.
The number of observations is . The three levels of predictor numbers are and 1000. We independently generate the simulated data matrix from the multivariate normal distribution with mean and covariance at each time point.
The true regression coefficients are , , with the remaining equal to 0. We set the effective lag order to , with true time-effect coefficient . We take . The data are split in a 4:1 ratio into training and test sets. This setup is an explicit instance of the compatible construction in Remark 1.
The generated data are fitted by the linear model (LM) and the sequential linear model (SLM), respectively, using OLS, Lasso, adaptive Lasso, and the SCAD-penalized method.
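The data-generating recursion described above can be sketched roughly as follows. Because the model equations and parameter values are not reproduced in this section, everything specific in the code is an assumption: a lag-one dependence on the previous outcome, independent standard-normal predictors drawn afresh at each time point, and placeholder values for the coefficients, the lag coefficient `alpha`, and the dimensions. Only the positions of the non-zero coefficients (first, third, fifth, seventh, and ninth) follow the text.

```python
import random


def simulate_sequential(n, p, beta, alpha, n_times, noise_sd=1.0, seed=0):
    """Generate outcomes y_t = alpha * y_{t-1} + x_t' beta + noise (assumed form).

    At the first time point there is no lagged outcome term.
    Returns (designs, outcomes), indexed as designs[t][i][j] and outcomes[t][i].
    """
    rng = random.Random(seed)
    designs, outcomes = [], []
    previous = None
    for _ in range(n_times):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = []
        for i in range(n):
            mean = sum(b * x for b, x in zip(beta, X[i]))
            if previous is not None:
                mean += alpha * previous[i]  # lagged outcome enters from t = 2 on
            y.append(mean + rng.gauss(0, noise_sd))
        designs.append(X)
        outcomes.append(y)
        previous = y
    return designs, outcomes


# Placeholder settings: non-zero coefficients at positions 1, 3, 5, 7, and 9.
true_beta = [1.0 if j in (0, 2, 4, 6, 8) else 0.0 for j in range(10)]
designs, outcomes = simulate_sequential(
    n=20, p=10, beta=true_beta, alpha=0.5, n_times=5
)
```

Each replicate of a simulation study would call such a generator with a different seed and then fit the competing methods to the resulting data.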
4.2. Simulation Results for Estimation and Variable Selection
The simulation results for the estimation of the coefficients and the mean prediction test errors are averaged over 1000 replicates.
Predictive performance is measured by the mean absolute prediction error (MAPE) on the test set.
Table 1 reports the MAPE for each method.
Following [24], we use the relative risk (RR) as the accuracy metric for comparing methods. A smaller RR indicates a more accurate coefficient estimate. The formula for the relative risk is
where
denotes the covariance matrix of
, and
. We compute RR at the terminal time point
. The perfect score is 0, which means
equals
, while RR
means
is zero.
Table 2 features the RR values for each method.
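As a rough illustration of the metric, the sketch below assumes the usual definition of relative risk as a ratio of two quadratic forms in the predictor covariance matrix: the estimation error in the numerator and the true coefficient vector in the denominator. This form is consistent with the two reference values stated above (0 for a perfect estimate and 1 for an all-zero estimate), but the exact notation of the paper is not reproduced here, and the function names and inputs are hypothetical.

```python
def quadratic_form(v, sigma):
    """Compute v' * sigma * v for a vector v and a square matrix sigma."""
    k = len(v)
    return sum(v[i] * sigma[i][j] * v[j] for i in range(k) for j in range(k))


def relative_risk(beta_hat, beta_true, sigma):
    """Assumed definition: (b_hat - b)' S (b_hat - b) / (b' S b)."""
    diff = [bh - bt for bh, bt in zip(beta_hat, beta_true)]
    return quadratic_form(diff, sigma) / quadratic_form(beta_true, sigma)


# Hypothetical two-predictor example with correlated predictors.
sigma = [[1.0, 0.5], [0.5, 1.0]]
beta_true = [1.0, 2.0]
rr_perfect = relative_risk([1.0, 2.0], beta_true, sigma)
rr_null = relative_risk([0.0, 0.0], beta_true, sigma)
rr_close = relative_risk([0.9, 2.1], beta_true, sigma)
```

An estimate that is close to the truth gives a value between the two reference points, so smaller values indicate more accurate estimation.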
Table 1 shows that, for the time-series data, the mean absolute prediction error of the SCAD-penalized method is among the smallest across methods for all models and dimensions.
For the sequential linear model,
Table 1 illustrates that the SCAD-penalized method has a relatively small mean absolute prediction error, indicating high prediction accuracy. For instance, in the medium dimension (
) with time series, the estimated
is
. The average prediction errors for the Lasso, SCAD, and adaptive Lasso are
,
, and
, respectively. Thus, in the medium dimension with time series, the SCAD-penalized method fitted by a sequential linear model is the best choice. Similarly, at high dimensions (
), the SCAD-penalized method attains the smallest MAPE in the SLM.
Table 2 also shows that, when fitted by the sequential linear model, the SCAD-penalized method generally achieves low mean RR values, indicating that it is among the most accurate of the methods considered for parameter estimation.
Additionally, for the generated data with time series, the RR results in Table 2 and the MAPE results in Table 1 imply that sequential linear modeling performs better than a simple linear model. This confirms the advantages of sequential linear modeling in models (2) and (3). When fixed effects and time-series data are involved, sequential linear modeling is a good choice, mirroring the role of the ADL for dynamic data as discussed in the Introduction.
In the generated data, we also conduct simulations for variable selection. We set five non-zero true values as the significant variables (the first, third, fifth, seventh, and ninth), while the rest are insignificant. The individual empirical interval for each parameter is calculated as the range from the 2.5% to the 97.5% quantile of the estimates across the 1000 replications.
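The interval construction can be sketched as follows. The code assumes linear interpolation between order statistics, which is a common default for sample quantiles; the paper does not state which quantile rule it uses, and the function names and input values are illustrative.

```python
def quantile(values, q):
    """Sample quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def empirical_interval(estimates, level=0.95):
    """Central interval covering `level` of the replicate estimates."""
    tail = (1 - level) / 2
    return quantile(estimates, tail), quantile(estimates, 1 - tail)


def count_selected(beta_hat):
    """Number of variables selected, i.e., non-zero coefficient estimates."""
    return sum(1 for b in beta_hat if b != 0.0)


# Hypothetical replicate estimates of a single coefficient: 0, 1, ..., 100.
replicate_estimates = [float(v) for v in range(101)]
lower_bound, upper_bound = empirical_interval(replicate_estimates)
n_selected = count_selected([0.8, 0.0, 1.1, 0.0, 0.0])
```

Applying `empirical_interval` to the per-replicate values of `count_selected` gives an interval for the number of selected variables. With interpolation, its endpoints need not be integers.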
To examine the estimation accuracy for the significant variables under the OLS, Lasso, SCAD-penalized, and adaptive Lasso methods, the empirical 95% intervals (hereafter abbreviated “ACI”) for all significant variables and for the number of significant variables are summarized in Table 3, Table 4 and Table 5 for the three dimensions. The methods are fitted both by the traditional linear model (LM) and the sequential linear model (SLM).
Table 3 gives these intervals in the low-dimension setting. Based on the simulated data with
, we expect the interval for the number of significant variables to be close to the truth (5 in the LM; 6 in the SLM, counting the lag coefficient). In the LM, the adaptive Lasso attains the tightest interval around the truth. In the SLM, however, the SCAD-penalized method is the best, with the tightest intervals for both the number of significant variables (6, 9.53) and the individual parameters. Its empirical interval for the lag coefficient is (0.59, 0.66), the tightest among all methods.
In the medium-dimension setting, we compare the three penalized methods. In the SLM, SCAD yields 95% empirical intervals of (6, 11) for the number of significant variables and (0.60, 0.68) for the lag coefficient, both the tightest among the three methods. Details are in Table 4.
Table 5 shows that in the high-dimension setting, the SCAD-penalized method again attains the best empirical 95% intervals. With $p = 1000$ predictors and 1000 repetitions, SCAD yields intervals of (5, 12.53) for the LM and (6, 17) for the SLM on the number of significant variables, the tightest among the three methods, and its interval for the individual coefficients is also close to the truth. Its interval for the lag-effect coefficient
is (0.59, 0.68), again tighter than the Lasso’s. The adaptive Lasso has a 95% empirical interval for
that is outside the true value range.
6. Concluding Remarks and Discussion
In sequential linear modeling, the measurements at each time point enable the prediction of data at intermediate time points, improving prediction accuracy because correlation structures are correctly captured. In the sequential linear regression model, lagged dependent variables are repeatedly included as input variables in the subsequent predictive models. We therefore explore sequential linear modeling and its performance in estimation and variable selection.
For the sequential linear model proposed in this study, we introduce the SCAD-penalized method to improve estimation and variable selection. We establish the oracle property of the estimator (Theorem 1) and show that lag-order selection is consistent as a corollary (Corollary 1). The SCAD penalty is expected to perform better in estimation and variable selection than the Lasso and adaptive Lasso. The Lasso performs variable selection by shrinking coefficients to zero, which is useful in medium or high dimensions. The adaptive Lasso applies adaptive penalty weights to parameter estimates, improving over the Lasso. The SCAD-penalized method applies different shrinkage at different coefficient magnitudes: for small coefficients, it behaves like the Lasso and shrinks them to zero; for moderate coefficients, the penalty grows at a slower rate than the Lasso penalty, reducing shrinkage; and for large coefficients, the penalty is constant, reducing bias and preserving significant signals.
We conduct simulations to explore the performance of the proposed SCAD-penalized method compared with Lasso and adaptive Lasso. OLS is used in low-dimension settings for reference. Across low, medium, and high dimensions, we generate base datasets with
and 1000 with 1000 replications, and the error terms follow the standard normal distribution. The algorithmic and statistical convergence of the proposed estimator, including the
-rate on the active set, are discussed theoretically in
Section 3.5 by appeal to the results of [22, 23].
We compare OLS, Lasso, SCAD, and adaptive Lasso with lagged dependent variables incorporated as predictors at each time point. OLS applies only to low-dimensional settings, because the least-squares solution is not unique when p exceeds n.
The simulation results show that the SCAD-penalized method improves over the Lasso by reducing bias in parameter estimation and lowering prediction errors. The SCAD penalty improves estimation accuracy and reduces variability, yielding better predictions in medium- and high-dimensional settings. Results are summarized through the MAPE, RR, and empirical 95% interval (ACI) outputs across methods and scenarios. The 95% empirical intervals show that the SCAD-penalized method yields results closest to the true settings, both for the number of significant variables and for the true parameter values. Notably, OLS performs well in low-dimensional settings, while in medium- and high-dimensional settings the SCAD-penalized method is superior for parameter estimation and variable selection.
To further explore the performance of the proposed SCAD-penalized method, we applied it to two real datasets in comparison with OLS, Lasso, and adaptive Lasso. The first dataset is the heart valve replacement surgery data from the joineR package [18]. SCAD attains the lowest mean absolute prediction error in both the linear and sequential models. The second dataset contains recent housing prices in Austin, TX [25]. Although the modeling assumptions are not perfectly satisfied, removing extreme outliers or taking the log of the response variable strengthens the linear relationship between the predictors and the log housing price. The SCAD-penalized method demonstrates more stable performance in variable selection, incorporating most of the significant variables also identified by the adaptive Lasso.
In future work, we will analyze datasets with strong multicollinearity and compare OLS, Lasso, SCAD, and adaptive Lasso in that setting. Many real datasets exhibit high correlation among variables, which motivates applying various penalized methods to sequential linear modeling when selecting significant lagged dependent variables from multivariate time series. We will also explore hypothesis tests to determine the minimum sample size needed for reliable prediction in sequential linear modeling, that is, the sample size at which the selection of significant variables is stable and the model yields good prediction performance in applications.