Article

A Group MCP Approach for Structure Identification in Non-Parametric Accelerated Failure Time Additive Regression Model

Sumin Hou 1 and Hao Lv 2,*
1 School of Economics, Jinan University, Guangzhou 510632, China
2 Department of Mathematics, Guangdong University of Education, Guangzhou 510632, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(22), 4628; https://doi.org/10.3390/math11224628
Submission received: 14 October 2023 / Revised: 6 November 2023 / Accepted: 10 November 2023 / Published: 13 November 2023
(This article belongs to the Special Issue Statistics and Nonlinear Analysis: Simulation and Computation)

Abstract: In biomedical research, identifying genes associated with diseases is of paramount importance, yet only a small fraction of the multitude of genes is related to any specific disease. Gene selection and estimation are therefore necessary, and the accelerated failure time model is often used to address such issues. This article presents a method for structure identification and parameter estimation based on a non-parametric additive accelerated failure time model for censored data. Regularized estimation and variable selection are achieved with the Group MCP penalty. The non-parametric components of the model are approximated with B-spline basis functions, and a group coordinate descent algorithm is employed to solve the model. The approach effectively identifies both the linear and the nonlinear factors in the model. Under regularity conditions, the Group MCP penalized estimator is consistent and possesses the oracle property: with probability tending to 1, the selected set of variables coincides with the true set of predictive factors. Numerical simulations and a lung cancer data analysis demonstrate that the Group MCP method outperforms the Group Lasso method in terms of predictive performance, with the proposed algorithm converging faster.

1. Introduction

In both economic and biological research, it is common that theory does not prescribe a specific functional form for the relationship between predictors and outcomes. For example, in biomedical studies, the influence of predictors on survival time can be nonlinear. Fitting a linear model in such cases can yield biased estimates or misleading results. In a non-parametric model, by contrast, the functional shape is determined by the data, eliminating the need to impose a linear form on the influence of a covariate, and non-parametric models offer greater flexibility in fitting the data than parametric models. This paper studies the non-parametric accelerated failure time additive regression (NP-AFT-AR) model:
$$t_i = u + \sum_{j \in S_1} \beta_j x_{ij} + \sum_{j \in S_2} f_j(x_{ij}) + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{1}$$
where $(t_i, x_{i1}, \ldots, x_{ip})$, $1 \le i \le n$, is a random sample, $t_i$ is the logarithm of the response variable, that is, the logarithm of the survival time, and $(x_{i1}, \ldots, x_{ip})'$ is a $p \times 1$ vector of covariates. $S_1$ and $S_2$ are disjoint, complementary subsets of $\{1, \ldots, p\}$, $\{\beta_j : j \in S_1\}$ are the regression coefficients of the covariates with indices in $S_1$, and $\{f_j : j \in S_2\}$ are unknown functions. The covariates in $S_1$ have a linear relationship with the mean response, whereas the effect of the other covariates is not determined by a finite number of parameters. Parametric models require explicit distributional and structural assumptions and tend to overfit when the number of parameters is excessive; some are also built on linearity and cannot capture complex nonlinear relationships. On the other hand, parametric models offer clear interpretability of explicit parameters, efficiency, and accurate parameter estimation. This paper therefore leverages these characteristics and explores a hybrid approach that combines parametric and non-parametric components to enhance the adaptability and performance of the model. When the emphasis is on the relationship between $t_i$ and $\{x_{ij} : j \in S_1\}$, which can be approximated by a linear function, the model provides enhanced interpretability compared to a purely non-parametric additive model. The random error term $\varepsilon_i$ has mean zero and finite variance $\sigma^2$. Assuming that certain components $f_j$ are zero, our main objective is to distinguish the nonzero components from the zero components and to estimate the nonzero components accurately. A secondary objective is to elucidate the functional forms of the nonzero components, thereby suggesting a more concise model. The techniques we develop extend readily to the partially linear additive AFT regression model, particularly when some covariates are discrete and not amenable to modeling with smoothing techniques such as B-splines. We use the lung cancer data example to demonstrate this extension.
The structure identification method is effective in distinguishing linear variables from nonlinear ones, and numerous scholars have contributed to the relevant methodology. Tibshirani [1] combined the least squares estimation ideas of Breiman [2] with minimization of the residual sum of squares under a constraint, transforming the solution into a continuous optimization process. This approach is known as the Lasso, where a penalty is applied to select variables and the estimated coefficients are continuously shrunk toward zero so that important explanatory variables are identified automatically. However, researchers such as Zhao and Yu [3] and Zou [4] discovered that the Lasso may not consistently select the correct model and that its estimated regression coefficients do not exhibit asymptotic normality. To address this limitation, Fan and Li [5] proposed the SCAD penalty, which replaces the $\ell_1$ penalty in the Lasso with a quadratic spline penalty function to reduce bias. In the context of linear models, the SCAD method can consistently identify the true model and possesses the oracle property. Nonetheless, the non-convex nature of the SCAD penalty makes it challenging to optimize in practice, leading to numerical instability during the solution process. Zhang [6] introduced the non-convex MCP (minimax concave penalty) and developed the MCP penalized likelihood procedure as an alternative. The MCP likewise replaces the $\ell_1$ penalty in the Lasso with a quadratic spline penalty function to reduce bias; it selects the correct model with probability tending to 1 and provides estimates with the oracle property.
Heller [7] employed weighted kernel-smoothed rank regression to estimate the unknown parameters in the AFT model with censored data. Gu [8] introduced an empirical model selection approach for non-parametric components based on the Kullback–Leibler geometry. Schumaker [9] provided the spline theory underlying sieve estimation; the Lasso iterative method has been used to select parametric covariates while the non-parametric components are estimated by the sieve method. Johnson [10] extended rank-based Lasso-type estimation to the partially linear AFT model. Huang and Ma [11] applied the AFT model to analyze the relationship between gene expression and survival time, using the bridge penalty for regularized estimation and gene selection. Long et al. [12] established a risk prediction score through regularized rank estimation within a partially linear AFT model. Wei et al. [13] explored extensions of subgroup identification methods based on the Adaptive Elastic Net and the AFT model. Wu and Gao [14] conducted empirical likelihood inference for the AFT model under right-censored data. Cai et al. [15] compared parametric and semiparametric AFT models for clustered survival data. Liu et al. [16] introduced a new semiparametric approach that allows simultaneous selection of important variables, model structure identification, and estimation of covariate effects within the AFT model.
Researchers have used different methods for variable selection and parameter estimation. For instance, Fan and Li [5] employed the Newton algorithm to maximize the penalized likelihood. Cui et al. [17] introduced penalized regression spline approximation and group structure identification within the additive model; however, their approach suffered from computational instability because it relied on truncated power series to approximate the non-parametric components. Huang and Ma [11] proposed a two-step method in which, with a fixed number of predictors, nonzero variables are selected and estimated simultaneously in the additive model, using the Group Lasso in the first stage and the Adaptive Group Lasso in the second. Leng and Ma [18] used the COSSO penalty to handle non-parametric covariate effects in the AFT model. Because the penalty function is non-smooth at the origin, however, the computation can be challenging, and these methods spend considerable time inverting the Hessian matrix, especially with high-dimensional covariates. Therefore, this paper employs the group coordinate descent (GCD) algorithm to approximate and estimate the parameters in the non-parametric additive accelerated failure time model. GCD capitalizes on model sparsity, and the algorithm is simple and fast. The GCD algorithm resembles the standard Newton–Raphson algorithm, but each iteration solves a penalized weighted least squares problem.
Under the assumption that the dimensionality of the covariates is allowed to diverge, this paper rigorously proves that the Group MCP penalized estimator in the non-parametric accelerated failure time model is consistent and possesses the oracle property. The generalized cross-validation criterion [19] is inconsistent for model selection as the sample size tends to infinity, meaning that it may select irrelevant variables; the Bayesian Information Criterion (BIC) does not suffer from this issue and has the desirable property of selecting the true model with probability tending to 1. Therefore, for structure identification in the non-parametric accelerated failure time model, this study tunes the penalty parameters with the BIC criterion.
The remaining sections of the paper are organized as follows. In Section 2, we describe the construction of the AFT model with penalized estimation and variable selection based on the Group MCP. Section 3 introduces the algorithm and parameter tuning for effective identification of both linear and nonlinear factors in the model. In Section 4, we prove the Group MCP's selection consistency property: with probability approaching 1, the selected variable set coincides with the true predictive factors. Section 5 presents numerical simulations and an empirical analysis demonstrating the method's strong predictive performance, including an application to lung cancer data. Section 6 briefly summarizes the conclusions.

2. Penalized Estimation and Variable Selection

2.1. Method

Let $T_i$ denote the natural logarithm of the $i$th survival time and $C_i$ the natural logarithm of the $i$th censoring time, and let $\delta_i$ be the event indicator, i.e., $\delta_i = I(T_i \le C_i)$, which takes the value 1 if the event time is observed and 0 if it is censored. Let $Y_i = \min(T_i, C_i)$ denote the logarithm of the minimum of the survival time and the censoring time. The observed data $(X_i, \delta_i, Y_i)$, $i = 1, \ldots, n$, are assumed to be independent and identically distributed (i.i.d.) samples from $(Y, \delta, X)$. Let $Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}$ be the order statistics of the $Y_i$'s, and let $\delta_{(1)}, \ldots, \delta_{(n)}$ and $X_{(1)}, \ldots, X_{(n)}$ be the associated censoring indicators and covariates. Let $F$ denote the distribution of $T$ and $\hat F_n$ its Kaplan–Meier estimator [20], $\hat F_n(y) = \sum_{i=1}^n w_{ni} I(Y_{(i)} \le y)$, where the $w_{ni}$'s are the Kaplan–Meier weights, calculated as

$$w_{n1} = \frac{\delta_{(1)}}{n}, \qquad w_{ni} = \frac{\delta_{(i)}}{n - i + 1}\prod_{j=1}^{i-1}\left(\frac{n - j}{n - j + 1}\right)^{\delta_{(j)}}, \quad i = 2, \ldots, n.$$

Replacing the latent survival time $t_i$ by the observed $Y_i$ according to whether $T_i \le C_i$ holds, with the other terms unchanged, Model (1) can be expressed as

$$Y_i = u + \sum_{j \in S_1} \beta_j x_{ij} + \sum_{j \in S_2} f_j(x_{ij}) + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{2}$$
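As a concrete illustration, the Kaplan–Meier weights above can be computed directly from the sorted data. The following is a minimal R sketch (the function and variable names are ours, not the authors'):

```r
# Stute-type Kaplan-Meier weights for the weighted least squares AFT fit.
# y: observed log-times, delta: event indicators (1 = observed, 0 = censored).
km_weights <- function(y, delta) {
  n <- length(y)
  ord <- order(y)                  # sort observations by log-time
  d <- delta[ord]
  w <- numeric(n)
  w[1] <- d[1] / n
  prod_term <- 1                   # running product over j = 1, ..., i-1
  for (i in 2:n) {
    prod_term <- prod_term * ((n - i + 1) / (n - i + 2))^d[i - 1]
    w[i] <- d[i] / (n - i + 1) * prod_term
  }
  list(order = ord, weights = w)   # weights aligned with the sorted data
}
```

Censored observations receive zero weight, so the weighted least squares criterion below effectively uses only the observed failure times.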
We now introduce coefficient estimation in Equation (2). Assume for the moment that the groups are orthonormal, i.e., $X_j'X_k = 0$ for $j \neq k$ and $X_j'X_j/n = I_{d_j}$. Then $z = X_j'y/n$ is the least squares estimate of $\theta$, where $\theta$ is the unknown parameter associated with the marker effects [21]. Because $X_j'X_j/n = I_{d_j}$, the penalized least squares objective function can be written as $\frac{1}{2}\|z - \theta\|_2^2 + \rho(\|\theta\|_2; \lambda, \gamma)$. Expressing the linear part of Model (2) as a function and substituting it into the additive non-parametric regression model, we obtain:

$$y_i = u + f_1(x_{i1}) + \cdots + f_p(x_{ip}) + \varepsilon_i \tag{3}$$
To ensure unique identification of the $f_j$'s, we assume that $E f_j(x_{ij}) = 0$, $1 \le j \le p$. If some of the $f_j$'s are linear, Equation (3) reduces to the partially linear additive model (2), and the problem becomes determining which components are linear and which are nonlinear. We therefore decompose each $f_j$ into a linear part and a non-parametric part, $f_j(x) = \beta_{0j} + \beta_j x + g_j(x)$, and consider a truncated series expansion for approximating $g_j$, namely $g_{nj}(x) = \sum_{k=1}^{m_n} \theta_{jk}\varphi_k(x)$, where $\{\varphi_k(x), k = 1, \ldots, m_n\}$ is a set of basis functions and $m_n \to \infty$ at a certain rate as $n \to \infty$. If $\theta_{jk} = 0$ for all $1 \le k \le m_n$, then $f_j$ has a linear form. Based on this expansion, the task is to ascertain which of the groups $(\theta_{j1}, \ldots, \theta_{jm_n})$, $j = 1, \ldots, p$, are zero. Let $\beta = (\beta_1, \ldots, \beta_p)'$ and $\theta_n = (\theta_{1n}', \ldots, \theta_{pn}')'$, where $\theta_{jn} = (\theta_{j1}, \ldots, \theta_{jm_n})'$. Define the penalized least squares criterion

$$L(u, \beta, \theta; \lambda, \gamma) = \frac{1}{2n}\sum_{i=1}^n w_{ni}\left(y_i - u - \sum_{j=1}^p \beta_j x_{ij} - \sum_{j=1}^p\sum_{k=1}^{m_n}\theta_{jk}\varphi_k(x_{ij})\right)^2 + \sum_{j=1}^p \rho_\gamma\!\left(\|\theta_{jn}\|_{A_j}; \sqrt{m_n}\,\lambda\right) \tag{4}$$
where $\rho$ is a penalty function with penalty parameter $\lambda \ge 0$ and regularization parameter $\gamma$, and $u$ is the intercept. Here $\|\theta_{nj}\|_{A_j} = (\theta_{nj}' A_j \theta_{nj})^{1/2}$ is the norm induced by the positive definite matrix $A_j$; a suitable choice of $A_j$ facilitates the computation. Let $\tilde X_i = (n w_{ni})^{1/2}(X_i - \bar X_W)$ and $\tilde Y_i = (n w_{ni})^{1/2}(Y_i - \bar Y_W)$, where $\bar X_W = \sum_{i=1}^n w_{ni} X_i / \sum_{i=1}^n w_{ni}$ and $\bar Y_W = \sum_{i=1}^n w_{ni} Y_i / \sum_{i=1}^n w_{ni}$. The weights $w_{ni}$ in Formula (4) are then absorbed into the transformed data and no longer appear explicitly, and Formula (4) becomes

$$L(\beta, \theta; \lambda, \gamma) = \frac{1}{2n}\sum_{i=1}^n\left(\tilde y_i - \sum_{j=1}^p \beta_j \tilde x_{ij} - \sum_{j=1}^p\sum_{k=1}^{m_n}\theta_{jk}\varphi_k(\tilde x_{ij})\right)^2 + \sum_{j=1}^p \rho_\gamma\!\left(\|\theta_j\|_{A_j}; \sqrt{m_n}\,\lambda\right) \tag{5}$$
We take the Group MCP penalty as the penalty term, i.e., $\rho_\gamma(t; \lambda) = \lambda\int_0^t\left(1 - \frac{x}{\gamma\lambda}\right)_+ dx$, $t \ge 0$, where $\gamma$ controls the concavity of $\rho$ and $\lambda$ is the penalty parameter. Here $x_+$ denotes the nonnegative part of $x$, that is, $x_+ = x\,I(x \ge 0)$. We require $\lambda \ge 0$ and $\gamma > 1$. Differentiating $\rho_\gamma(t; \lambda)$ yields $\dot\rho_\gamma(t; \lambda) = \lambda\left(1 - \frac{t}{\gamma\lambda}\right)_+$, $t \ge 0$. The penalty starts out penalizing at the same rate as the group Lasso and gradually relaxes this penalization until, once $t > \gamma\lambda$, the rate of penalization drops to 0. This construction offers a spectrum of penalties, ranging from the $\ell_1$ penalty at $\gamma = \infty$ to the hard-thresholding penalty as $\gamma \to 1^+$; in particular, the Lasso penalty is recovered as the special case $\gamma = \infty$.
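For intuition, the MCP penalty and its derivative have simple closed forms that are easy to code. A minimal R sketch (helper names are ours):

```r
# MCP penalty rho_gamma(t; lambda) and its derivative, for t >= 0.
mcp_penalty <- function(t, lambda, gamma) {
  ifelse(t <= gamma * lambda,
         lambda * t - t^2 / (2 * gamma),        # integral of lambda*(1 - x/(gamma*lambda))
         gamma * lambda^2 / 2)                  # flat once t exceeds gamma*lambda
}
mcp_deriv <- function(t, lambda, gamma) {
  pmax(lambda * (1 - t / (gamma * lambda)), 0)  # penalization rate; 0 beyond gamma*lambda
}
```

Plotting `mcp_deriv` against `t` shows the defining feature of the MCP: the Lasso's constant rate $\lambda$ at the origin, decaying linearly to zero at $t = \gamma\lambda$, so large coefficients are left unpenalized.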
The penalty in Equation (4) combines the penalty function $\rho_\gamma(\cdot; \lambda)$ with a weighted $\ell_2$ norm of $\theta_j$. $\rho_\gamma(\cdot; \lambda)$ serves as a penalty for individual variable selection, and when applied to the norm of $\theta_j$ it selects the coefficients in $\theta_j$ as a group. This is desirable because the nonlinear components are captured by the coefficients in the $\theta_j$'s as groups. We term the penalty in Equation (4) the group minimax concave penalty, or simply the group MCP. The penalized least squares estimator is defined by $(\hat u_n, \hat\beta_n, \hat\theta_n) = \arg\min_{u, \beta, \theta_n} L(u, \beta, \theta_n; \lambda, \gamma)$, subject to the constraints $\sum_{i=1}^n\sum_{k=1}^{m_n}\theta_{jk}\varphi_k(x_{ij}) = 0$, $1 \le j \le p$. These centering constraints are the sample analogs of the identifying restriction $E f_j(x_{ij}) = 0$, $1 \le j \le p$. Let $z_{ij} = (\varphi_{j1}(x_{ij}), \ldots, \varphi_{jm_n}(x_{ij}))'$ denote the vector of centered basis functions at the $i$th observation of the $j$th covariate, and let $Z = (Z_1, \ldots, Z_p)$, where $Z_j = (z_{1j}, \ldots, z_{nj})'$ is the $n \times m_n$ design matrix corresponding to the $j$th expansion. Let $\tilde y = (\tilde y_1, \ldots, \tilde y_n)'$, $\tilde x_i = (\tilde x_{i1}, \ldots, \tilde x_{ip})'$, and let $\tilde X$ be the $n \times p$ matrix with rows $\tilde x_i'$. We can write

$$(\hat\beta_n, \hat\theta_n) = \mathop{\arg\min}_{\beta, \theta_n} L(\beta, \theta_n; \lambda, \gamma), \qquad L(\beta, \theta_n; \lambda, \gamma) = \frac{1}{2n}\|\tilde y - \tilde X\beta - Z\theta_n\|_2^2 + \sum_{j=1}^p \rho_\gamma\!\left(\|\theta_{nj}\|_{A_j}; \sqrt{m_n}\,\lambda\right) \tag{6}$$
Here we exclude $u$ from the arguments of $L$, since the intercept is zero as a result of the centering; the constrained optimization problem thus becomes an unconstrained one.

2.2. Penalized Profile Least Squares

The penalized profile least squares approach is used to compute $(\hat\beta_n, \hat\theta_n)$. For any given $\theta_n$, the $\beta$ minimizing $L$ satisfies $\tilde X'(\tilde y - \tilde X\beta - Z\theta_n) = 0$, which gives $\beta = (\tilde X'\tilde X)^{-1}\tilde X'(\tilde y - Z\theta_n)$. Define $Q = I - P_{\tilde X}$, where $P_{\tilde X} = \tilde X(\tilde X'\tilde X)^{-1}\tilde X'$ is the projection matrix onto the column space of $\tilde X$. Consequently, the profile objective function of $\theta_n$ becomes:

$$L(\theta_n; \lambda, \gamma) = \frac{1}{2n}\left\|Q(\tilde y - Z\theta_n)\right\|^2 + \sum_{j=1}^p \rho_\gamma\!\left(\|\theta_{nj}\|_{A_j}; \sqrt{m_n}\,\lambda\right) \tag{7}$$
We use $A_j = n^{-1}Z_j'QZ_j$; this choice of $A_j$ standardizes the covariate matrices associated with the $\theta_{nj}$'s and leads to an explicit expression for the computation in the group coordinate descent algorithm described below. For any given $(\lambda, \gamma)$, the penalized profile least squares estimator of $\theta_n$ is defined by $\hat\theta_n = \arg\min_{\theta_n} L(\theta_n; \lambda, \gamma)$, which we compute with the group coordinate descent algorithm. The set of covariates estimated to have a linear form in the regression model (1) is $\hat S_1 = \{j : \hat\theta_{nj} = 0\}$. We then obtain $\hat g_{nj}(\tilde x) = 0$ for $j \in \hat S_1$ and $\hat g_{nj}(\tilde x) = \sum_{k=1}^{m_n}\hat\theta_{jk}\varphi_k(\tilde x)$ for $j \notin \hat S_1$. Denote $\hat X_1 = (\tilde x_j, j \in \hat S_1)$, $\hat Z_2 = (Z_j : j \notin \hat S_1)$, and $\hat\theta_{n2} = (\hat\theta_{nj} : j \notin \hat S_1)$. We have $\hat\beta_n = (\tilde X'\tilde X)^{-1}\tilde X'(\tilde y - \hat Z_2\hat\theta_{n2})$, so the estimator of the coefficients of the linear components is $\hat\beta_{n1} = (\hat\beta_j : j \in \hat S_1)$; equivalently, with $\hat g(\tilde x) = (\hat g_1(\tilde x), \ldots, \hat g_p(\tilde x))'$, it can be written as $\hat\beta_{n1} = (\hat X_1'\hat X_1)^{-1}\hat X_1'\big(y - \sum_{j \notin \hat S_1}\hat g_j(\tilde x_j)\big)$. The coefficients of the linear and nonlinear parts can thus be identified and estimated, completing the structure identification for the non-parametric additive AFT model.
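In code, the profiling step is a single projection. A minimal R sketch (names ours; `Xt` is the weighted, centered linear design $\tilde X$ and `Z2` the spline design of the groups kept as nonlinear):

```r
# Projection onto the orthogonal complement of the column space of Xt.
profile_Q <- function(Xt) {
  diag(nrow(Xt)) - Xt %*% solve(crossprod(Xt), t(Xt))
}

# After theta has been estimated on the profiled data, recover the
# linear-part coefficients: beta = (Xt'Xt)^{-1} Xt'(y - Z2 theta2).
recover_beta <- function(Xt, y, Z2, theta2) {
  drop(solve(crossprod(Xt), crossprod(Xt, y - Z2 %*% theta2)))
}
```

Forming $Q$ explicitly costs $O(n^2)$ memory; for large $n$ one would instead apply the projection via a QR decomposition of `Xt`, but the dense form matches the formulas above.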

3. Computation

3.1. Computation Algorithm

Assume that the groups are orthonormal, i.e., $X_j'X_k = 0$ for $j \neq k$ and $X_j'X_j/n = I_{d_j}$, and let $z = X_j'y/n$ be the least squares estimate of $\theta$. With the multivariate soft-threshold operator $S(z; t) = (1 - t/\|z\|_2)_+\, z$, the group coordinate descent solution for the group Lasso can be written as $\hat\theta_{\mathrm{gLASSO}}(z; \lambda) = S(z, \lambda)$. When $\gamma > 1$, the group MCP solution can be expressed as

$$\hat\theta_{\mathrm{gMCP}}(z; \lambda, \gamma) = \begin{cases}\dfrac{\gamma}{\gamma - 1}\,S(z, \lambda), & \text{if } \|z\|_2 \le \gamma\lambda, \\[2mm] z, & \text{if } \|z\|_2 > \gamma\lambda.\end{cases}$$

As $\gamma \to \infty$, $\hat\theta_{\mathrm{gMCP}}(\cdot; \lambda, \gamma) \to \hat\theta_{\mathrm{gLasso}}(\cdot; \lambda)$, while for $\lambda > 0$ and $\gamma \to 1^+$, $\hat\theta_{\mathrm{gMCP}}(\cdot; \lambda, \gamma) \to H(\cdot; \lambda)$, where $H$ is the hard-threshold operator

$$H(z; \lambda) = \begin{cases}0, & \text{if } \|z\|_2 \le \lambda, \\ z, & \text{if } \|z\|_2 > \lambda.\end{cases}$$

Thus the family $\{\hat\theta_{\mathrm{gMCP}}(\cdot; \lambda, \gamma) : 1 < \gamma \le \infty\}$ bridges the hard-threshold operator (as $\gamma \to 1^+$) and the soft-threshold operator (at $\gamma = \infty$).
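These two operators translate directly into R. A small sketch (our names), useful for checking the limiting behavior numerically:

```r
# Groupwise soft-threshold operator S(z; t).
soft_threshold <- function(z, t) {
  nz <- sqrt(sum(z^2))
  if (nz == 0) return(z)           # zero vector stays zero
  max(1 - t / nz, 0) * z
}

# Orthonormal-case group MCP solution; reduces to soft thresholding
# as gamma -> Inf and to hard thresholding as gamma -> 1+.
gmcp_solution <- function(z, lambda, gamma) {
  stopifnot(gamma > 1)
  if (sqrt(sum(z^2)) <= gamma * lambda) {
    gamma / (gamma - 1) * soft_threshold(z, lambda)
  } else {
    z
  }
}
```

For example, `gmcp_solution(c(3, 4), lambda = 1, gamma = 1e6)` is essentially the soft-threshold result, while `gamma = 1.01` leaves `c(3, 4)` untouched because its norm, 5, exceeds $\gamma\lambda \approx 1.01$.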
The group coordinate descent algorithm is used to compute $\hat\theta_n$ in this paper. The GCD algorithm is a natural extension of the standard coordinate descent algorithm [22,23] commonly used in optimization problems involving convex penalties such as the Lasso. It optimizes the objective function one group at a time, cycling through all groups iteratively until convergence. It is particularly well suited to computing $\hat\theta_n$ because the single-group model has the straightforward closed-form solution given in (8) below. Write $A_j = R_j'R_j$ for an $m_n \times m_n$ upper triangular matrix $R_j$ obtained via the Cholesky decomposition, and let $b_j = R_j\theta_j$, $\tilde y = Qy$, and $\tilde Z_j = QZ_jR_j^{-1}$. Simple algebra shows that $L(b; \lambda, \gamma) = (1/2n)\|\tilde y - \sum_{j=1}^p \tilde Z_j b_j\|^2 + \sum_{j=1}^p\rho_\gamma(\|b_j\|; \sqrt{m_n}\,\lambda)$. Note that $n^{-1}\tilde Z_j'\tilde Z_j = (R_j')^{-1}(n^{-1}Z_j'QZ_j)R_j^{-1} = I_{m_n}$. Let $\tilde y_j = \tilde y - \sum_{k \neq j}\tilde Z_k b_k$ and denote $L_j(b_j; \lambda, \gamma) = (1/2n)\|\tilde y_j - \tilde Z_j b_j\|_2^2 + \rho_\gamma(\|b_j\|; \sqrt{m_n}\,\lambda)$. Let $\eta_j = (\tilde Z_j'\tilde Z_j)^{-1}\tilde Z_j'\tilde y_j = n^{-1}\tilde Z_j'\tilde y_j$. When $\gamma > 1$, the value minimizing $L_j$ with respect to $b_j$ is

$$\tilde b_{j,GM}(\lambda, \gamma) = M(\eta_j; \lambda, \gamma) = \begin{cases}0, & \text{if } \|\eta_j\| \le \sqrt{m_n}\,\lambda, \\[1mm] \dfrac{\gamma}{\gamma - 1}\left(1 - \dfrac{\sqrt{m_n}\,\lambda}{\|\eta_j\|}\right)\eta_j, & \text{if } \sqrt{m_n}\,\lambda < \|\eta_j\| \le \gamma\sqrt{m_n}\,\lambda, \\[1mm] \eta_j, & \text{if } \|\eta_j\| > \gamma\sqrt{m_n}\,\lambda. \end{cases} \tag{8}$$

In particular, when $\gamma = \infty$ we have $\tilde b_{j,GL} = (1 - \sqrt{m_n}\,\lambda/\|\eta_j\|)_+\,\eta_j$, the group Lasso estimate for a single-group model. Based on these expressions, the GCD algorithm can be implemented as follows (an illustrative code sketch is given after the steps below). Suppose the group coefficients $\tilde b_k^{(s)}$, $k \neq j$, are given and we wish to minimize $L$ with respect to $b_j$. Define $L_j(b_j; \lambda, \gamma) = (1/2n)\|\tilde y - \sum_{k \neq j}\tilde Z_k\tilde b_k^{(s)} - \tilde Z_j b_j\|^2 + \rho_\gamma(\|b_j\|; \sqrt{m_n}\,\lambda)$, denote $\tilde y_j = \sum_{k \neq j}\tilde Z_k\tilde b_k^{(s)}$ and $\eta_j = n^{-1}\tilde Z_j'(\tilde y - \tilde y_j)$, and let $\tilde b_j$ denote the minimizer of $L_j$; when $\gamma > 1$, $\tilde b_j = M(\eta_j; \lambda, \gamma)$ as given in (8). Equation (8) is applied to one group at a time for any given $(\lambda, \gamma)$, starting from an initial value $\tilde b^{(0)} = (\tilde b_1^{(0)}, \ldots, \tilde b_p^{(0)})$. The proposed GCD algorithm is as follows. Initialize the residual vector $r = y - \hat y$ with $\hat y = \sum_{j=1}^p \tilde Z_j \tilde b_j^{(0)}$. For $s = 0, 1, \ldots$, carry out the following calculation until convergence. For $j = 1, \ldots, p$, repeat the following steps:
(1) Calculate $\tilde\eta_j = n^{-1}\tilde Z_j' r + \tilde b_j^{(s)}$.
(2) Update $\tilde b_j^{(s+1)} = M(\tilde\eta_j; \lambda, \gamma)$.
(3) Update $r \leftarrow r - \tilde Z_j\big(\tilde b_j^{(s+1)} - \tilde b_j^{(s)}\big)$ and $j \leftarrow j + 1$.
The final step ensures that $r$ always holds the current residuals. While the objective function is not necessarily convex overall, it is convex with respect to an individual group when the coefficients of all other groups are held fixed.
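To make the iteration concrete, here is a minimal R sketch of the GCD loop for the orthonormalized design (all names are ours; it assumes each $\tilde Z_j$ has already been transformed so that $n^{-1}\tilde Z_j'\tilde Z_j = I_{m_n}$, and the $\sqrt{m_n}\lambda$ scaling is folded into `lam`):

```r
# Group MCP threshold M(eta; lam, gamma) from Equation (8).
gmcp_M <- function(eta, lam, gamma) {
  ne <- sqrt(sum(eta^2))
  if (ne <= lam) return(0 * eta)
  if (ne <= gamma * lam) return(gamma / (gamma - 1) * (1 - lam / ne) * eta)
  eta
}

# Group coordinate descent: Zt is a list of n x m_n matrices (tilde Z_j),
# y the profiled response; returns the list of group coefficients b_j.
gcd_gmcp <- function(Zt, y, lam, gamma, max_iter = 200, tol = 1e-6) {
  n <- length(y)
  b <- lapply(Zt, function(Z) numeric(ncol(Z)))
  r <- y                                               # residuals with all b_j = 0
  for (s in seq_len(max_iter)) {
    delta <- 0
    for (j in seq_along(Zt)) {
      eta  <- drop(crossprod(Zt[[j]], r)) / n + b[[j]]   # step (1)
      bnew <- gmcp_M(eta, lam, gamma)                    # step (2)
      r    <- drop(r - Zt[[j]] %*% (bnew - b[[j]]))      # step (3)
      delta <- max(delta, max(abs(bnew - b[[j]])))
      b[[j]] <- bnew
    }
    if (delta < tol) break                             # groupwise updates converged
  }
  b
}
```

Each sweep costs $O(n\sum_j m_n)$ operations and requires no Hessian inversion, which is the efficiency advantage over Newton-type algorithms noted in the Introduction.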

3.2. Tuning Parameter Selection

Methods such as AIC, BIC, and generalized cross-validation (GCV) are widely used for tuning parameter selection. Let $L$ be the likelihood function, let $\|\cdot\|_q$ denote the $L_q$ norm of a vector, and let $P_\lambda$ be a penalty function indexed by the parameter $\lambda > 0$. Penalized structure identification selects the important variables by finding the extremum of the objective function $(1/n)L(\beta) - \sum_{j=1}^p P_\lambda(|\beta_j|)$. Tibshirani [1] used the $L_1$ norm as the penalty function to obtain the Lasso. The AIC criterion, $AIC = -2\log L + 2k$, addresses the over-fitting problem in which the value of the likelihood function increases with the number of parameters; the BIC criterion, $BIC = -2\log L + k\ln n$, penalizes the number of parameters more strongly. Here $L$ is the maximum value of the likelihood function and $k$ is the number of parameters in the model. As $\lambda \to 0$, $\hat\beta$ approaches the ordinary least squares estimate; as $\lambda \to \infty$, almost only the penalty term remains in the selection criterion. We therefore use the faster BIC method to select the parameters of each concave penalty model. The BIC criterion takes the form $BIC(\lambda, d_n) = \log(RSS(\lambda, d_n)) + \log n \cdot df(\lambda, d_n)/n$, where $RSS$ is the residual sum of squares and $df$ is the number of variables selected for a given $(\lambda, d_n)$. $d_n$ is selected from an increasing sequence of numbers of spline nodes, and then, for each given $d_n$, $\lambda$ is selected from a grid of length 100. The maximum of the grid is $\lambda_{\max} = \max_{1 \le j \le p}\|\tilde Z_j'Y\|_2/\sqrt{d_n}$, where $\tilde Z_j$ is the $n \times d_n$ matrix for covariate $X_j$, $j = 1, \ldots, p$; the minimum is $0.01\lambda_{\max}$.
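A sketch of this tuning loop in R, reusing the `gcd_gmcp` helper above (the $df$ convention — number of selected groups times $d_n$ — and the log-spaced grid are our assumptions, not necessarily the authors' exact choices):

```r
# BIC-based selection of lambda for a fixed number of basis functions d_n.
bic_select <- function(Zt, y, dn, gamma, n_lambda = 100) {
  n <- length(y)
  lam_max <- max(sapply(Zt, function(Z) sqrt(sum(crossprod(Z, y)^2)) / sqrt(dn)))
  lambdas <- exp(seq(log(lam_max), log(0.01 * lam_max), length.out = n_lambda))
  bics <- sapply(lambdas, function(lam) {
    b <- gcd_gmcp(Zt, y, lam, gamma)
    fitted <- Reduce(`+`, Map(`%*%`, Zt, b))
    rss <- sum((y - fitted)^2)
    df <- sum(sapply(b, function(bj) any(bj != 0))) * dn   # selected groups x d_n
    log(rss) + log(n) * df / n
  })
  lambdas[which.min(bics)]
}
```

The same loop is wrapped in an outer loop over the candidate $d_n$ values, and the $(\lambda, d_n)$ pair with the smallest BIC is retained.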

4. Theoretical Properties of Group MCP

Let $|A|$ denote the cardinality of any set $A \subset \{1, \ldots, p\}$, and let $X_A = (X_j, j \in A)$ and $\Sigma_A = X_A'WX_A/n$, where $X$ is the $n \times p$ covariate matrix and $W = \mathrm{diag}(nw_1, \ldots, nw_n)$. Let $\beta_0 = (\beta_{01}, \ldots, \beta_{0p})'$ be the true regression coefficient vector, let $A_1 = \{j : \beta_{0j} \neq 0\}$ be the set of nonzero regression coefficients, and let $q = |A_1|$ denote the number of elements in $A_1$. We impose the following conditions:
(C1) $E g(x_j) = 0$, and there are constants $C_1$ and $C_2$ such that the density function $\eta_j(x)$ of $x_j$ satisfies $0 < C_1 \le \eta_j(x) \le C_2 < \infty$ on $[a, b]$ for $1 \le j \le p$.
(C2) $(X_i, \delta_i, Y_i)$, $i = 1, \ldots, n$, are independent and identically distributed (i.i.d.), the error terms $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$, and there exist constants $K_1, K_2 > 0$ such that for all $x \ge 0$, $P(|\varepsilon_i| > x) \le K_2\exp(-K_1 x^2)$.
(C3) The error terms $(\varepsilon_1, \ldots, \varepsilon_n)$ are independent of the Kaplan–Meier weights $(w_1, \ldots, w_n)$, and there is a constant $M$ such that $|X_{ij}| \le M$ for all $1 \le i \le n$, $1 \le j \le p$; that is, the covariates are bounded.
(C4) The covariate matrix satisfies the sparse Riesz condition (SRC): there exist constants $0 < c_* < c^* < \infty$ such that, with probability converging to 1, $c_* \le v'\Sigma_A v/\|v\|_2^2 \le c^*$ for every $A$ with $|A| \le q^*$ and every $v \neq 0$, where $q^* = 3 + 4Cq$ and $C = c^*/c_*$.
Condition (C1) ensures that the model is sparse even when the number of covariates is large; that is, the number of covariates with nonzero coefficients can be controlled to a small number. Condition (C2) ensures that the tail probability assumption remains valid for high-dimensional linear regression. Under condition (C3), the sub-Gaussian property of the model is guaranteed even when the data are censored. Condition (C4) ensures that the model satisfies the SRC condition; that is, the eigenvalues of $X'WX/n$ restricted to any model of dimension smaller than $q^*$ always lie between $c_*$ and $c^*$, so any such model can be identified. Let $\tilde\beta = (\tilde\beta_1, \ldots, \tilde\beta_p)'$ denote the estimated coefficients and $\tilde A_1 = \{j : \tilde\beta_j \neq 0\}$ the set of all nonzero coefficients. Let $f_j(x) = \beta_{0j} + \beta_j x + g_j(x)$ denote the true regression component, let $g_{nj}(x) = \sum_{k=1}^{m_n}\theta_{jk}\varphi_k(x)$, $j = 1, \ldots, p$, be the B-spline basis expansion of $g_j(x)$, and let $S_1 = \{j : \|g_{nj}(x)\|_2 = 0\} = \{j : \|\theta_{nj}\|_2 = 0\}$. Let $q = |S_1|$ be the cardinality of $S_1$, i.e., the number of linear components in the AFT regression model. Define
$$\tilde\theta_n = \mathop{\arg\min}_{\theta_n}\left\{\frac{1}{2n}\left\|Q(y - Z\theta_n)\right\|^2 : \theta_{nj} = 0,\ j \in S_1\right\} \tag{9}$$
This is the oracle estimator of $\theta_{0n}$ under the assumption that the identity of the linear components is known. It is worth noting that the oracle estimator cannot be computed in practice, since $S_1$ is unknown; nevertheless, we employ it as the reference point for evaluating our proposed estimator. In parallel with the actual estimates outlined in Section 2.2, define the oracle estimators $\tilde g_{nj}(x) = 0$ for $j \in S_1$ and $\tilde g_{nj}(x) = \sum_{k=1}^{m_n}\tilde\theta_{jk}\varphi_k(x)$ for $j \in S_2$. Denote $X_1 = (x_j, j \in S_1)$, $X_2 = (x_j, j \in S_2)$, and $\tilde\theta_{n2} = (\tilde\theta_{nj}, j \in S_2)$, and let $\tilde f_j(x) = \tilde\beta_{0j} + \tilde\beta_j x + \tilde g_j(x)$, $j \in S_2$. The oracle estimator of the coefficients of the linear components is $\tilde\beta_{n1} = (X_1'X_1)^{-1}X_1'\big(y - \sum_{j \in S_2}\tilde f_j(x_j)\big)$. Without loss of generality, suppose that $S_1 = \{1, \ldots, q\}$ and write $\tilde\theta_n = (O_{qm_n}', \tilde\theta_{n2}')'$, where $O_{qm_n}$ is a $qm_n$-dimensional vector of zeros and $\tilde\theta_{n2} = (Z_2'QZ_2)^{-1}Z_2'Qy$. Let $\theta_* = \min_{j \in S_2}\|\theta_{0nj}\|$ denote the minimal coefficient norm in the B-spline expansions of the nonlinear components. Consider a non-negative integer $k$ and $0 < \alpha \le 1$ such that $d = k + \alpha > 0.5$, and define $\mathcal{G}$ as the class of functions $g$ on $[0, 1]$ whose $k$th derivative $g^{(k)}$ exists and satisfies a Lipschitz condition of order $\alpha$: $|g^{(k)}(s) - g^{(k)}(t)| \le C|s - t|^\alpha$ for $s, t \in [a, b]$.
Theorem 1.
Suppose that $m_n = O(n^{1/(2d+1)})$, that $1/(m_n\gamma)$ is less than the smallest eigenvalue of $Z'QZ/n$, and that
$$\frac{1}{m_n^{(2d-1)/2}\,\theta_*}\left(\lambda\gamma + \frac{1}{\lambda n}\right) \to 0.$$
Then under (C1)–(C3), $P(\hat\theta_n \neq \tilde\theta_n) \to 0$. Consequently, $P(\hat S_1 = S_1) \to 1$, $P(\hat\beta_{n1} = \tilde\beta_{n1}) \to 1$, and $P(\|\hat f_{nj}(x) - \tilde f_{nj}(x)\|_2 = 0,\ j \in S_2) \to 1$. Hence, under the conditions specified in Theorem 1, the proposed estimator can differentiate between linear and nonlinear components with high probability. Additionally, the proposed estimator exhibits the oracle property: it matches the performance of the oracle estimator, which presumes knowledge of the linear and nonlinear component identities, except on events with vanishingly small probability.
Theorem 2.
Suppose (C1)–(C3) hold. Then
$$\sum_{j=1}^p\left\|\hat f_{nj}(x) - f_{0j}(x)\right\|_2^2 \le O_p\!\left(\frac{m_n}{n}\right) + O\!\left(\frac{1}{m_n^{2d}}\right) + O\!\left(m_n\lambda^2\right).$$
Theorem 2 provides the convergence rate of the proposed estimator within the non-parametric additive model, which encompasses partially linear models as special cases. Specifically, if each component is twice differentiable ($d = 2$) and we take $m_n = O(n^{1/5})$ and $\lambda = n^{-1/2+\delta}$ for a small $\delta > 0$, then $m_n/n = O(n^{-4/5})$, $m_n^{-2d} = O(n^{-4/5})$, and $m_n\lambda^2 = O(n^{-4/5+2\delta})$, so $\sum_{j=1}^p\|\hat f_{nj}(x) - f_{0j}(x)\|_2^2 = O_p(n^{-4/5})$, the optimal convergence rate in non-parametric regression. We now explore the asymptotic distribution of $\hat\beta_{n1}$. Denote $\mathcal{H}_j = \{h_j = (h_{jk} : k \in S_1)' : E h_{jk}^2(u) < \infty\}$, $j \in S_2$; each element of $\mathcal{H}_j$ is an $|S_1|$-vector of square-integrable functions with mean zero. Let the sumspace be $\mathcal{H} = \{h = \sum_{j \in S_2} h_j : h_j \in \mathcal{H}_j\}$. The projection of the centered covariate vector $x_1 - E(x_1) \in \mathbb{R}^q$ onto the sumspace $\mathcal{H}$ is defined to be the $(h_1^*, \ldots, h_r^*)$ with $E h_j^*(x_j) = 0$, $j \in S_2$, that minimizes $W(h) = E\|x_1 - E(x_1) - \sum_{j \in S_2} h_j(x_j)\|^2$. For $x_2 = (x_j : j \in S_2)$, denote $h^*(x_2) = \sum_{j \in S_2} h_j^*(x_j)$. The orthogonal projection $h^*$ onto $\mathcal{H}$ is therefore well-defined and unique, and each individual component $h_j^*$ is also well-defined and unique.
Theorem 3.
Assume the conditions stated in Theorem 1 hold, (C4) is fulfilled, and $\Sigma_{A_1}$ is non-singular. Then $\sqrt{n}\,(\hat\beta_{n1} - \beta_1) \to_d N(0, \Sigma)$, where $\beta_1 = (\beta_j : j \in S_1)$ and $\Sigma = \sigma^2\Sigma_{A_1}^{-1}$.
Theorem 3 provides sufficient conditions under which the proposed estimator $\hat\beta_{n1}$ of the linear components in the model is asymptotically normal, with the same limiting normal distribution as the oracle estimator $\tilde\beta_{n1}$. Suppose that the first $q$ additive components are important functions and the remaining $p - q$ are unimportant, and let $A_0 = \{q + 1, \ldots, p\}$ be the set of unimportant functions. Let $X = (X_1, \ldots, X_p)$ and $\Sigma = X'X/n$; for any $A \subset \{1, \ldots, p\}$, let $X_A = (X_j, j \in A)$ and $\Sigma_A = X_A'X_A/n$, let $|A|$ denote the cardinality of $A$, and let $d_A = |A|d_n$.

5. Numerical Simulation and Empirical Analysis

5.1. Numerical Simulation

Simulation is employed to assess the finite-sample performance of the group MCP method. Two examples are included. For each simulated model, we consider two sample sizes ($n = 100, 200$) and conduct a total of 100 replications. We examine the following four functions defined on $[0, 1]$: $f_1(X_1) = \sin(2X_1)$, $f_2(X_2) = \cos(X_2)$, $f_3(X_3) = 5X_3$, and $f_4(X_4) = e^{X_4} - 2.5$. In the implementation, we utilize B-splines with seven basis functions to approximate each function.
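To illustrate the setup, the following R sketch generates data from the four test functions and performs group selection on their centered B-spline expansions. Details the text leaves unstated (the covariate distribution, the error standard deviation) are our assumptions, and the grpreg package's group MCP is used as a convenient stand-in for the authors' own GCD implementation, which additionally profiles out the linear parts:

```r
library(splines)  # bs(): B-spline basis
library(grpreg)   # grpreg(): group MCP / group Lasso solvers

set.seed(1)
n <- 200
x <- matrix(runif(n * 4), n, 4)                     # assumed U[0,1] covariates
y <- sin(2 * x[, 1]) + cos(x[, 2]) + 5 * x[, 3] +
     exp(x[, 4]) - 2.5 + rnorm(n, sd = 0.3)         # assumed noise level

# Seven B-spline basis functions per covariate, centered column-wise.
B <- lapply(1:4, function(j) scale(bs(x[, j], df = 7), scale = FALSE))
Z <- do.call(cbind, B)
grp <- rep(1:4, each = 7)

fit <- grpreg(Z, y, group = grp, penalty = "grMCP", gamma = 3)
sel <- select(fit, criterion = "BIC")               # BIC-tuned lambda
sapply(split(sel$beta[-1] != 0, grp), any)          # groups kept as nonzero
```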
Based on $n = 200$, Figure 1 shows that when the Group MCP method is used with the B-spline expansion, the estimated functions fit the true functions well. In addition, we consider a second model without an intercept term, in which the covariates $X_j$, $j = 1, \ldots, 6$, are independent and identically distributed, $\varepsilon \sim N(0, 0.1)$, and the component functions are $f_1(X_1) = 3X_1$, $f_2(X_2) = 2\sin(2X_2)$, $f_3(X_3) = X_3^2 - 0.75$, and $f_4(X_4) = e^{X_4} - 25/12$. Let $q = 6$ and consider the model $y = 3f_1(x_1) + 4f_1(x_2) - 2f_1(x_3) + 8f_2(x_4) + 6f_3(x_5) + 5f_4(x_6) + \varepsilon$. In this model, the first three variables have a linear effect, while the last three have a nonlinear effect. For $n = 200$, Figure 2 demonstrates that, for the non-parametric additive accelerated failure time model, the non-parametric component estimates fit the true functions well after B-spline estimation. In Figure 1 and Figure 2, the red dashed line represents the estimated function and the black solid line the true function.
Table 1 displays simulation results based on 1000 replications. The columns provide the following information: the average number of selected nonlinear components (NL), the average model error (ER), the percentage of occasions on which the correct nonlinear components are included in the selected model (IN%), the percentage of occasions on which exactly the correct nonlinear components are chosen in the final model (CS%), and, to compare computational efficiency, the running time in minutes (Time). The corresponding standard errors are given in parentheses. The Group MCP penalty outperforms the Group Lasso in terms of both IN% and CS%. As the sample size increases from 100 to 500, both methods improve in including all the nonlinear components (IN%) and selecting the exactly correct model (CS%); this improvement is expected, since larger samples provide more information about the underlying model. The computational efficiency of group MCP also surpasses that of group Lasso. Table 2 reports the mean squared error with which each component function is estimated; it shows that the Group MCP method distinguishes between linear and nonlinear functions more accurately than the Group Lasso and yields smaller mean squared errors, indicating more accurate estimation. The research demonstrates that the proposed Group MCP approach is effective in distinguishing between linear and nonlinear components in the simulated models, thereby enhancing model selection and estimation accuracy.

5.2. Lung Cancer Data Example

This study is based on survival analysis using the survival time data of 442 lung cancer patients and the gene expression data of 22,283 genes extracted from tumor samples. These data are available from the official website of the National Cancer Institute (http://cancergenome.nih.gov/) (accessed on 12 November 2023). In the original data, a two-column matrix denoted as T represents the survival data. The first column contains survival time in months, while the second column serves as an indicator function where 1 represents the state of death, and 0 represents the state of survival. The measured gene expression data are represented as X, with 22,283 gene expressions. The objective of this study is to identify covariates with nonlinear effects on survival time.
Due to the high dimensionality of the original data ($p = 22{,}283$, $n = 442$), it is necessary to reduce the data from high-dimensional to low-dimensional. The null hypothesis is that the correlation coefficient between an independent variable and the dependent variable equals 0; the alternative hypothesis is that it does not. An R program is used to compute the p-value of the correlation test between each gene expression and survival time. When the p-value is smaller than the critical value, the null hypothesis is rejected in favor of the alternative, indicating a significant correlation between the independent variable and the dependent variable; a smaller p-value provides stronger evidence of an association between gene expression and survival time. In this study, the p-values of the independent variables are computed and sorted in ascending order, and the 50 independent variables with the smallest p-values are selected as input variables. The remaining gene expressions are discarded, achieving an initial dimensionality reduction. As a result, the original data are reduced to lower-dimensional data ($p = 50$, $n = 442$), within which covariates with nonlinear effects on survival time are then identified.
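A minimal R sketch of this marginal screening step (our own function names; the text does not specify whether the correlation is computed on the raw or the log time scale, nor how censoring is handled in the screening, so this sketch uses the log scale to match the AFT model and ignores the censoring indicator):

```r
# Marginal correlation screening: keep the genes whose correlation with
# log survival time has the smallest p-value.
# X: n x p expression matrix, time: survival time in months.
screen_genes <- function(X, time, keep = 50) {
  pvals <- apply(X, 2, function(g) cor.test(g, log(time))$p.value)
  order(pvals)[seq_len(keep)]     # column indices of the retained genes
}
# Usage: idx <- screen_genes(X, time); X_reduced <- X[, idx]
```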
Figure 3 displays the frequency distribution histograms of four randomly selected gene expressions, indicating that the distributions of all four are skewed. Given the skewed data, this study adopted a non-parametric additive AFT model, with B-spline basis functions used to expand each covariate in the non-parametric part. The Group MCP method was employed to select and shrink the coefficients of the B-spline basis functions, ultimately identifying the gene expressions with nonlinear effects on survival time. Furthermore, Table 3 compares the results selected by the Group Lasso and Group MCP penalization methods. Under the Group Lasso penalization, all gene symbols were selected, indicating a tendency to over-select nonlinear variables; in contrast, Group MCP outperformed Group Lasso in selecting nonlinear variables. Genes 219720_s_at, 214991_s_at, and 210802_s_at were selected simultaneously, indicating that these three gene expressions are nonlinear variables. The three selected genes are associated with lung cancer research and can potentially be used to identify cancer biomarkers, understand tumor biology, and develop treatment strategies. A comprehensive assessment of the significance of these specific genes in cancer research will require further experimentation and literature study, supported by specialized knowledge in cancer biology and experimental data; this represents a direction for future research.
The analysis compares the selection results of Group Lasso and Group MCP. Table 3 shows that all gene symbols are selected by the Group Lasso penalty (√ indicates that a gene has been selected). This suggests that Group Lasso tends to over-select nonlinear variables, potentially including some variables without true nonlinear effects. Group MCP performs better than Group Lasso in selecting nonlinear variables and offers a more effective approach to identifying genes with nonlinear relationships with survival time. Lastly, the genes with symbols 219720_s_at, 214991_s_at, and 210802_s_at are selected by both penalty methods; this consistent selection across different penalties strengthens the evidence that these three gene expressions have nonlinear effects on survival time. These results underscore the superior performance of the Group MCP penalty method in accurately identifying genes with nonlinear relationships in high-dimensional data, particularly in the context of survival time analysis.

6. Concluding Remarks

This paper introduces a semi-parametric regression pursuit method for distinguishing between linear and nonlinear components in semi-parametric partially linear models. The approach adaptively determines the parametric and non-parametric components of the semi-parametric model from the available data, in contrast to the standard semi-parametric inference approach, in which the parametric and non-parametric components are pre-specified before analysis. The study demonstrated that the proposed method possesses the oracle property: with high probability, it performs as well as the standard semiparametric estimator that assumes the model structure is known. A simulation study confirmed the effectiveness of the proposed method in finite samples. It is worth noting that the semi-parametric regression pursuit method is primarily applicable to partially linear models in which the number of covariates ($p$) is smaller than the number of observations ($n$). Genomic datasets, however, may be of higher dimension ($p > n$). When $p > n$ and the model is sparse, that is, the number of important covariates is much smaller than $n$, it may be necessary to first reduce the model dimension; once the dimension is reduced, the proposed semiparametric regression pursuit method can be applied effectively to distinguish linear from nonlinear components. This research provides a valuable tool for model selection and feature identification in semiparametric modeling, and it highlights the potential need for dimensionality reduction in high-dimensional datasets.
This paper exclusively investigated the application of the group MCP penalty method to high dimensional non-parametric additive accelerated failure time models. Further research can be conducted to study the performance and theoretical properties of the group MCP penalty method in high-dimensional semiparametric accelerated failure time models. Additionally, its characteristics can be elucidated based on single-index models.

Author Contributions

Methodology, S.H.; Software, S.H.; Validation, S.H.; Formal analysis, S.H.; Investigation, S.H.; Resources, S.H.; Data curation, S.H.; Writing—original draft, S.H.; Writing—review & editing, H.L.; Visualization, H.L.; Supervision, H.L.; Project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://cancergenome.nih.gov/ (accessed on 12 November 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  2. Breiman, L. Heuristics of instability and stabilization in model selection. Ann. Stat. 1996, 24, 2350–2383. [Google Scholar] [CrossRef]
  3. Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563. [Google Scholar]
  4. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
  5. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  6. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
  7. Heller, G. Smoothed rank regression with censored data. J. Am. Stat. Assoc. 2007, 102, 552–559. [Google Scholar] [CrossRef]
  8. Gu, C. Model diagnostics for smoothing spline ANOVA models. Can. J. Stat. 2004, 32, 347–358. [Google Scholar] [CrossRef]
  9. Schumaker, L. Spline Functions: Basic Theory; Wiley: New York, NY, USA, 1981. [Google Scholar]
  10. Johnson, B.A. Rank-based estimation in the ℓ1-regularized partly linear model for censored outcomes with application to integrated analyses of clinical predictors and gene expression data. Biostatistics 2009, 10, 659–666. [Google Scholar] [CrossRef]
  11. Huang, J.; Ma, S. Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Anal. 2010, 16, 176–195. [Google Scholar] [CrossRef]
  12. Long, Q.; Chung, M.; Moreno, C.S.; Johnson, B.A. Risk prediction for prostate cancer recurrence through regularized estimation with simultaneous adjustment for nonlinear clinical effects. Ann. Appl. Stat. 2011, 5, 2003–2023. [Google Scholar] [CrossRef] [PubMed]
  13. Wei, H.; Kang, P.; Liu, Y. Application Extension of Subset Identification Method Based on Adaptive Elastic Net and Accelerated Failure Time Model. South. Med. Univ. J. 2021, 41, 391–398. [Google Scholar]
  14. Wu, D.; Gao, Q. Empirical Likelihood Inference for Accelerated Failure Time Models with Right-Censored Data; Zhejiang University of Finance and Economics: Hangzhou, China, 2019. [Google Scholar]
  15. Cai, H.; Kang, F.; Wu, F. Comparison of Clustering Survival Data in Parametric and Semi-Parametric Accelerated Failure Time Models. J. Beijing Univ. Inf. Sci. Technol. Nat. Sci. Ed. 2020, 35, 8–14. [Google Scholar]
  16. Liu, L.; Wang, H.; Liu, Y.; Huang, J. Model pursuit and variable selection in the additive accelerated failure time model. Stat. Pap. 2021, 62, 2627–2659. [Google Scholar] [CrossRef]
  17. Cui, X.; Peng, H.; Wen, S. Component selection in the additive regression model. Scand. J. Stat. 2013, 40, 491–510. [Google Scholar] [CrossRef]
  18. Leng, C.; Ma, S. Accelerated failure time models with nonlinear covariates effects. Aust. N. Z. J. Stat. 2007, 49, 155–172. [Google Scholar] [CrossRef]
  19. Golub, G.H.; Heath, M.; Wahba, G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 1979, 21, 215–223. [Google Scholar] [CrossRef]
  20. Stute, W. Almost sure representations of the product-limit estimator for truncated data. Ann. Stat. 1993, 21, 146–156. [Google Scholar] [CrossRef]
  21. Huang, J.; Ma, S. Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics 2006, 62, 813–820. [Google Scholar] [CrossRef]
  22. Fu, W.J. Penalized regressions: The bridge versus the lasso. J. Comput. Graph. Stat. 1998, 7, 397–416. [Google Scholar]
  23. Wu, T.; Lange, K. Coordinate descent algorithms for Lasso penalized regression. Ann. Appl. Stat. 2008, 2, 224–244. [Google Scholar] [CrossRef]
Figure 1. Simulation of one linear and three nonlinear B-spline estimates.
Figure 2. Simulation of three linear and three nonlinear B-spline estimates.
Figure 3. The frequency distribution histogram of four arbitrarily selected genes.
Table 1. The performance of group LASSO and group MCP.

Method        NL           ER           IN%         CS%         Time (min)
p = 6, n = 100
group LASSO   1.25 (1.14)  0.31 (0.16)  100 (0.00)  100 (0.00)  2.51
group MCP     2.35 (1.01)  0.24 (0.15)  100 (0.00)  100 (0.00)  1.23
p = 6, n = 200
group LASSO   0.26 (0.51)  0.15 (0.05)  100 (0.00)  100 (0.00)  4.49
group MCP     0.69 (0.67)  0.13 (0.04)  100 (0.00)  100 (0.00)  3.47
p = 6, n = 500
group LASSO   0.19 (0.24)  0.11 (0.01)  100 (0.00)  100 (0.00)  7.36
group MCP     0.36 (0.27)  0.08 (0.01)  100 (0.00)  100 (0.00)  6.18
Note: the corresponding standard errors are in parentheses.
Table 2. Mean square error of important functions.

Method        f1(·)   f2(·)   f3(·)   f4(·)   f5(·)   f6(·)
n = 100
Group Lasso   24.44   52.29   20.00   79.21   18.44    67.58
Group MCP     22.88   45.17   17.79   69.88   20.80   111.05
n = 200
Group Lasso   28.30   43.77   11.48   68.04   10.90    23.08
Group MCP     24.55   38.93    9.68   62.85   15.73    25.45
n = 500
Group Lasso   30.40   23.78    8.17   32.34    4.92     6.13
Group MCP     27.62   16.82    5.46   26.15    2.37     8.54
Table 3. The genes selected by Group Lasso and Group MCP (√ = selected).

Gene Symbol    gLASSO   gMCP
208033_s_at    √
212242_at      √
211671_s_at    √
216364_s_at    √
205944_s_at    √
214143_x_at    √
217155_at      √
202734_at      √
219720_s_at    √        √
214991_s_at    √        √
214944_at      √
215544_s_at    √
217106_x_at    √
216180_s_at    √
208917_x_at    √
210802_s_at    √        √
221781_s_at    √
55583_at       √
204446_s_at    √
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

