A Group MCP Approach for Structure Identiﬁcation in Non-Parametric Accelerated Failure Time Additive Regression Model

In biomedical research, identifying genes associated with diseases is of paramount importance, yet among the multitude of candidate genes only a small fraction is related to any specific disease. Gene selection and estimation are therefore necessary, and the accelerated failure time (AFT) model is often used to address such problems. This article presents a method for structure identification and parameter estimation based on a non-parametric additive accelerated failure time model for censored data. Regularized estimation and variable selection are achieved using the Group MCP penalty. The non-parametric components of the model are approximated using B-spline basis functions, and a group coordinate descent algorithm is employed to fit the model. This approach effectively identifies both linear and nonlinear factors in the model. Under regularity conditions, the Group MCP penalized estimator is consistent and possesses the oracle property: the selected variable set includes the true predictors with probability tending to 1. Numerical simulations and a lung cancer data analysis demonstrate that the Group MCP method outperforms the Group Lasso method in terms of predictive performance, with the proposed algorithm showing faster convergence.


Introduction
In both economic and biological research, it is common that theory does not prescribe a specific functional form for the relationship between predictors and outcomes. For example, in biomedical studies the influence of predictors on survival time can be nonlinear, and fitting a linear model in such cases can result in biased estimates or misleading results. In a non-parametric model, by contrast, the functional shape is determined by the available data, eliminating the need to assume a linear form for the influence of a covariate; non-parametric models therefore offer greater flexibility in fitting data than parametric models. This paper studies the non-parametric accelerated failure time additive regression (NP-AFT-AR) model

t_i = Σ_{j∈S_1} β_j x_ij + Σ_{j∈S_2} f_j(x_ij) + ε_i, 1 ≤ i ≤ n,

where (t_i, x_i1, ..., x_ip), 1 ≤ i ≤ n, is a random sample, t_i is the logarithm of the response variable, that is, the logarithm of survival time, and x_i1, ..., x_ip is a p × 1 vector of covariates. S_1 and S_2 are disjoint and complementary subsets of {1, ..., p}, β_j, j ∈ S_1, are the regression coefficients of the covariates with indices in S_1, and f_j, j ∈ S_2, are unknown functions. The covariates in S_1 have a linear relationship with the mean response, whereas the effect of the other covariates is not determined by a finite number of parameters.
Parametric models require explicit distributional assumptions and tend to overfit when the number of model parameters is excessive. Some are also based on a linearity assumption, making them inadequate for capturing complex nonlinear relationships. On the other hand, parametric models offer clear interpretability of explicit parameters, efficiency, and accurate parameter estimation. This paper therefore leverages these characteristics and explores a hybrid approach that combines parametric and non-parametric components to enhance the adaptability and performance of the model. When the emphasis is on the relationship between t_i and x_ij, j ∈ S_1, which can be approximated by a linear function, this provides enhanced interpretability compared to a purely non-parametric additive model. The random error term ε_i has mean zero and finite variance σ². Assuming that certain components f_j are zero, our main objective is to distinguish the nonzero components from the zero components and to estimate the nonzero components accurately. A secondary objective is to elucidate the functional forms of the nonzero components, thereby suggesting a more concise model. The techniques established here extend readily to the partially linear additive AFT regression model, particularly when some covariates are discrete and not amenable to smoothing techniques such as B-splines. We use the lung cancer data example to demonstrate this extension.
The structure identification method is effective in distinguishing linear variables from nonlinear ones, and numerous scholars have contributed relevant methods. Tibshirani [1] combined the least squares estimation technique introduced by Breiman [2] with minimization of the residual sum of squares under constraints, turning the solution into a continuous optimization problem. This approach is known as the Lasso, where a penalty is applied to select variables and the estimated coefficients are continuously shrunk toward zero to automatically identify important explanatory variables. However, Zhao and Yu [3] and Zou [4] showed that the Lasso may not consistently select the correct model and that the estimated regression coefficients are not asymptotically normal. To address this limitation, Fan and Li [5] proposed the SCAD (smoothly clipped absolute deviation) penalty, which substitutes a quadratic spline penalty function for the Lasso penalty to reduce bias. In linear models, the SCAD method can consistently identify the true model and possesses the oracle property. Nonetheless, the non-convexity of the SCAD penalty makes it challenging to optimize in practice, leading to numerical instability during the solution process. Zhang [6] introduced the non-convex minimax concave penalty (MCP) and developed the MCP penalized likelihood procedure as an alternative. Like SCAD, the MCP replaces the ℓ1 penalty in the Lasso with a quadratic spline penalty function to reduce bias. MCP consistently selects the correct model with probability tending to 1 and provides estimates with the oracle property.
Heller [7] employed a weighted kernel-smoothed rank regression method to estimate the unknown parameters in the AFT model with censored data. Gu [8] introduced an empirical model selection approach for non-parametric components based on the Kullback-Leibler geometry. Schumaker [9] used a Lasso iterative method for selecting parametric covariates, while the non-parametric components were estimated by the sieve method. Johnson [10] extended rank-based Lasso-type estimation, which can encompass the partially linear AFT model. Huang and Ma [11] applied the AFT model to analyze the relationship between gene expression and survival time, using the bridge penalty method for individual-level regularized estimation and gene selection. Long et al. [12] established a risk prediction score through regularized rank estimation in a partially linear AFT model. Wei et al. [13] explored subgroup identification methods based on the Adaptive Elastic Net and the AFT model. Wang and Gao [14] conducted empirical likelihood inference for the AFT model under right-censored data. Cai et al. [15] compared parametric and semiparametric AFT models for clustered survival data. Liu et al. [16] introduced a new semiparametric approach that allows simultaneous selection of important variables, model structure identification, and covariate effect estimation within the AFT model.
Researchers have used different methods for variable selection and parameter estimation. For instance, Fan and Li [5] employed the Newton algorithm to estimate the penalized likelihood function. Cui et al. [17] introduced penalized regression spline approximation and group structure identification within the additive model; however, their approach suffered from computational instability because it relied on truncated power series to approximate the non-parametric components. Huang and Ma [11] proposed a two-step method in which, with a fixed number of predictors, nonzero variables are simultaneously selected and estimated in the additive model, using the Group Lasso in the first stage and the Adaptive Group Lasso in the second. Leng and Ma [18] used the COSSO penalty to handle non-parametric covariate effects in the AFT model. Because the penalty function is non-smooth at the origin, the computation can be challenging, and these methods require substantial time to compute the inverse of the Hessian matrix, especially with high-dimensional covariates. Therefore, this paper employs the group coordinate descent (GCD) algorithm to approximate and estimate the parameters in the non-parametric additive accelerated failure time model. GCD capitalizes on model sparsity, and the algorithm is simple and fast. It closely resembles the standard Newton-Raphson algorithm, but each iteration solves a penalized weighted least squares problem.
Under the assumption that the dimensionality of the covariates is allowed to diverge, this paper rigorously proves that the Group MCP penalized estimator in the non-parametric accelerated failure time model is consistent and possesses the oracle property. The generalized cross-validation criterion is inconsistent for model selection as the sample size tends to infinity, meaning it may select irrelevant variables; the Bayesian Information Criterion (BIC) does not suffer from this issue. BIC, as shown by Golub et al. [19], has the desirable property of selecting the true model with probability tending to 1. Therefore, for structure identification in the non-parametric accelerated failure time model, this study tunes the penalty parameters with the BIC criterion.
The remaining sections of the paper are organized as follows. Section 2 describes the construction of the AFT model with penalized estimation and variable selection based on Group MCP. Section 3 introduces the algorithm and parameter tuning for effective identification of both linear and nonlinear factors in the model. Section 4 proves the selection consistency of Group MCP, i.e., the selected variable set includes the actual predictors with probability approaching 1. Section 5 presents numerical simulations and an empirical analysis demonstrating the method's strong predictive performance, including an application to lung cancer data. Section 6 briefly summarizes the conclusions.

Penalized Estimation and Variable Selection

Method
Let T_i denote the natural logarithm of the i-th survival time and C_i the natural logarithm of the i-th censoring time, and let δ_i = I(T_i ≤ C_i) be the event indicator, which takes the value 1 if the event time is observed and 0 if it is censored. Y_i = min(T_i, C_i) is the logarithm of the minimum of the survival time and the censoring time. The observed data (Y_i, δ_i, x_i), i = 1, ..., n, are assumed to be independent and identically distributed (i.i.d.) samples from (Y, δ, X), where the δ_i and x_i are the respective censoring indicators and covariates. F denotes the distribution of T, and F̂_n is its Kaplan-Meier estimator [20], F̂_n(y) = Σ_{i=1}^n w_ni I(Y_(i) ≤ y), where the w_ni are the Kaplan-Meier weights, computed as w_n1 = δ_(1)/n and w_ni = δ_(i)/(n − i + 1) ∏_{j=1}^{i−1} ((n − j)/(n − j + 1))^{δ_(j)} for i ≥ 2. After transforming T_i into Y_i according to whether T_i ≤ C_i holds, with the other conditions unchanged, the conversion of Formula (1) can be expressed as Formula (2). The following introduces the method of coefficient estimation in Equation (2). Assume that the groups are orthonormal, i.e., X_j'X_k = 0 for j ≠ k and X_j'X_j/n = I_{d_j}; then z = X_j'y/n is the least squares estimate of θ_j, where θ_j is the unknown parameter vector associated with the marker effects [21]. Because X_j'X_j/n = I_{d_j}, the penalized least squares objective can be expressed as 2^{−1}||z − θ||₂² + ρ(||θ||₂; λ, γ). Substituting the linear part of Formula (2) into the additive non-parametric regression model yields Formula (3). To ensure unique identification of the f_j, we assume that E f_j(x_ij) = 0, 1 ≤ j ≤ p. If some of the f_j are linear, Equation (3) reduces to the partially linear additive model (2), and the problem becomes determining which components are linear and which are nonlinear. We therefore decompose each f_j into a linear part and a non-parametric part, f_j(x) = β_{0j} + β_j x + g_j(x), and consider a truncated series expansion for approximating g_j, namely g_j(x) ≈ Σ_{k=1}^{m_n} θ_jk φ_k(x), where the φ_k are B-spline basis functions.
If θ_j1 = ··· = θ_jm_n = 0 for some j ∈ {1, ..., p}, then f_j has a linear form. Based on this, the current task is to ascertain which groups of coefficients θ_jk, j = 1, ..., p, k = 1, ..., m_n, are zero. Let β = (β_1, ..., β_p)' and θ_n = (θ_1n', ..., θ_pn')', where θ_jn = (θ_j1, ..., θ_jm_n)'. Define the penalized least squares criterion.
where ρ is a penalty function with penalty parameter λ ≥ 0 and regularization parameter γ, and u denotes the intercept.
Here ||θ_j||_{A_j} = (θ_j' A_j θ_j)^{1/2} is the norm with respect to the positive definite matrix A_j; a suitable choice of A_j facilitates the computation. With the Kaplan-Meier weights w_ni incorporated, Formula (4) can be rewritten in weighted least squares form. We apply the Group MCP penalty to the penalty term, ρ_γ(t; λ) = λ ∫_0^t (1 − x/(γλ))_+ dx, where γ controls the concavity of ρ and λ is the penalty parameter. Here x_+ denotes the nonnegative part of x, that is, x_+ = x I_{x≥0}. We require λ ≥ 0 and γ > 1. Differentiating ρ_γ(t; λ) yields ρ'_γ(t; λ) = λ(1 − t/(γλ))_+, t ≥ 0. The penalty begins at the same rate as the group Lasso and gradually relaxes the penalization until, when t > γλ, the rate of penalization drops to 0. This offers a spectrum of penalties, recovering the ℓ1 (Lasso) penalty as the special case γ = ∞ and the hard-thresholding penalty in the limit γ → 1+.
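As a quick numerical illustration (the function names below are ours, not the paper's), the MCP and its derivative can be evaluated directly from the closed-form pieces of the integral above:

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty rho_gamma(t; lam) for t >= 0, gamma > 1.

    Equals lam*t - t^2/(2*gamma) on [0, gamma*lam] and is constant
    at gamma*lam^2/2 afterwards (the penalty flattens out).
    """
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_deriv(t, lam, gamma):
    """Derivative rho'_gamma(t; lam) = lam * (1 - t/(gamma*lam))_+ for t >= 0."""
    return lam * np.maximum(0.0, 1.0 - np.asarray(t, dtype=float) / (gamma * lam))
```

At t = 0 the derivative equals λ, matching the Lasso penalization rate; once t exceeds γλ the derivative is 0 and the penalty stays constant, which is the source of the bias reduction.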
The penalty in Equation (4) combines the penalty function ρ_γ(·; λ) with a weighted ℓ2 norm of θ_j. ρ_γ(·; λ) serves as a penalty for individual variable selection; applied to the norm of θ_j, it selects the coefficients of θ_j as a group. This is favorable since the nonlinear components are captured by the coefficients in the θ_j as groups. We term the penalty in Equation (4) the group minimax concave penalty, or simply the group MCP. The penalized least squares estimator is defined by (û_n, β̂_n, θ̂_n) = argmin_{u,β,θ_n} L(u, β, θ_n; λ, γ). The centering constraints are sample analogs of the identifying restriction E f_j(x_ij) = 0, 1 ≤ i ≤ n, 1 ≤ j ≤ p. Let z_ij = (φ_j1(x_ij), ..., φ_jm_n(x_ij))', so that z_ij consists of the centered basis functions at the i-th observation of the j-th covariate. Let Z = (Z_1, ..., Z_p), where Z_j = (z_1j, ..., z_nj)' is the n × m_n design matrix corresponding to the j-th expansion. Let y = (y_1, ..., y_n)', x_i = (x_i1, ..., x_ip)', and X = (x_1, ..., x_n)'. We can then write (β̂_n, θ̂_n) = argmin L(β, θ_n; λ, γ). Here we excluded u from the arguments of L, since the intercept is zero as a result of centering; the constrained optimization problem thus becomes unconstrained.
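The Kaplan-Meier weights w_ni defined earlier in this section admit a direct recursive computation on the sample sorted by Y. Below is a minimal sketch in Python (the helper name `km_weights` is ours) of the standard Stute form, w_n1 = δ_(1)/n and w_ni = δ_(i)/(n − i + 1) ∏_{j=1}^{i−1}((n − j)/(n − j + 1))^{δ_(j)}:

```python
import numpy as np

def km_weights(y, delta):
    """Kaplan-Meier (Stute) weights.

    y     : observed log event/censoring times, length n
    delta : event indicators (1 = observed, 0 = censored)
    Returns (w, order): weights aligned with y sorted ascending,
    plus the sorting permutation.
    """
    n = len(y)
    order = np.argsort(y, kind="stable")
    d = np.asarray(delta, dtype=float)[order]
    w = np.zeros(n)
    prod = 1.0  # running product of ((n - j)/(n - j + 1))**delta_(j) over j < i
    for i in range(1, n + 1):  # 1-based index over order statistics
        w[i - 1] = d[i - 1] / (n - i + 1) * prod
        prod *= ((n - i) / (n - i + 1)) ** d[i - 1]
    return w, order
```

With no censoring (all δ_i = 1) every weight reduces to 1/n, recovering the ordinary unweighted least squares criterion; censored observations receive weight zero, with their mass redistributed to later event times.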

Penalized Profile Least Squares
The penalized profile least squares approach is used to compute (β̂_n, θ̂_n). For any given θ_n, the β that minimizes L satisfies X'(y − Xβ − Zθ_n) = 0, so that β = (X'X)^{−1}X'(y − Zθ_n). Let Q = I − X(X'X)^{−1}X', where X(X'X)^{−1}X' is the projection matrix onto the column space of X. The profile objective function of θ_n then becomes L(θ_n; λ, γ). We use A_j = n^{−1} Z_j'Q Z_j; this choice of A_j standardizes the covariate matrices associated with the θ_nj and leads to an explicit expression for computation in the group coordinate descent algorithm described below. For any given (λ, γ), the penalized profile least squares estimator of θ_n is defined by θ̂_n = argmin_{θ_n} L(θ_n; λ, γ), which we compute using the group coordinate descent algorithm. The set of covariates estimated to have a linear form in regression model (1) is denoted Ŝ_1 ≡ {j : θ̂_nj = 0}. We then obtain ĝ_nj(x) = 0 for j ∈ Ŝ_1, with ĝ_nj given by the fitted spline expansion for j ∉ Ŝ_1, and ĝ(x) = (ĝ_1(x), ..., ĝ_p(x))'. The estimator of the coefficients of the linear components is β̂_n1 = (β̂_j : j ∈ Ŝ_1), given by β̂_n1 = (X_(1)'X_(1))^{−1} X_(1)'(y − Σ_{j∉Ŝ_1} ĝ_nj). Once the coefficients of the linear and nonlinear parts are identified and estimated, the structure identification of the non-parametric components can be incorporated into the AFT model.
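The profiling step amounts to a single projection in code. A minimal sketch (function name ours), assuming X has full column rank so that X'X is invertible:

```python
import numpy as np

def profile_out(X, Z, y):
    """Profile out the linear-part design X.

    Q = I - X (X'X)^{-1} X' projects onto the orthogonal complement of
    the column space of X; minimizing over beta for fixed theta reduces
    the objective to a least-squares problem in (Q Z, Q y).
    """
    n = X.shape[0]
    Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    return Q @ Z, Q @ y
```

Because Q annihilates the columns of X, the profiled problem no longer involves β; once θ̂_n is found, β̂ is recovered by ordinary least squares of y − Zθ̂_n on X.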

Computation Algorithm
Assume the groups are orthonormal, i.e., X_j'X_k = 0 for j ≠ k and X_j'X_j/n = I_{d_j}, and let z = X_j'y/n be the least squares estimate of θ_j. Using the group soft-thresholding operator S(z; t) = (1 − t/||z||₂)_+ z, the group coordinate descent solution for the group Lasso is θ̂_gLASSO(z; λ) = S(z, λ). When γ > 1, the group MCP solution can be expressed as θ̂_gMCP(z; λ, γ) = (γ/(γ − 1)) S(z, λ) if ||z||₂ ≤ γλ, and θ̂_gMCP(z; λ, γ) = z if ||z||₂ > γλ. Thus {θ̂_gMCP(·; λ, γ) : 1 < γ ≤ ∞} forms a family of firm-thresholding operators that approaches the hard-thresholding rule as γ → 1+ and reduces to the soft-thresholding (group Lasso) rule at γ = ∞.
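These closed-form solutions are straightforward to code. A sketch of the two operators (function names ours, assuming an orthonormalized group design as above):

```python
import numpy as np

def group_soft(z, t):
    """Group soft-thresholding S(z; t) = (1 - t/||z||_2)_+ z  (group Lasso rule)."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.zeros_like(z)
    return max(0.0, 1.0 - t / norm) * z

def group_mcp_threshold(z, lam, gamma):
    """Firm-thresholding solution of the single-group MCP problem (gamma > 1).

    Rescales the soft-thresholded vector by gamma/(gamma-1) inside the
    shrinkage region and leaves z untouched once ||z||_2 > gamma*lam.
    """
    if np.linalg.norm(z) <= gamma * lam:
        return (gamma / (gamma - 1.0)) * group_soft(z, lam)
    return np.array(z, dtype=float, copy=True)
```

Note how the operator interpolates between soft and hard thresholding: small-norm groups are set exactly to zero, intermediate norms are shrunk less aggressively than under the group Lasso, and large-norm groups pass through unpenalized.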
The group coordinate descent algorithm is used to compute θ̂_n in this paper. The GCD algorithm is a natural extension of the standard coordinate descent algorithm [22,23] commonly used in optimization problems involving convex penalties such as the Lasso. It optimizes the objective function one group at a time, cycling through all groups iteratively until convergence is achieved. It is particularly well suited for computing θ̂_n because it offers a straightforward closed-form solution for a single-group model, as presented in (8) below. Write A_j = R_j'R_j for an m_n × m_n upper triangular matrix R_j obtained via the Cholesky decomposition.
In particular, when γ = ∞ we have b̂_j,GL = (1 − √m_n λ/||η_j||)_+ η_j, which is the group Lasso estimate for a single-group model. Based on these expressions, the GCD algorithm can be implemented as follows. Suppose the group coefficients b_k, k ≠ j, are given, and we want to minimize L with respect to b_j. Define L_j(b_j; λ, γ) = (1/2n)||y − ỹ_j − Z_j b_j||² + ρ_γ(||b_j||; √m_n λ), where ỹ_j = Σ_{k≠j} Z_k b_k, and let η_j = n^{−1} Z_j'(y − ỹ_j). Let b̂_j denote the minimizer of L_j(b_j; √m_n λ, γ); when γ > 1, b̂_j = M(η_j; √m_n λ, γ), the single-group thresholding rule in (8), which is used to update one group at a time for any given (λ, γ). The proposed GCD algorithm is as follows. Initialize b_j^(0), j = 1, ..., p, set ŷ = Σ_{j=1,...,p} Z_j b_j^(0), and initialize the residual vector r = y − ŷ. For s = 0, 1, ..., carry out the following until convergence: for j = 1, ..., p, (1) compute η_j from the current residuals, (2) update b_j via the single-group thresholding rule, and (3) update r. The final step ensures that r always holds the current values of the residuals. While the objective function is not necessarily convex, it is convex with respect to an individual group when the coefficients of all other groups are held fixed.
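The cycle above can be sketched in a few lines. This is our illustrative Python rendering (not the authors' implementation), assuming each group design Z_j has already been orthonormalized so that Z_j'Z_j/n = I, which is what makes the per-group update a single thresholding step:

```python
import numpy as np

def _firm_threshold(z, lam, gamma):
    # single-group group-MCP solution under an orthonormalized design (gamma > 1)
    norm = np.linalg.norm(z)
    if norm <= gamma * lam:
        shrink = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        return (gamma / (gamma - 1.0)) * shrink * z
    return np.array(z, dtype=float, copy=True)

def gcd_group_mcp(Z_list, y, lam, gamma, max_iter=500, tol=1e-8):
    """Group coordinate descent for group-MCP penalized least squares."""
    n = len(y)
    b = [np.zeros(Z.shape[1]) for Z in Z_list]
    r = np.array(y, dtype=float, copy=True)   # residuals y - sum_j Z_j b_j
    for _ in range(max_iter):
        max_change = 0.0
        for j, Z in enumerate(Z_list):
            m = Z.shape[1]
            eta = b[j] + Z.T @ r / n          # partial least-squares estimate for group j
            b_new = _firm_threshold(eta, np.sqrt(m) * lam, gamma)
            r -= Z @ (b_new - b[j])           # keep residuals current
            max_change = max(max_change, float(np.max(np.abs(b_new - b[j]))))
            b[j] = b_new
        if max_change < tol:
            break
    return b
```

Maintaining the residual vector r incrementally is what keeps each sweep cheap: every group update costs only two matrix-vector products with Z_j, with no Hessian inversion.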

Tuning Parameter Selection
Methods such as AIC, BIC, and generalized cross-validation (GCV) are widely used for model selection. Let L(·) be the likelihood function, let ||·||_q denote the L_q norm of a vector, and let P_λ(·) be a penalty function with parameter λ > 0. Penalized structure identification selects the important variables by finding the extremum of the objective function (1/n)L(β) − Σ_{j=1}^p P_λ(|β_j|). Tibshirani [1] used the L1 norm as the penalty function to obtain the Lasso. The AIC criterion addresses the over-fitting problem in which the value of the model likelihood increases with the number of parameters, where AIC = −2log(L) + 2k; the BIC criterion penalizes the number of parameters more strongly, where BIC = −2log(L) + k ln(n). L is the maximum of the likelihood function and k is the number of parameters in the model. When λ → 0, β̂ approaches the ordinary least squares estimate; when λ → ∞, almost only the penalty term remains in the selection criterion. We therefore use the faster BIC method to select the parameters of each concave penalty model. The BIC criterion is BIC(λ, d_n) = log(RSS_{λ,d_n}) + log(n)·df_{λ,d_n}/n, where RSS is the residual sum of squares and df is the number of variables selected for a given (λ, d_n). d_n is selected from an increasing sequence with multiple nodes, and for each given d_n, λ is selected from a sequence of length 100 whose maximum is λ_max = max_{1≤j≤p} ||Z_j'Y||₂/√d_n, where Z_j is the n × d_n matrix for covariate X_j, j = 1, ..., p, and whose minimum is 0.01 λ_max.
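A sketch of the tuning grid and score follows (helper names ours; the geometric spacing of the λ sequence is our implementation choice, since the text fixes only the endpoints λ_max and 0.01 λ_max and the grid length of 100):

```python
import numpy as np

def bic_score(rss, df, n):
    """BIC(lambda, d_n) = log(RSS) + log(n) * df / n, as used for tuning."""
    return np.log(rss) + np.log(n) * df / n

def lambda_grid(Z_list, y, d_n, length=100):
    """Grid from lambda_max = max_j ||Z_j' y||_2 / sqrt(d_n) down to 0.01*lambda_max."""
    lam_max = max(np.linalg.norm(Z.T @ y) / np.sqrt(d_n) for Z in Z_list)
    return np.geomspace(lam_max, 0.01 * lam_max, num=length)
```

λ_max is chosen so that the largest per-group score just meets the penalty threshold at the top of the grid, which guarantees the fully sparse model is included; one then fits along the grid and keeps the (λ, d_n) pair minimizing the BIC score.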

Theoretical Properties of Group MCP
Let |A| denote the cardinality of any set A ⊆ {1, ..., p}, X_A = (X_j, j ∈ A), and Σ_A = X_A'WX_A/n, where X is the n × p covariate matrix and W = diag(nw_1, ..., nw_n). Let β_0 = (β_01, ..., β_0p)' be the true regression coefficient vector, A_1 = {j : β_0j ≠ 0} the set of nonzero regression coefficients, and q = |A_1| the number of elements in A_1. The following conditions are assumed:
(C1) E g(x_j) = 0, and there exist constants C_1 and C_2 such that the density function η_j(x) of x_j satisfies 0 < C_1 ≤ η_j(x) ≤ C_2 < ∞ on its support.
(C2) The observations i = 1, ..., n are independent and identically distributed (i.i.d.), the error terms ε_1, ..., ε_n are i.i.d. N(0, σ²), and there exist constants K_1, K_2 > 0 such that the tails of the error distribution satisfy a sub-Gaussian bound for all x ≥ 0.
(C3) The error terms (ε_1, ..., ε_n) are independent of the Kaplan-Meier weights (w_1, ..., w_n), and there is a constant M such that |X_ij| ≤ M for any 1 ≤ i ≤ n, 1 ≤ j ≤ p; that is, the covariates are bounded.
(C4) The covariate matrix satisfies the sparse Riesz condition (SRC): there exist 0 < c_* < c* < ∞ such that, with probability converging to 1, the eigenvalues of Σ_A lie in [c_*, c*] for every A with |A| ≤ q*, where q* = (3 + 4C)q and C = c*/c_*.
Condition (C1) ensures that the model is sparse even when the number of covariates is large; that is, the number of covariates with nonzero coefficients can be controlled to a small number. Condition (C2) ensures that the tail probability assumption remains valid under high-dimensional linear regression. Condition (C3) guarantees the sub-Gaussian behavior of the model even when the data are censored. Condition (C4) ensures that the SRC condition holds, i.e., the eigenvalues of X'WX/n stay between c_* and c*, so that any submodel of dimension smaller than q* is identifiable. Here β̂ = (β̂_1, ..., β̂_p)' denotes the estimated coefficient vector. Using the basis function expansion of g_nj(x), let S_1 = {j : ||g_nj||_2 = 0} = {j : ||θ_nj||_2 = 0}, and let q = |S_1| be the cardinality of S_1, the number of linear components in the AFT regression model. Define θ̃_n as the minimizer of the profile criterion with the coefficient groups in S_1 constrained to zero. This is the oracle estimator of θ_0n under the assumption that the identity of the linear components is known. It is worth noting that the oracle estimator cannot be computed in practice, as S_1 is unknown; nevertheless, we employ it as the reference point for evaluating our proposed estimator. In parallel with the actual estimates outlined in Section 2.2, define the oracle estimators g̃_nj(x) = 0, j ∈ S_1, and the oracle estimator of the coefficients of the linear components as β̃_n1 = (X_(1)'X_(1))^{−1} X_(1)'(y − Σ_{j∈S_2} f_j(x)). Without loss of generality, suppose that θ_0n = (O_{qm_n}', ...)', where O_{qm_n} is a (qm_n)-dimensional vector of zeros.
Let θ_* = min_{j∉S_1} ||θ_0nj|| denote the minimal coefficient norm in the B-spline expansions of the nonlinear components. Consider a nonnegative integer k and 0 < α ≤ 1 such that d = k + α > 0.5, and define G as the set of functions g on [0, 1] whose k-th derivative g^(k) exists and satisfies a Lipschitz condition of order α. Theorem 1. Suppose √m_n γ is less than the smallest eigenvalue of Z'QZ/n and conditions (C1)-(C3) hold. Then P(θ̂_n = θ̃_n) → 1, and consequently P(Ŝ_1 = S_1) → 1, P(β̂_n1 = β̃_n1) → 1, and P(||f̂_nj − f̃_nj||_2 = 0, j ∈ S_2) → 1. Hence, under the conditions of Theorem 1, the proposed estimator can effectively differentiate between linear and nonlinear components with high probability. Additionally, the proposed estimator exhibits the oracle property: except on events of vanishingly small probability, it coincides with the oracle estimator obtained under knowledge of the identities of the linear and nonlinear components.
Theorem 2 provides the convergence rate of the proposed estimator within the non-parametric additive model, encompassing partially linear models as special cases. In particular, if each component is assumed twice differentiable (d = 2) and m_n is chosen accordingly, the estimator attains the optimal convergence rate in non-parametric regression. We now explore the asymptotic distribution of β̂_n1. Denote H_j = {h_j = (h_jk : k ∈ S_1) : E h_jk²(u) < ∞}, j ∈ S_2. Each element of H_j is a |S_1|-vector of square-integrable functions with mean zero.
Let the sum space be H = {h = Σ_{j∈S_2} h_j : h_j ∈ H_j}. The projection of the centered covariate vector x_(1) − E x_(1) ∈ R^q onto the sum space H is defined to be (h*_1, ..., h*_q). The orthogonal projection h* onto H is well-defined and unique, and each individual component h*_j is also well-defined and unique. Theorem 3. Assume the conditions stated in Theorem 1 and condition (C4), and that the matrix A is non-singular. Then β̂_n1 is asymptotically normal, with the same limiting normal distribution as the oracle estimator β̃_n1. Theorem 3 thus provides sufficient conditions under which the proposed estimator of the linear components attains the oracle limiting distribution.

Numerical Simulation

Suppose that the first q additive components are important functions and the remaining p − q are unimportant. Table 1 displays simulation results based on 1000 replications. The columns provide the following information: the average number of selected nonlinear components (NL), the average model error (ER), the percentage of occasions on which the correct nonlinear components are included in the selected model (IN%), and the percentage of occasions on which the exact nonlinear components are chosen (CS%) in the final model. To compare the computational efficiency of Group Lasso and Group MCP, running time is reported in minutes (Time). The standard errors corresponding to these values are enclosed in parentheses. The Group MCP penalty outperforms the Group Lasso in terms of both IN% and CS%. As the sample size increases from 100 to 500, both methods
exhibit improved performance in terms of including all the nonlinear components (IN%) and selecting the exact correct model (CS%). This improvement is expected, as larger sample sizes provide more information about the underlying model. The computational efficiency of Group MCP also surpasses that of Group Lasso. Table 2 reports the number of times each component is estimated as a nonlinear function; it shows that the Group MCP method distinguishes between linear and nonlinear functions more accurately than the Group Lasso. Additionally, the Group MCP penalty method yields smaller mean squared errors, indicating more accurate estimation. The research demonstrates that the proposed Group MCP approach is effective at distinguishing linear from nonlinear components in simulated models, thereby enhancing model selection and estimation accuracy.

Lung Cancer Data Example
This study is based on survival analysis using the survival times of 442 lung cancer patients and the expression data of 22,283 genes extracted from tumor samples. These data are available from the official website of the National Cancer Institute (http://cancergenome.nih.gov/) (accessed on 12 November 2023). In the original data, a two-column matrix denoted T represents the survival data: the first column contains survival time in months, while the second column serves as an indicator function where 1 represents death and 0 represents survival. The measured gene expression data are represented as X, with 22,283 gene expressions. The objective of this study is to identify covariates with nonlinear effects on survival time.
Due to the high dimensionality of the original data (p = 22,283, n = 442), it is necessary to reduce the data from high-dimensional to low-dimensional. The null hypothesis is that the correlation coefficient between an independent variable and the dependent variable equals 0; the alternative hypothesis is that it does not. An R program is used to calculate the p-value of the correlation test between each gene expression and survival time. When the p-value is less than the critical value, the null hypothesis is rejected in favor of the alternative, indicating a significant correlation between the independent variable and the dependent variable; a smaller p-value provides stronger evidence of association between gene expression and survival time. In this study, the p-values of the independent variables are computed and sorted in ascending order, and the top 50 independent variables with the smallest p-values are selected as input variables. The remaining gene expressions are discarded, achieving an initial dimensionality reduction. As a result, the original data are reduced to lower-dimensional data (p = 50, n = 442), from which covariates with nonlinear effects on survival time are then identified.
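The screening step can be reproduced without a t-distribution CDF: for a common sample size, the p-value of the Pearson correlation test is a monotone function of |r|, so ranking genes by |r| yields the same top-50 set as ranking by p-value. A minimal sketch in Python rather than the paper's R (function name ours):

```python
import numpy as np

def screen_by_correlation(X, y, k=50):
    """Marginal screening: keep the k covariates most correlated with y.

    Ranking by |Pearson r| is equivalent to ranking by the correlation-test
    p-value when all covariates share the same sample size n.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # columnwise Pearson correlations; small epsilon guards constant columns
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    keep = np.sort(np.argsort(-np.abs(r))[:k])
    return keep, r
```

Applied to the lung cancer matrix this would map X of shape (442, 22283) to the (442, 50) screened design on which the B-spline expansion and Group MCP fit are then carried out.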
Figure 3 displays the frequency distribution histograms of four randomly selected gene expressions, indicating that the distributions of all four are skewed. Given the skewed data, this study adopted a non-parametric additive AFT model, with B-spline basis functions used to expand each covariate in the non-parametric part. The Group MCP method was employed to select and shrink the coefficients of the B-spline basis functions, ultimately identifying gene expressions with nonlinear effects on survival time. Furthermore, Table 3 compares the results selected by the Group Lasso and Group MCP penalization methods. Under the Group Lasso penalization, all gene symbols were selected, indicating a tendency to over-select nonlinear variables; Group MCP outperformed Group Lasso in selecting nonlinear variables. Genes 219720_s_at, 214991_s_at, and 210802_s_at were selected simultaneously, indicating that these three gene expressions are nonlinear variables. The three selected genes are associated with lung cancer research and can potentially be used to identify cancer biomarkers, understand tumor biology, and develop treatment strategies. A comprehensive assessment of the significance of these specific genes in cancer research requires further experimentation and literature study, supported by specialized knowledge in cancer biology and experimental data; this represents a direction for future research.
In Table 3, √ indicates that a gene has been selected. All gene symbols are selected by the Group Lasso penalty, suggesting that Group Lasso tends to over-select nonlinear variables, potentially including some variables without true nonlinear effects, whereas Group MCP offers a more effective approach to identifying genes with nonlinear relationships to survival time. The genes 219720_s_at, 214991_s_at, and 210802_s_at are selected by all penalty methods; this consistent selection across different penalties strengthens the confidence that these three gene expressions have nonlinear effects on survival time and underscores the strong performance of the Group MCP penalty in identifying nonlinear relationships in high-dimensional survival data.

Concluding Remarks
This paper introduces a semi-parametric regression pursuit method for distinguishing between linear and nonlinear components in semi-parametric partially linear models. The approach enables data-adaptive determination of the parametric and non-parametric components of the semi-parametric model, in contrast to the standard semi-parametric inference approach, where the parametric and non-parametric components are pre-specified before analysis. The study demonstrated that the proposed method possesses the oracle property: with high probability, it performs as well as the standard semi-parametric estimator computed under knowledge of the model structure. A simulation study confirmed the effectiveness of the proposed method in finite samples. It is worth noting that the semi-parametric regression pursuit method is primarily applied to partially linear models in which the number of covariates (p) is less than the number of observations (n). Genomic datasets, however, may be higher-dimensional (p > n). When p > n and the model is sparse, so that the number of significant covariates is much smaller than n, it may be necessary to first perform dimensionality reduction; once the dimension is reduced, the proposed semi-parametric regression pursuit method can be applied effectively to distinguish linear from nonlinear components.

Figure 1. Simulation of one linear and three nonlinear B-spline estimates.


Figure 2. Simulation of three linear and three nonlinear B-spline estimates.


Figure 3. The frequency distribution histogram of four arbitrarily selected genes.


Table 1. The performance of Group Lasso and Group MCP.

Table 2. Mean squared error of important functions.

Table 3. The genes selected by Group Lasso and Group MCP.
