A Penalization Method for Estimating Heterogeneous Covariate Effects in Cancer Genomic Data

In high-throughput profiling studies, extensive efforts have been devoted to searching for biomarkers associated with the development and progression of complex diseases. The heterogeneity of covariate effects across subjects has been noted in the literature. In this paper, we consider a scenario where the effects of covariates change smoothly across subjects, who are ordered by a known auxiliary variable. To this end, we develop a penalization-based approach, which simultaneously selects important covariates and estimates their subject-specific effects on the outcome variable. We demonstrate that, under appropriate conditions, our method achieves selection and estimation consistency. Simulations demonstrate its superiority over several competing methods. Furthermore, applying the proposed approach to two The Cancer Genome Atlas datasets leads to better prediction performance and higher selection stability.


Introduction
The tremendous development of high-throughput sequencing techniques allows for the generation of massive genomic data, e.g., gene expressions and Single-Nucleotide Polymorphisms (SNPs). These data provide an unprecedented opportunity to uncover biomarkers associated with outcomes such as the development and progression of complex diseases, e.g., cancers and type II diabetes. Numerous studies on this topic have hitherto been carried out. However, most existing studies assume that a covariate has an identical effect on the outcome variable for all subjects, which is often unrealistic in practice. For example, Ford et al. [1] found that the risk of breast and ovarian cancers in BRCA2 mutation carriers increases with age. Another example is that the effects of some genes in the nicotinic 15q25 locus on lung cancer risk are mediated by nicotine dependence [2]. These findings suggest that the effects of a specific covariate can be heterogeneous, and discrepancies in covariate effects or covariate-outcome associations may arise from differences in clinical characteristics and other traits across subjects. As such, ignoring such effect heterogeneity in genomic data analysis can result in biased estimates and misleading inferences.
The most commonly used strategy for handling heterogeneity is subgroup analysis, under which subjects form subgroups and each subgroup has unique covariate-outcome associations. A number of approaches have been proposed, such as the finite mixture model [3-5] and penalization-based approaches, including concave fusion penalization [6,7] and C-Lasso [8]. However, these approaches assume that the effects of covariates are the same within each subgroup. As suggested by the literature, covariate (e.g., genetic) effects are typically associated with clinical measures (e.g., age and number of cigarettes smoked per day), which are often continuous variables. As such, in some applications, covariate effects are more likely to vary smoothly rather than being locally constant within each subgroup.
In this study, we focus on a scenario where the subjects can be ordered by an auxiliary variable (see Section 2 for details). We consider a linear regression model with heterogeneous covariate effects by allowing the regression coefficients to vary smoothly across subjects. We then propose a novel penalization approach to capture the smooth changes of the coefficients. Under this approach, a "spline-lasso" penalty is imposed on the second-order derivatives of the coefficients to encourage smoothness in the coefficients' changes. Additionally, we introduce a penalty of the group Lasso form to accommodate the high dimensionality of genomic data (i.e., the number of genes is larger than the sample size) and to select the relevant covariates.
Our work is related to varying-coefficient models, a class of classical semiparametric models, which treat the coefficients as functions of certain characteristics and use various nonparametric smoothing techniques, such as spline-based methods [9,10] and local polynomial smoothing [11], to approximate the unknown coefficient functions. Examples include the high-dimensional varying-coefficient models proposed by Wei et al. [12], Xue and Qu [13], Song et al. [14], and Chen et al. [15], the finite mixture of varying-coefficient models [16], and the additive varying-coefficient model for nonlinear gene-environment interactions [17]. Compared to these varying-coefficient regression approaches, the proposed method imposes fewer requirements on the distribution of the auxiliary variable and better estimates the regression coefficients when the auxiliary variable is unevenly distributed (Figure 1). Moreover, the proposed approach is related to, but also significantly advances, several existing lines of work. First, it advances existing genomic marker identification studies by accounting for the heterogeneity of covariate effects. Second, it advances gene-environment interaction analysis methods [18,19] by allowing more flexibility in the relationship pattern (not limited to a prespecified form) between covariate (genetic) effects and environmental factors (auxiliary variables). Finally, it advances existing multiple change-point regression studies [20,21] by tracking gradual changes of the coefficients rather than abrupt ones (Figure 1). Overall, this approach is practically useful for analyzing genomic data and may lead to important new findings.
To further illustrate the differences between the proposed method and varying-coefficient models and multiple change-point regression methods, consider a simple simulation example with N = 200, p = 10, and 3 significant variables. The coefficient of each variable varies among individuals and is a function of a certain environmental factor, e.g., age. Suppose that age is unevenly distributed among subjects, with subjects concentrated between the ages of 25-35 and 45-55, as indicated by the denser rugs in Figure 1. We compare the proposed method with the varying-coefficient model [12] and the change-point regression model [22]. The simulation results show that the compared methods perform relatively poorly (root mean squared error (RMSE) = 4.853 and root prediction error (RPE) = 1.325 for the varying-coefficient model; RMSE = 3.158 and RPE = 1.242 for the change-point regression model), while the proposed method consistently identifies the true coefficient paths (RMSE = 0.954, RPE = 0.893).
The rest of this paper is organized as follows. In Section 2, we introduce the proposed approach, present the algorithm, and discuss some theoretical properties. Simulations are shown in Section 3. Section 4 presents the analysis of two The Cancer Genome Atlas (TCGA) datasets. Section 5 concludes the paper. The technical details of the proofs and additional numerical results are provided in Appendices A-D.

Materials and Methods
Assume a dataset consists of N independent subjects. For subject n, let y_n and X_n = (X_n1, X_n2, . . . , X_np)⊤ denote the response variable and the p-dimensional vector of genomic measurements, respectively. In our numerical study, we analyze gene expression data; the proposed approach can also be applied to other types of omics measurements. Assume the data have been standardized and consider a heterogeneous linear regression model given by:

y_n = X_n⊤ β_n + ε_n, n = 1, 2, . . . , N, (1)

where the ε_n's are independent and identically distributed (i.i.d.) random errors and β_n = (β_n1, β_n2, . . . , β_np)⊤ are the regression coefficients. Different from the standard regression model, which imposes an identical β on all subjects, model (1) allows β_n to be subject-specific. Here, we consider linear regression, which is standard for modeling the relationship between covariates and outcomes. The proposed approach is applicable to other models, for example, the accelerated failure time (AFT) model; more details are provided in Appendix A. In this paper, we focus on a scenario where the heterogeneity analysis of covariate effects can be conducted with the aid of an auxiliary variable whose measurement is available for all N subjects. Specifically, we assume that the subjects have been sorted according to the auxiliary variable's values. Further, the effect of a relevant covariate on the response variable is expected to vary smoothly across subjects. The studies reviewed in Section 1 and other similar ones suggest that covariate (e.g., genetic) effects are usually associated with clinical traits. As such, we choose an auxiliary variable with known interactions with clinical variables; see the examples in the data analysis section (Section 4) for details.
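To make model (1) concrete, the following sketch simulates data with subject-specific coefficients that vary smoothly with an auxiliary variable. The coefficient curves (a sine and a linear trend) are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 10                      # subjects and covariates

# Auxiliary variable (e.g., age); subjects are sorted by it.
age = np.sort(rng.uniform(20, 70, size=N))

# Subject-specific coefficients that vary smoothly with the auxiliary
# variable: a sine curve for covariate 0, a linear trend for covariate 1,
# zero for the rest (illustrative choices only).
beta = np.zeros((N, p))
beta[:, 0] = np.sin(age / 10.0)
beta[:, 1] = 0.02 * age

X = rng.standard_normal((N, p))
eps = rng.standard_normal(N)

# Model (1): each subject n has its own coefficient vector beta_n.
y = np.einsum("np,np->n", X, beta) + eps
```

Each response y_n depends on its own row of `beta`, which is what distinguishes model (1) from a homogeneous regression with a single shared coefficient vector.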

Remark 1.
In subgroup-level heterogeneity analysis, an auxiliary variable may not be needed. However, a subject-level heterogeneity analysis is intractable without the auxiliary variable due to non-identifiability. To date, the existing methods that can handle this type of heterogeneity, for example, varying-coefficient and interaction analyses, all require an auxiliary variable. Note that, in our analysis, the auxiliary variable does not need to be "precise." Consider, for example, a sample of size 5. Auxiliary variable A has the values 1, 3, 7, 2, and 9 for the five subjects and auxiliary variable B has the values −0.8, 0.4, 0.5, 0.0, and 3. Although the values of A and B do not match, the proposed method leads to the same estimated covariate effects whichever variable is used as the ordering index, since the two induce the same subject ordering.
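The point of Remark 1, that only the induced ordering of subjects matters, can be checked directly on the two example auxiliary variables:

```python
import numpy as np

A = np.array([1, 3, 7, 2, 9])
B = np.array([-0.8, 0.4, 0.5, 0.0, 3.0])

# The proposed method uses the auxiliary variable only to sort subjects,
# so two variables inducing the same ordering are interchangeable.
order_A = np.argsort(A)
order_B = np.argsort(B)
assert (order_A == order_B).all()   # both give subject order 1, 4, 2, 3, 5
```

Since the sorted subject sequence is identical under A and B, every downstream quantity (the second-difference penalty in particular) is unchanged.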
Rationale. In (2), the first term is the lack-of-fit measure, summed over the N individual subjects. The first penalty is the group Lasso on β. Here, a "group" refers to the regression coefficients of the N subjects for a specific covariate. This penalty accommodates the high dimensionality of the data and allows for the regularized estimation and selection of relevant covariates. The "all-in-all-out" property of the group Lasso leads to a homogeneous sparsity structure, that is, the N subjects share the same set of important covariates. To obtain an oracle estimator, we add a weight ω_j to the sparsity penalty, which is determined by an initial estimator: assuming that an initial estimator β̃_j is available, let ω_j = 1/‖β̃_j‖₂. The main advancement is the second penalty, which has a spline form. It penalizes the (discrete) second-order derivatives of the coefficients β_nj to promote the smoothness of the coefficients between adjacent subjects. Note that the coefficients of any adjacent subjects are assigned a penalty of the same magnitude, regardless of the distance between the subjects as measured by the auxiliary variable. Different from standard spline-lasso penalties [23], it is imposed on the regression coefficients of different subjects. Furthermore, different from some alternatives that promote first-order smoothness, such as the fusion Lasso [24] and the smooth Lasso [25], this penalty encourages second-order smoothness. Additionally, the quadratic form of this penalty makes it computationally easier than absolute-value-form penalties, such as the Lasso. It is noted that gene-environment interaction analysis can also capture the smooth change of covariate effects over an auxiliary variable (environmental factor). However, the interaction analysis approach requires specifying a parametric form of the relationship between covariate effects and the auxiliary variable, which is not very flexible in practice, in particular for high-dimensional data.
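The objective just described can be sketched as follows. The function name and argument layout are ours; the loss scaling and the exact weighting scheme should be checked against the paper's displayed objective (2).

```python
import numpy as np

def objective(beta, X, y, lam1, lam2, w):
    """Objective (2), sketched: squared-error loss plus a weighted group-lasso
    penalty on each covariate's coefficient path and a spline-type penalty on
    its discrete second differences. `beta` has shape (N, p)."""
    fit = y - np.einsum("np,np->n", X, beta)
    loss = 0.5 * np.sum(fit ** 2)

    # Adaptive group lasso: one group per covariate j across all N subjects.
    group = lam1 * np.sum(w * np.linalg.norm(beta, axis=0))

    # Spline penalty: discrete second-order derivative along the subject axis
    # (subjects are ordered by the auxiliary variable).
    d2 = beta[2:] - 2 * beta[1:-1] + beta[:-2]
    smooth = lam2 * np.sum(d2 ** 2)
    return loss + group + smooth
```

Because the spline term is quadratic, it is differentiable everywhere, which is exactly the computational advantage over absolute-value penalties noted above.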

Computation
Optimization (2) can be realized using a block coordinate descent (CD) algorithm. For each covariate j, its measurements on the N subjects, X_j = (X_1j, X_2j, . . . , X_Nj)⊤, form a group, and the corresponding coefficients β_j are updated simultaneously. The algorithm optimizes the objective function with respect to one group of coefficients and iteratively cycles through all groups until convergence is reached. Let Z_[j] = diag(X_j) denote the sub-matrix of Z corresponding to X_j, which is a diagonal matrix. We denote β_j^(k) as the estimate of β_j in the kth iteration. The proposed algorithm proceeds as follows:

2.
Update k = k + 1. For j ∈ {1, 2, . . . , p}, minimize M(β_j) with respect to β_j. This can be realized by executing the following steps: (a) set the step size t = 1 and compute a gradient step; (b) shrink the step size by t ← 0.8t until the backtracking line-search condition is satisfied; (c) update the estimate of β_j accordingly.

Repeat
Step 2 until convergence is achieved. In our numerical study, convergence is declared when the change in the estimates between two consecutive iterations falls below a small threshold. To speed up the algorithm, we add a momentum term based on the previous iterate β_j^(k−1) in (3) and determine the step size t via the backtracking line-search method. After the algorithm converges, some groups of coefficients are estimated as zeros. To further improve estimation accuracy, in practice, we can remove the covariates with zero coefficients and re-estimate the nonzero coefficients by minimizing objective function (2) without the sparsity penalty. The proposed approach involves two tuning parameters, selected using a grid search and K-fold cross validation with K = 5. Realization. To facilitate data analysis within and beyond this study, we have developed Python code implementing the proposed approach and made it publicly available at https://github.com/foliag/SSA (accessed on 21 March 2022). The proposed approach is computationally affordable: as shown in Figure A1, its computational time grows linearly with the number of features.
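A single group update of the block CD algorithm can be sketched as a proximal-gradient step with backtracking (t ← 0.8t), as in Step 2. This is our simplified reading of the algorithm: the momentum term mentioned above is omitted, and the line-search condition shown is the standard sufficient-decrease test, which may differ in detail from the paper's.

```python
import numpy as np

def second_diff_matrix(N):
    """A: (N-2) x N discrete second-difference operator."""
    A = np.zeros((N - 2, N))
    for i in range(N - 2):
        A[i, i:i + 3] = [1.0, -2.0, 1.0]
    return A

def update_group(beta, j, X, y, lam1, lam2, w):
    """One proximal-gradient update of group j with backtracking (t <- 0.8t);
    a sketch of Step 2, not the paper's exact implementation."""
    N = X.shape[0]
    A = second_diff_matrix(N)

    def smooth_part(bj):
        b = beta.copy(); b[:, j] = bj
        r = y - np.einsum("np,np->n", X, b)
        return 0.5 * np.sum(r ** 2) + lam2 * np.sum((A @ bj) ** 2)

    bj = beta[:, j]
    r = y - np.einsum("np,np->n", X, beta)
    grad = -X[:, j] * r + 2 * lam2 * (A.T @ (A @ bj))
    t = 1.0
    while True:
        z = bj - t * grad
        # Group soft-thresholding: the whole path is shrunk or zeroed jointly.
        norm = np.linalg.norm(z)
        new = max(0.0, 1 - t * lam1 * w[j] / norm) * z if norm > 0 else z
        # Backtracking: accept once the sufficient-decrease condition holds.
        diff = new - bj
        if smooth_part(new) <= (smooth_part(bj) + grad @ diff
                                + np.sum(diff ** 2) / (2 * t)) + 1e-12:
            return new
        t *= 0.8
```

Cycling this update over j = 1, . . . , p until the estimates stabilize reproduces the structure of the block CD loop; a large λ₁ zeroes out an entire group at once, which is the "all-in-all-out" behavior of the group penalty.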

Statistical Properties
Here, we establish the consistency properties of the proposed approach. We define a new dataset (Ỹ, Z̃), with Ỹ of dimension N + (N − 2)p given by Ỹ = (Y⊤, 0⊤)⊤ and Z̃ of dimension (N + (N − 2)p) × Np formed by stacking Z with a scaled second-difference matrix. Objective function (2) can then be converted into an adaptive group Lasso form. Let β⁰ = ((β⁰₁)⊤, . . . , (β⁰_p)⊤)⊤ be the true parameter values and q the number of nonzero coefficient vectors. Without loss of generality, assume the first q coefficient vectors of β⁰ are nonzero, and let E₁ and E₀ denote the index sets of the nonzero and zero coefficient vectors, respectively. Let J = A⊤A and Σ = (1/N) Z⊤Z + λ₂J, and let τ denote the minimal eigenvalue of the matrix Σ. The following conditions are assumed: (C0) Errors ε₁, ε₂, . . . , ε_N are i.i.d. sub-Gaussian random variables with mean zero. That is, for certain constants 0.5 ≤ t ≤ 1 and K, C ≥ 0, the tail probabilities of ε_n satisfy P(|ε_n| > x) ≤ K exp(−C xᵗ) for all x ≥ 0 and n = 1, 2, . . . , N.
Condition (C0) is the sub-Gaussian condition commonly assumed in the literature [26]. Condition (C1) assumes the measurement matrix is bounded. Similar conditions have been considered by Martinussen and Scheike [27] and Binkiewicz and Vogelstein [28]. Condition (C2) puts a lower bound on the size of the smallest signal and assumes the initial β̃_j is not too small for j ∈ E₁. Similar conditions have been considered by Wei and Huang [29]. Condition (C3) is similar to the assumption made in Case I of Guo et al. [23], which requires Σ to be invertible and the minimal eigenvalue τ to converge to 0 at a rate controlled by λ₂. Condition (C4) places a weak constraint on β⁰, which is satisfied when, for any nonzero coefficient vector β_k (k ∈ E₁), the largest gap between two adjacent components is bounded.
. Then, with a probability converging to one, the stated estimation error bound holds. The proof is provided in Appendix B. If q is not too large and α₂ and τ are not too small, we can find a λ₁ that satisfies both rate requirements simultaneously. It is not difficult to prove that event Ω holds when the marginal regression estimator is used as the initial estimator. As a result, under conditions (C3) and (C4), the gap between β⁰ and β̂ converges to 0. This theorem thus establishes estimation consistency.
The following additional conditions are assumed: (C5) Initial estimators β̃_j are r-consistent for the estimation of certain ξ_j, where ξ_j is an unknown constant vector with bounded maximum norm. Condition (C5) is similar to condition (A2) in Huang et al. [26] and ensures that the weight ω_j is not too small for j ∈ E₀. Condition (C6) restricts the numbers of covariates with zero and nonzero coefficients, the penalty parameters, the minimal eigenvalue of Σ, and the smallest nonzero coefficient. Given all conditions in Theorems 1 and 2, the proof is provided in Appendix C. This theorem establishes the selection consistency of the proposed approach in a high-dimensional setting.

Simulation
We set p = 500. The data are generated from the heterogeneous linear model (1), where the random errors are simulated independently from N(0, 1). We investigate nine scenarios for the coefficients as follows:
Scenario 1. The coefficients are generated from trigonometric functions, for n = 1, 2, . . . , N, where u_j^n = a_j + N/10 · n, a_j ∼ U(0, 0.5).
Scenario 2. The coefficients are generated from exponential functions, where u_j^n = a_j + N/100 · n, a_j ∼ U(0, 0.2).
Scenario 3. The coefficients are generated from logarithmic functions, where u_j^n = a_j + N/20 · n, a_j ∼ U(0.7, 0.9).
Scenario 4. The coefficients are generated from linear functions, where u_j^n = a_j + N/10 · n, a_j ∼ U(0, 1).
Scenario 5. The coefficients are constants: β_j^n = a_j for n = 1, . . . , N and j = 1, . . . , q, where a_j ∼ U(0, 1).
Scenario 6. The coefficients are generated from the four above (trigonometric, exponential, logarithmic, and linear) functions, with each function generating an equal number of coefficients.
Scenario 7. The coefficients are generated from the four above functions, where 40% and 35% of the coefficients are generated from the trigonometric and linear functions, respectively, and 10% and 15% are generated from the exponential and logarithmic functions, respectively.
Scenario 8. The coefficients are generated from the four functions: the trigonometric, exponential, logarithmic, and linear functions generate 35%, 15%, 20%, and 30% of the coefficients, respectively.
Scenario 9. The coefficients are generated as in Scenario 5. We select 40% of the coefficients and, for each, add random perturbations to their values in one or two ranges, where each range includes 20 consecutive subjects.
In Scenarios 1-5, the q coefficients are generated from the same function, whereas in Scenarios 6-9 they are generated from different functions. The coefficients in Scenario 5 are constants, that is, there is no heterogeneity in covariate effects. Some of the coefficients in Scenario 9 do not change smoothly across subjects but have a few discontinuous areas. Figure A2 presents q = 20 nonzero coefficients as a function of N = 200 subjects under the nine scenarios. In the first eight scenarios, the p covariates are generated from a multivariate normal distribution with marginal mean 0 and variance 1. We consider an auto-regressive correlation structure, where covariates j and k have correlation coefficient ρ^|j−k| with ρ = 0.3 and 0.8, corresponding to weak and strong correlations, respectively. In Scenario 9, the p covariates are generated independently from a uniform distribution on (−1, 1). It is noted that the aforementioned nonlinear functions of the regression coefficients are widely used in simulation studies of varying-coefficient models for genomic data [30,31].
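The covariate and coefficient generation just described can be sketched as follows. The AR(1) correlation structure matches the text; the specific trigonometric coefficient path is a generic stand-in for the scenario functions, whose exact forms follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q, rho = 200, 500, 20, 0.3

# AR(1) covariance: cor(X_j, X_k) = rho^|j-k|, as in Scenarios 1-8.
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)

# Smooth coefficient paths for the q relevant covariates; a generic
# trigonometric path stands in for the paper's Scenario 1 functions.
n = np.arange(1, N + 1)
beta = np.zeros((N, p))
for j in range(q):
    a_j = rng.uniform(0, 0.5)
    beta[:, j] = np.sin(a_j + 2 * np.pi * n / N)

# Responses from true model (1) with N(0, 1) errors.
y = np.einsum("np,np->n", X, beta) + rng.standard_normal(N)
```

Setting `rho = 0.8` reproduces the strongly correlated design used to contrast New-Lasso and New-Mar.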
We consider two versions of the proposed approach. One uses the "standard" Lasso to obtain the initial estimator of the coefficients (New-Lasso) and the other uses marginal regression (New-Mar). Both initial estimators are homogeneous, that is, the coefficients are the same for all subjects. To better gauge the proposed approach, we compare it with three alternatives: (a) Lasso, which directly applies the Lasso method to the entire dataset but does not account for the heterogeneity of coefficients across subjects; (b) AdLasso, the group adaptive Lasso in the varying-coefficient model [12]; and (c) IVIS, which uses the independent screening technique for fitting the varying-coefficient model [14]. The last two methods focus on variable selection and estimation in high-dimensional varying-coefficient models, where each nonzero coefficient is assumed to be a smooth function of a known auxiliary variable.
For the proposed approach and its alternatives, we evaluate the variable selection performance by TP (number of true positives) and FP (number of false positives). Estimation and prediction are also evaluated. Specifically, estimation is measured by RMSE (root mean squared error), defined as (1/p) Σ_{j=1}^p ‖β̂_j − β_j‖₂, and prediction is measured by RPE (root prediction error) on independent test data. The proposed approach performs well even under Scenario 9, suggesting that it is robust to perturbations. In most scenarios, New-Lasso outperforms New-Mar when covariates are weakly correlated (ρ = 0.3), but performs worse than New-Mar when covariates are strongly correlated (ρ = 0.8). These results stem from the fact that the Lasso does not handle highly correlated covariates well. In practice, we can select one of the two versions according to the correlations among covariates; examples are provided in Section 4. Lasso identifies a reasonable number of important variables but yields more false positives than the proposed approach. AdLasso shows good performance in variable selection, but it is inferior to the proposed approach under most simulation settings. IVIS has the worst performance among the five approaches.
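The evaluation metrics can be sketched directly. The RMSE follows the definition above; the RPE helper is our plausible reading (root mean squared error on test responses), since its exact display was lost in extraction.

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """TP/FP counts from the group-level support: a covariate is 'selected'
    if its whole coefficient path (a column) is nonzero."""
    sel = np.linalg.norm(beta_hat, axis=0) > 0
    tru = np.linalg.norm(beta_true, axis=0) > 0
    return int(np.sum(sel & tru)), int(np.sum(sel & ~tru))

def rmse(beta_hat, beta_true):
    """RMSE as in the text: (1/p) * sum_j ||beta_hat_j - beta_j||_2."""
    p = beta_true.shape[1]
    return np.sum(np.linalg.norm(beta_hat - beta_true, axis=0)) / p

def rpe(y_test, y_pred):
    """Root prediction error on a test set (assumed definition)."""
    return np.sqrt(np.mean((y_test - y_pred) ** 2))
```

These operate on (N, p) coefficient matrices, so the same calls score homogeneous competitors (whose columns are constant paths) and the subject-specific estimates alike.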
In the evaluation of estimation, the proposed approach again has a favorable performance. We plot the estimated nonzero coefficients as a function of subjects along with 95% pointwise confidence intervals (Figure A3). In Scenario 6 with N = 200, q = 20, and ρ = 0.3, the estimated coefficients are close to the true ones, and the confidence intervals contain the true coefficients for most subjects. However, the estimation results worsen for the coefficients of the first and last few subjects. This is because less information is available to estimate these coefficients than the intermediate ones. This problem can be alleviated by increasing the sample size (Figure A4). Additionally, the proposed approach outperforms the alternatives in terms of prediction under most scenarios.
Overall, the simulations suggest a favorable performance of the proposed approach. It is interesting to note that it performs satisfactorily even under the no-heterogeneity scenario (Scenario 5). Thus, it provides a safe choice for practical data analysis when the degree of heterogeneity in covariate effects is unknown. Results under other simulation settings are similar and are omitted here due to space constraints.

Data Analysis
Here, we apply the proposed approach to two TCGA datasets. As a cancer genomics program initiated by the National Institutes of Health (NIH), TCGA publishes high-quality clinical and genetic data. In our analysis, the data are downloaded from the cBioPortal website (http://www.cbioportal.org/, accessed on 16 January 2021) via the cgdsr R package.

SKCM Data
Cutaneous melanoma (SKCM) is a cancer of the skin cells called melanocytes and leads to the majority of deaths from skin cancers. In our analysis, we are interested in the regulation of Breslow thickness, a measure of the size of melanoma growth, by gene expressions. We use age as the auxiliary variable, which is correlated with melanoma development and progression [32]. After removing records with missing Breslow thickness or age, a total of 228 patients are included in the analysis. The median age is 58 (range: 18-90 years) and the median Breslow thickness is 2.45 (range: 0.28-75). All patients are sorted by age in ascending order. Some patients share the same age, but each age value is shared by only a few (2-8) patients, and the analysis results show that the ordering of patients within the same age has little impact on the identification of important genes and on the effect estimation. Consequently, in the analysis, we order patients with the same age randomly. A total of 20,531 RNAseq gene expression measurements are available; more specifically, the processed level-3 gene expression data are used. Please refer to [33] for detailed information on the generation and processing of the gene expression data. To improve the reliability of the results, we conduct marginal screening to filter out irrelevant genes and include the 400 genes with the lowest p-values in the downstream analysis. The gene expressions are assumed to be connected with the response variable via a linear model.
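The marginal screening step can be sketched as below. Since, at a common sample size, ranking genes by univariate-regression p-value is equivalent to ranking by absolute correlation with the response, this numpy-only sketch uses |correlation|; the paper's exact screening statistic is assumed.

```python
import numpy as np

def marginal_screen(X, y, keep=400):
    """Keep the `keep` genes with the strongest marginal association.
    Equivalent in ordering to keeping the smallest univariate p-values
    when every gene is tested on the same N subjects."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each gene column with the response.
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.argsort(-np.abs(r))[:keep]
```

Applied to the 20,531 RNAseq measurements, this reduces the design to the 400 most promising genes before fitting the penalized model.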
The average correlation coefficient of the 400 genes is 0.07, which is close to the 0.06 observed in the above simulation studies with ρ = 0.3. As such, we adopt the New-Lasso method, which identifies 6 important genes. Figure 2 shows the estimated coefficients for the 6 genes. The changes in the effects of the genes across patients are prominent, which suggests that the heterogeneous model is more appropriate for this dataset. We observe different change patterns for the effects of the 6 genes. Specifically, the effects of genes AOC4P and EDNRB first increase and then decrease; genes CRELD2 and TRIM64 show an increase and then remain steady, while gene SERPINA3 demonstrates the opposite pattern, and the effect of gene OR10GB has a bowl-shaped pattern. The literature suggests that the identified genes are biologically meaningful. For example, gene EDNRB provides instructions for making a protein called endothelin receptor type B, and inherited variations in this gene may be associated with an increased risk of melanomas [34]. Recent studies revealed that gene AOC4P plays critical roles at multiple levels in diverse physiological and pathological processes [35]. Some of the changes in metastatic melanomas were identified in gene SERPINA3, which encodes a protein involved in the regulation of the extracellular matrix [36]. A high SERPINA3 expression correlates with shorter survival [37,38], suggesting that SERPINA3 expression can be used as a prognostic marker in melanoma.
We also apply the alternatives described above. The comparative results are provided in Table A4. The different methods identify different sets of genes. With real data, the true set of important genes is unknown and, thus, it is difficult to directly evaluate the identification and estimation accuracy. To verify the results, we instead evaluate prediction and stability. Specifically, the dataset is split into a training set and a testing set with a 2:1 size ratio. The regression coefficients are estimated on the training set and used to make predictions for the subjects in the testing set. We repeat the process 50 times and calculate the average root prediction errors (RPEs): 0.775 (New-Lasso), 1.072 (Lasso), 1.036 (AdLasso), and 1.393 (IVIS). The proposed approach has the best prediction performance. Moreover, for the proposed approach, we compare the RPEs on the training sets with those on the testing sets and find no significant differences (p-value > 0.5), suggesting that the proposed approach does not produce obvious over-fitting. Additionally, we compute the observed occurrence index (OOI) values to evaluate the stability of the identification results. Figure A6 shows the OOIs of all methods. The proposed approach significantly outperforms the alternatives in terms of identification stability.

Figure 2. Analysis of the SKCM data using the proposed approach: estimated coefficients of the 6 genes for all subjects. The x-axis represents the subjects, and the y-axis represents the coefficient values.
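The repeated-split evaluation and stability measure can be sketched as follows. The OOI is read here as the per-gene selection frequency across repeated splits; the exact definition used in the paper is assumed.

```python
import numpy as np

def repeated_splits(N, n_rep=50, frac=2 / 3, seed=0):
    """Yield (train, test) index arrays for n_rep random 2:1 splits."""
    rng = np.random.default_rng(seed)
    for _ in range(n_rep):
        perm = rng.permutation(N)
        cut = int(round(frac * N))
        yield perm[:cut], perm[cut:]

def occurrence_index(selections, p):
    """Observed occurrence index (OOI), read here as the fraction of splits
    in which each of the p genes was selected (assumed definition)."""
    counts = np.zeros(p)
    for sel in selections:          # sel: indices selected on one split
        counts[np.asarray(sel)] += 1
    return counts / len(selections)
```

A method whose OOI values cluster near 0 or 1 selects the same genes on almost every split, which is the stability property the comparison in Figure A6 measures.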

LUAD Data
Lung adenocarcinoma (LUAD) is a form of non-small cell lung cancer and the most common type of lung cancer. In our analysis, survival time is the response variable. There are a total of 231 patients, sorted by their forced expiratory volume in one second (FEV1), an important measure of lung function. The median follow-up time is 20 months (range: 0.13-232) and the median FEV1 is 83 (range: 1.95-156). A total of 18,325 RNAseq gene expressions are initially available for the analysis. Using the same marginal screening process as described above, the number of gene expressions is reduced to 400.
We adopt the accelerated failure time (AFT) model for the analysis of these censored survival data. The estimation procedure described above can be directly applied to the AFT model (see Appendix A). Because the genes have an average correlation coefficient (0.16) higher than that in the simulation studies with ρ = 0.8 (≈0.13), the New-Mar method is used here. The proposed method identifies 7 genes, whose estimated coefficients are presented in Figure A5.
Extant studies provide biological evidence for the association of the identified genes with lung cancer. For example, AGTR1, the gene encoding angiotensin II receptor type 1, has been extensively studied in human cancers [39] and has shown a strong influence on tumor growth, angiogenesis, inflammation, and immunity [40]. Guo et al. [41] show that the methylation profile of AGTR1 could serve as an effective methylation-based assay for non-small cell lung cancer diagnosis.
Data are also analyzed using the alternative methods. The summary comparison results (Table A4) again suggest that different methods produce different results. With censored survival data, we use the log-rank statistics to measure prediction performance.
Higher log-rank statistics indicate better prediction performance; the proposed approach has an average log-rank statistic of 11.67, compared with 4.43 for Lasso, 5.81 for AdLasso, and 3.08 for IVIS. The OOI results are also presented in Figure A6. The proposed approach again has the highest OOI among all methods.
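The paper does not spell out how the log-rank statistic is formed from predictions; a common construction, sketched below, splits the test subjects into two risk groups (e.g., at the median predicted linear predictor) and computes the standard two-sample log-rank chi-square statistic.

```python
import numpy as np

def logrank_stat(time, event, group):
    """Two-sample log-rank chi-square statistic. `event` is 1 for an observed
    death and 0 for censoring; `group` is a 0/1 risk-group label, e.g., from
    thresholding the predicted linear predictor at its median (assumed setup)."""
    times = np.unique(time[event == 1])
    O1 = E1 = V = 0.0
    for t in times:
        at_risk = time >= t
        n = at_risk.sum()                                  # total at risk
        n1 = (at_risk & (group == 1)).sum()                # group-1 at risk
        d = ((time == t) & (event == 1)).sum()             # deaths at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        O1 += d1                                           # observed in group 1
        E1 += d * n1 / n                                   # expected in group 1
        if n > 1:                                          # hypergeometric var.
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (O1 - E1) ** 2 / V
```

Under this reading, a statistic above the chi-square(1) critical value 3.84 means the two predicted risk groups have significantly different survival, so larger averages (11.67 vs. 3.08-5.81) indicate better risk discrimination.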

Simulation on SKCM Dataset
It has been recognized in some studies that simulated data may be "simpler" than real data. Here, we conduct an additional set of simulations based on the SKCM data analyzed above. Specifically, the observed gene expression data and the estimated coefficients from Section 4.1 are used in the simulation. The results are summarized in Table A5. The proposed method maintains a relative edge over the alternatives, which further supports its effectiveness.

Discussion
The mature application of high-throughput technology has produced a large amount of genomic data. With the rapid development of precision medicine, the heterogeneity of covariate effects has received increasing attention in disease genomic studies. However, most existing studies focus on subgroup-specific effects, meaning the effects are the same within each subgroup, thus neglecting possible varying effects within a subgroup. In this paper, we consider the setting in which the effects of covariates change smoothly across subjects. We propose a novel penalization-based estimation method, which combines a group-lasso penalty with a spline-lasso penalty. It advances subgroup-based studies by capturing the varying effects within each subgroup, and it also advances existing varying-coefficient studies by lowering the requirements on the distribution of the auxiliary variable. We show that, under appropriate conditions, the proposed approach can correctly select important covariates with a probability converging to one and estimates the coefficients consistently. Simulations demonstrated a satisfactory practical performance, and the data analysis led to sensible findings that differ meaningfully from those of alternative methods.
With the proposed regression model, it is impossible to directly estimate the subject-specific covariate effects due to the non-identifiability problem. This is resolved by introducing an auxiliary variable, which can have a biological interpretation. As such, it would be of interest to develop other frameworks that can differentiate between heterogeneous covariate effects in the (partial) absence of an auxiliary variable. Additionally, the data analysis results also warrant further investigation.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. The SKCM and the LUAD datasets can be found here: http://www.cbioportal.org/ (accessed on 16 January 2021), and can be downloaded via the cgdsr R package.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Estimation under the Accelerated Failure Time Model
The AFT model is an alternative to the commonly used Cox model in survival analysis and regresses the logarithm of the survival time on the covariates. Consider a sample set $\{(X_n, Y_n) : X_n \in \mathbb{R}^p, Y_n \in \mathbb{R}\}$ of size $N$, where $X_n = (X_{n1}, \ldots, X_{np})$ denotes the $p$-dimensional covariates. Under right censoring, we observe $Y_n = \min\{T_n, C_n\}$, where $T_n$ and $C_n$ denote the survival time and censoring time of the $n$-th subject, respectively. Assume the $N$ subjects have been sorted by a known biomarker. We specify the following AFT model:
$$T_n = \sum_{j=1}^{p} X_{nj}\,\beta_{nj} + \varepsilon_n, \quad n = 1, \ldots, N,$$
where $\varepsilon_n$ is a random error with mean zero. The unknown coefficients $\beta_j = (\beta_{1j}, \ldots, \beta_{Nj})$ can be estimated by the weighted least squares method [42], where the weights are Kaplan–Meier weights. Let $Y_{[1]} \le \cdots \le Y_{[N]}$ be the order statistics of $Y_n$, $n = 1, 2, \ldots, N$, and $\delta_{[n]}$ the associated indicator functions. The Kaplan–Meier weights can be computed as
$$w_1 = \frac{\delta_{[1]}}{N}, \qquad w_n = \frac{\delta_{[n]}}{N - n + 1}\prod_{i=1}^{n-1}\left(\frac{N - i}{N - i + 1}\right)^{\delta_{[i]}}, \quad n = 2, \ldots, N.$$
The weighted least-squares loss function becomes
$$L(\beta) = \frac{1}{2}\sum_{n=1}^{N} w_n\Big(Y_{[n]} - \sum_{j=1}^{p} X_{[n]j}\,\beta_{[n]j}\Big)^2,$$
where $X_{[n]}$ is the covariate vector associated with $Y_{[n]}$ and the $\beta_{[n]j}$'s are the corresponding coefficients.
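The Kaplan–Meier weight formula above can be sketched in code. This is an illustrative implementation (function name and array layout are assumptions); it returns the weights aligned with the sorted order of the observed times.

```python
import numpy as np

def kaplan_meier_weights(y, delta):
    """Kaplan-Meier (Stute-type) weights for the weighted least-squares AFT fit.

    y     : (N,) observed times, Y_n = min(T_n, C_n)
    delta : (N,) event indicators, 1 = event observed, 0 = censored
    Returns (w, order): weights aligned with the sorted order of y,
    and the sorting permutation itself.
    """
    N = len(y)
    order = np.argsort(y)              # order statistics Y_[1] <= ... <= Y_[N]
    d = np.asarray(delta)[order]       # delta_[n] in sorted order
    w = np.zeros(N)
    prod = 1.0                         # running product over i < n
    for n in range(N):                 # 0-indexed; n here is the paper's n - 1
        w[n] = d[n] / (N - n) * prod
        prod *= ((N - n - 1) / (N - n)) ** d[n]
    return w, order
```

With no censoring (all indicators equal to one), every weight reduces to $1/N$, so the loss collapses to ordinary least squares; censored observations receive zero weight, with their mass redistributed to later event times.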

Appendix B
Proof of Theorem 1. From the definition of $\hat\beta$: From the Cauchy–Schwarz inequality, we have: We define $\kappa_p = \max_{1 \le j \le p} \|\beta_j\|_\infty$. Under the event $\Omega$, by Lemma 1 in Huang et al. [26], for any $1 \le j \le p$, we have: As a result, from (A2): According to Condition (C2), we finally obtain: This completes the proof of Theorem 1.

Appendix C
Proof of Theorem 2. Consider the Karush–Kuhn–Tucker (KKT) conditions: where $e_N$ is an $N \times 1$ vector whose elements are all 1s. We define $Z^* = \sqrt{N}\big(\tfrac{1}{N} Z^\top Z + \lambda_2 J\big)^{1/2}$ and $Y^* = (Z^*)^{-1} Z^\top Y$. Therefore, $\hat\beta$ is also the minimizer of the following objective function: As a result, if $\|\hat\beta_j\|_2 \ne 0$ for $j \in E_1$, then, by the KKT conditions: where we have Let $\hat u = \hat\beta - \beta^0$ and $S = Z^\top \varepsilon / \sqrt{N}$. As a result, if there exists a $\hat u$ such that: then we have $\|\hat\beta_j\|_2 = 0$ for $j \in E_0$ and $\|\hat\beta_j\|_2 \ne 0$ for $j \in E_1$. If the conditions in (A3) are satisfied, we have $\|\hat\beta_j\|_2 = 0$ for $j \in E_0$ and $\|\hat\beta_j\|_2 \ne 0$ for $j \in E_1$. We define the events: Then, we have: then, for any $j \in E_1$, From Condition (C6) and Lemma 1 in Huang et al. [26], we have From Condition (C2), $P(R^c) \to 0$. Then, we only need to prove $P(D_2 \cap R) \to 0$. Since $\Sigma_{E_1 E_1}$ is invertible, we can prove that, for any $j \in E_1$, From Condition (C6), we have Under Conditions (C2) and (C5), we know $P(E^c) \to 0$ and, thus, only need to prove that $P(D_3 \cap E) \to 0$. Since $\Sigma_{E_1 E_1}$ is invertible, we have, for any $j \in E_0$, Then, where the function $q^*_N(\cdot)$ is the same as $q^*_n(\cdot)$ in Lemma 1 of Huang et al. [26]. Therefore, from Lemma 1 of Huang et al. [26] and Condition (C6), $P(D_3 \cap E) \to 0$.
Finally, we consider $D_4$. To prove $P(D_4) \to 0$, we only need to prove $P(D_4 \cap E) \to 0$. Since $\Sigma_{E_1 E_1}$ is invertible, we can prove that, for any $j \in E_0$: Namely, $P(D_4 \cap E) \to 0$. This completes the proof of Theorem 2.
Remark A1. We show that the marginal regression estimator satisfies Condition (C5) under some assumptions and can thus be used as the initial estimator. With the standardization of $X = (X_1, X_2, \ldots, X_p)$, the estimated marginal regression coefficients become: We define: so that the non-zero coefficients' signals are bounded away from zero at certain rates.
Similarly to the "partial orthogonality" condition in Huang et al. [26], we assume that the correlation between the covariates with zero coefficients and those with nonzero coefficients (multiplied by the corresponding coefficients) is not large, that is: For $k \in E_0$, we have $|\xi_k| \le q\rho_N$. Assume From Condition (C6), $q\rho_N < 1$. From Lemma 1 in Huang et al. [26], for any $\epsilon > 0$, if $r = o\big(\sqrt{N}/(\log p \log N)\big)$, we have: When $p = O\big(e^{N^{1/2 - \delta}}\big)$ with $0 < \delta < 0.5$, $r$ can be set as $O(N^{\delta - c}/\log N)$ for a small $c > 0$. Therefore, the marginal regression estimator satisfies Condition (C5).

Figure A3. The blue lines represent the true coefficients, the orange ones the coefficients estimated by New-Lasso, and the shaded areas the 95% confidence intervals. The x-axis represents the subjects, and the y-axis the coefficient values.

Figure A4. Estimated coefficients under Scenario 5 with N = 500, q = 40, and ρ = 0.8. The blue lines represent the true coefficients, the orange ones the coefficients estimated by New-Mar, and the shaded areas the 95% confidence intervals. The x-axis represents the subjects, and the y-axis the coefficient values.

Figure A5. Analysis of LUAD data using the proposed approach: estimated coefficients of the 7 genes for all subjects. The x-axis represents the subjects, and the y-axis the coefficient values.