Iterative Variable Selection for High-Dimensional Data: Prediction of Pathological Response in Triple-Negative Breast Cancer

: Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole genome context. The process of deﬁning a list of genes that will characterize an expression proﬁle remains unclear. It currently relies upon advanced statistics and can use an agnostic point of view or include some a priori knowledge, but overﬁtting remains a problem. This paper introduces a methodology to deal with the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.


Introduction
Breast cancer (BC) is the most frequent cancer among women, representing around 25% of all new cancer diagnoses in women [1]. One in eight women in developed countries will be diagnosed with BC over the course of a lifetime.
The prognosis of this disease has progressively improved over the past three decades, due to the implementation of population-based screening campaigns and, above all, the introduction of new effective targeted medical therapies, i.e., aromatase inhibitors (effective in hormone receptor-positive tumors) and trastuzumab (effective in HER2-positive tumors). Breast cancer is, however, a heterogeneous disease. The worst outcomes are associated with the so-called triple-negative breast cancer subtype (TNBC), diagnosed in 15-20% of BC patients. TNBC is defined by a lack of immunohistochemistry expression of the estrogen and progesterone receptors and a lack of expression/amplification of HER2 [2]. The absence of expression of these receptors makes chemotherapy the only available therapy for TNBC.
TNBC is usually diagnosed in an operable (early) stage. Surgery, chemotherapy and radiation therapy are the critical components of the treatment of early TNBC. Many early TNBC patients are treated with upfront chemotherapy (neoadjuvant chemotherapy, NACT) and then operated on and, perhaps, irradiated. The rationale for this sequence is the ability to predict the long-term outcome of patients looking at the pathological response achieved with initial NACT [3].
With the currently available neoadjuvant chemotherapy regimens, nearly 50% of TNBC patients achieve a good pathological response to this therapy, whereas the remaining patients have an insufficient response. TNBC patients achieving a complete or almost complete disappearance of the tumor in the breast and axilla after NACT have an excellent outcome (less than 10% of relapses at five years), in contrast with those with significant residual disease (more than 50% of relapses at five years) [4,5].
The identification of these two different populations is therefore of utmost relevance, in order to test new experimental therapies in the population unlikely to achieve a good pathological response.
Several tumor multigene predictors of pathological response of operable BC to NACT have been proposed over the past few years, taking advantage of the recent decreased economic cost of obtaining an individual's full transcriptome [6][7][8]. Most of them have been tested in unselected populations of BC patients and have shown insufficient positive predictive value and sensitivity.
The process of defining a list of genes that will define a characteristic expression profile is still ambiguous. This process relies upon advanced statistics and can use an agnostic point of view or include some a priori knowledge, but overfitting remains a problem. RNA-Seq has become one of the most appealing tools of modern whole transcriptome analyses because it combines a relatively low cost and a comprehensive approach to transcript quantification. Some approaches to complex disease biomarker discovery already pointed to the need to use a whole genome perspective using joint information in order to predict complex traits instead of a priori selecting individual features [9,10]. This strategy would lead to high predictive accuracy, and there would be no need to know the precise biological associations in the genome background because of the high correlation among the biomarkers [11]. This approach is challenging from the statistical point of view because of the large number of biomarkers that must be tested along the genome in relation to the rather small sample sizes in clinical studies. On the other hand, daily clinical practice scenarios require cheaper and faster quantification platforms than whole-genome RNA-Seq analysis. Thus, it is necessary to reduce the number of biomarkers to focus on in order to define a practical gene expression signature for the clinical community.
Regularized regression methods provide alternatives for performing multi-marker analysis and feature selection in a whole genome context [12]. Specifically, we focus on the sparse-group lasso (SGL) regularization method [13], which generalizes lasso [14], group lasso [15] and elastic-net [16], merging lasso and group lasso penalties. The solution provided by SGL usually involves a small number of predictor variables, given that many coefficients in the solution are exactly zero. It has an advantage over lasso when the predictor variables are grouped, as many groups are entirely zeroed out, but unlike group lasso, the solution is also sparse within those groups that are not completely eliminated from the model. However, as will be explained in the next sections, the SGL is not appropriate for the problem we are dealing with without introducing a broader methodology to control the regularization hyper-parameters, the groups, and the high-dimensionality issue. From a methodological point of view, this paper provides an original contribution to perform variable selection and model fitting in high-dimensional problems, allowing a priori selection of the final number of variables and addressing the problem of overfitting with the introduction of the importance index. Furthermore, the results presented in this paper are the first attempt in a translational oncology scenario at building a predictive model for the response to treatment, based entirely on whole genome RNA-Seq data and conventional clinical variables. This paper is organized as follows. Section 2 ties together the various theoretical concepts that support our approach. Section 2.1 introduces the mathematical formulation of the SGL as an optimization problem. Section 2.2 discusses the iterative-sparse group lasso, a coordinate descent algorithm used to automatically select the regularization parameters of the SGL. Section 2.3 describes a clustering strategy for the variables, based on principal component analysis, which makes it possible to work with an arbitrarily large number of variables without specifying the groups a priori. Section 2.5 highlights our main methodological contributions: the importance and the power indexes, to weight variables and models, respectively. In Section 3, a simulation study is presented, with several synthetic matrix designs, and varying the number of variables from 40 to 4000. Section 4 highlights the contributions of our methodology on a TNBC cohort that had undergone neoadjuvant docetaxel/carboplatin chemotherapy. Some conclusions and outlines for future work are drawn in the final section.

Methodology and Algorithms
Consider the usual logistic regression framework, with N observations in the form where p is the number of features or predictor variables, and y (i) is the binary response. We assume that the response comes from a random variable with conditional distribution, and η is the linear predictor, The objective is to predict the response Y for future observations of X 1 . . . X p , using an estimation of the unknown parameter β, given by: where:R The problem with this approach is that for N < p, the minimization (1) has infinite optimal solutions. When the features X 1 . . . X p represent genetic expressions, this problem of predicting Y becomes more extreme, since we often have N several orders of magnitude smaller than p.
As a solution, variable selection techniques are proposed, in order to tackle the analytical intractability of this problem.

The Sparse-Group Lasso
It has been shown that SGL can play an important role in addressing the issue of variable selection in genetic models, where genes are grouped following different pathways. The mathematical formulation of this problem is: Here J is the number of groups, β (j) ∈ R p j are vectors with the components of β corresponding to j-th group (of size p j ), and γ j = √ p j , j = 1, 2, . . . , J. The regularization The problem with (3) is that the vectorβ(λ) of estimated coefficients depends on the selection of a vector of regulation parameters λ, which must be chosen before estimatinĝ β(λ). The selection of λ is partly an open problem, because although there are several practical strategies for choosing these parameters, there is no established theoretical criterion to follow. In most cases, the regularization parameters are set a priori, based on some additional information about the data, or the characteristics of the desired solution, e.g., a greater λ 1 implies that more components ofβ are identically zero. The most commonly used methodology to select λ consists of moving the regulation parameters in a fixed grid, which is usually not very thin. However, this approach has many disadvantages. By contrast, we propose the iterative-sparse group lasso, a coordinate descent algorithm, recently introduced in [17].

Selection of the Optimal Regularization Parameter
is partitioned into three disjoint data sets, Z T , Z V , and Z test . The data in Z T are used for training the model, i.e., solving (3). Z V is used for validation, i.e., finding the optimal parameter λ. The remaining observations in Z test are used for testing the prediction ability of the model on future observations. Specifically, the selection of the optimal parameter λ is based on the minimization of the validation error, defined as: where:β and:R with # denoting the cardinal of a set. Therefore, the problem of finding the optimal parameter λ can be formulated as: Algorithm 1 describes the two-parameter ITERATIVE SPARSE-GROUP LASSO (iSGL 0 ), a gradient-free coordinate descent method to tune the parameter λ from the sparse-group lasso (3), which performs well under different scenarios while drastically reducing the number of operations required to find optimal penalty weight parameters that minimize the validation error in (4). The iSGL 0 iteratively performs a univariate minimization over one of the coordinates of λ, while the other coordinate is fixed.
As mentioned before, a very useful property of the sparse-group lasso as a variable selection method is the ability to remove entire groups from the model (sending to zero the components of theβ vector relative to those groups), as is the case with group lasso. However, this means that a grouping among the variables under consideration must be specified. This does not entail a challenge if there are natural groupings among the variables, e.g., if the variables are dummies related to different levels of the same original categorical variable. However, in our study most of the variables are transcriptomes, for which there are no established groupings in the literature. To overcome this problem, we suggest an empirical variable grouping approach, based on the principal component analysis of the data matrix.

Grouping Variables Using Principal Component Analysis
Principal component analysis (PCA) is a dimension reduction technique, which is very effective in reducing a large number of variables related to each other to a few latent variables while trying to lose a minimum amount of information. The new latent variables obtained (the principal components), which are a linear transformation of the original variables, are uncorrelated and ordered in such a way that the first components capture most of the variation present in all of the original variables.
Given the data matrix X ∈ R N×p , PCA computes the rotation matrix W ∈ R p×G , where G ≤ min(N, p) is the number of principal components to retain. The transformed data matrix (the principal component matrix) is T = XW. This rotation matrix W suggests a natural grouping on the columns of X, given by: This strategy will provide at most G groups on the columns of X.
The following example illustrates our approach on a simulated data set. Suppose that we want to cluster variables X 1 , X 2 , and X 3 using two groups. There are 300 observations ( Figure 1) and they are simulated such that corr(X 1 , X 2 ) = 0.75, corr(X 1 , X 3 ) = 0.1, and corr(X 2 , X 3 ) = −0.25. The principal components rotation matrix W is displayed in Table 1. Table 1. Principal components rotation matrix W.

PC1
PC2 Figure 1. Simulated sample from three random variables, that illustrate the grouping based on principal component analysis (PCA).
In this example, X 1 and X 2 would be grouped together, whereas X 3 would be in the other group. This method places highly correlated variables in the same group.

Mining Influent Variables under a Cross-Validation Approach
In this section, we focus on the problem of variable selection in models where the ratio p/N is in the order of 10 2 . In these scenarios, even state-of-the-art methods such as SGL find it hard to select an appropriate set of variables related to the response term. We propose a cross-validation approach to fit and evaluate many different models using only a sample size of N initially given observations. The solution in terms ofβ(λ) provided by Algorithm 1 strongly depends on the partition Z T , Z V . As a consequence, if we run Algorithm 1 for different partitions Z T , Z V of the same data Z, it will probably result in different coefficient estimatesβ(λ). Therefore, the indicator function of variable X j included in the model, 1(β j (λ) = 0), will take different values depending on the partition Z T , Z V . In order to avoid this dependency in the sample data partition, we propose Algorithm 2, which computes many different solutionsβ(λ) of Algorithm 1 for different partitions of the original data sample Z. The goal of this algorithm is to be able to fit and evaluate many models using the same data. Since the sample size is small compared to the number of covariates, the variable selection will greatly depend on the train-validate partition. We denote by R the total number of models that will be fitted using different partitions from the original sample. Algorithm 2 stores the information of the fittingβ of each model and the correct classification rate in the validation sample (ccr V ) in each case.

Selection of the Best Model
Our objective is to select one of the R models computed in Algorithm 2 to be our final model. We believe that a selection only based on the maximization of ccr V could lead to overfit in the training sample data Z. To overcome this problem, we define two indexes: the importance index of a variable and the power of a model. These indexes are fundamental to choosing a final model that does not overfit the data.
We consider the importance index I j of variable X j , defined as: where β (r) and ccr (r) V are those returned by Algorithm 2 on data Z. With the objective of penalizing those models that performed poorly on the validation set, the term δ has been introduced, which is the maximum betweenȳ and 1 −ȳ, i.e., the null model correct classification rate.
The importance index weights each variable X 1 . . . X p differently depending on the correct classification rate of the models in which each variable was present. The larger I j , the greater the chances of X j being present in the underlying model that generated the data Z. Figure 2 illustrates the importance index, computed on a simulated data set, with N = 100 observations and p = 400 variables. Notice that the most important highest variables are actually in the generating model, and there is a clear gap in Figure 2 between them and the rest of the variables. Based on the maximization of the importance index, an appropriate subset is selected from the original p variables. Although the true number of variables involved in the model is unknown, we can focus our attention on a predefined number of important variables K, which depends only on the sample data Z. We empirically found K = √ N/2 to achieve good results. Using the importance index of the best K variables, we define the power of a model as: where I (k) denotes the k-th greatest importance index, e.g., I (1) = max j I j . The power index P weights each model, depending on the importance of its included variables. The selection of the final model is based on the criterion: Equation (11), Algorithm 2, and the framework that supports them, are the main contribution of this paper from a methodological point of view. Equation (11) is based on the correct classification rates of R different fitted models, two indexes defined in this paper, and the iterative sparse-group lasso, which is a novel algorithm.

A Simulation Study
In this section, we illustrate the performance of Algorithm 2 using synthetic data. To generate observations, we have followed simulation designs from [13] (uncorrelated features) and [14,18,19] (correlated features). Since our objective was to evaluate Algorithm 2 in binary classification problems, we used a logistic regression model for the response term using the simulated design matrices in each case. We simulated data from the true model: with the logistic response y given by: Five scenarios for β and X were simulated. In each example, our simulated data consisted of a training set of N = 50 observations and p variables, and an independent test set of 5000 observations and p variables. Models were fitted using training data only. Below are the details of the five scenarios. (SFHT_2) In this example, β is generated as in SFHT_1, but the rows of the model matrix X are i.i.d. generated from a multivariate gaussian distribution with cov(X i , X j ) = 0.5 |i−j| , 1 ≤ j ≤ i ≤ p. ) and the rows of X were generated as follows: . . , 10, where x i are i.i.d. N(0, 0.01), for 1 ≤ i ≤ 15. We aimed to investigate the robustness of our methodology in each example, measured using the test accuracy, as the number of noisy variables (not in the generating model) increased. Table 2 describes the performance of the final model (Algorithm 2 with importance index) selected under our methodology in the scenarios described above. We have conducted 30 experiments in each case, as we varied the number of variables in the model (p). Mean standard errors are given in parenthesis. To establish a baseline comparison, we also included Lasso with grid search, and our methodology with groups known. Table 2 reveals that for all the configurations (except, perhaps Tibs_1 and SHFT_1) the methodology is very robust with respect to an increase in the number of variables p. In fact, for most of them, the ccr does not vary much from p = 400 to p = 1000. Intuitively, the grouping strategy introduced in Section 2.3 places highly correlated variables in the same groups, producing better results when there is correlation between the variables in the model. That is why the simulation scheme SFHT_1 produces the poorest results. In SFHT_1, all the simulated variables are independent and therefore, there is not any clear way to group the variables. From a computational point of view, however, Algorithm 2 is very expensive compared with lasso. In fact, the parameter R in Algorithm 2 is expected to linearly increase the computational burden of the method. That is because Algorithm 2 loops through R independent runs of Algorithm 1, which multiplies the computational cost of running one instance of Algorithm 1 by a factor R. However, the actual time needed to obtain a solution for Algorithm 2 can be decreased to the maximum of one instance of Algorithm 1, because this is a parallel problem. The remaining computations performed to select the best model can be ignored because they are at the same order of a matrix-vector product. Therefore, the parameter K is not going to affect the computational performance of our method. However, it should affect the predictive performance, since K is directly related to the model selection. Figure 3 compares the computational and predictive performance of our method and lasso on a numerical experiment. Figure 3 confirms that K does not affect the computational burden, and its impact on the predictive performance is also negligible in this case, although it has a peak when K equals the true number of variables in the generating model.
To empirically evaluate the impact of R, we propose a numerical simulation. Figure 4 displays the average elapsed time needed to obtain a solution to Algorithm 2 (Alg. 2) and lasso with grid search (lasso-GS), varying the parameter R (which does not affect lasso, which has been included as a baseline). The illustrated simulation design was ZH_d, with n = 50 and p = 200 both fixed, and R varying between 1 and 300. These experiments demonstrate that the relationship between R and the elapsed time is linear. In addition, Figure 4 confirms that we are trading computational cost for an increase of more than 10% in the accuracy, which we believe is a reasonable trade for most applications.

Application to Biomedical Data
In this section, we evaluate the methodology described in Algorithm 2 with the model selection criterion given by (11) on a real case study. A sample of TNBC patients from a previously published clinical trial [20] was used to analyze relations between cancer cells' transcriptome and the response of patients to the given medical treatment (docetaxel plus carboplatin). The dataset was composed of 93 observations (patients) and 16, 616 variables (genetic transcripts and clinical variables). Figure 5 shows the highest 30 importance indexes out of a total of 16, 616 variables. The criterion used to measure the importance of the variables is given in (9). Algorithm 2 was run with R = 200, and the cutoff value was set to K = √ N/2 = 7, as described in Section 2.5. With this importance index, the power of each model was computed using (10) and the best model was chosen according to (11), as highlighted in Figure 6. The selected model included 843 out of 16, 616 variables. The grouping strategy mentioned in Section 2.3 resulted in a total of 82 groups, from which 18 were included in the final model. Figure 7 displays the distribution of the number of non-zero coefficients for each group that was included in the final model, which is revealing in several ways. Firstly, it indicates that PCA finds groups of similar lengths, and secondly, the selected model is sparse at both the group and the variable levels.
In an attempt to discover the biological and genetic meaning in the model selected by our methodology, we ran DAVID [21,22] to detect enriched functional-related gene groups. The clustering and functional annotation was performed using the default analysis options, and the role of the potential multiple testing effect was considered using the false discovery rate (FDR). The results are detailed in supplementary Table S1.
We observed just two remarkable families of pathways after the gene enrichment analysis: the homeobox-related and the oxidative phosphorylation pathways. They are both involved in the mechanism of action of docetaxel and carboplatin in response to the provided treatment.  The homeobox genes have been proposed to be involved in mechanisms of resistance to taxane-based oncologic treatments in ovarian and prostate cancer [23][24][25][26]. Docetaxel hyper-stabilizes the microtube structure, irreversibly blocking the cytoskeleton function in the mitotic process and intracellular transport. In addition, this drug induces programmed cell death.
On the other hand, carboplatin attaches alkyl groups to DNA bases, resulting in fragmentation by repair enzymes when trying to repair them. It also inducts to mutations due to nucleotide despairing and generates DNA cross-links that affects the transcription process [27]. The development of resistance to platinum-based schemes of chemotherapy is a common feature. Several studies demonstrate that dysfunctions in mitochondrial processes, in conjunction with the mentioned mechanism of action, can contribute to the development of phenotypes associated with resistance [28][29][30][31][32][33].

Conclusions
The present study introduced a methodology to deal with the variable selection problem in a high dimensional set-up. It can be seen as an extension of the sparse-group lasso regularization method, without dependencies on both the hyper-parameters and the groups. There are several critical components in this approach: • A clustering of the variables, based on PCA, makes it possible to work with an arbitrarily large number of variables without specifying groups a priori. • The iterative sparse group lasso removes the dependence on the hyper-parameters of the sparse group lasso, but is sensible to the train-validate sample partitions. This problem has been solved running the algorithm for a large number of different train-validate sample partitions (Algorithm 2). • The correct classification rate of each model in its respective validation sample is stored. Notice that this is an overestimation of the true correct classification rate on future observations, and the highest validation rate does not imply the best model.

•
The importance index weighs the variables, based on the correct classification rate of the models that include them. • The power index weighs the models, based on the importance of the variables they include.
This methodology was tested on a sample of TNBC patients, trying to reveal the genetic profile associated with resistance to the treatment of interest. The literature studies mentioned in Section 4 provide a rationale supporting the potential predictive value of the two gene pathways identified in our study (the homeobox-related and the oxidative phosphorylation pathways). In order to validate these results, we are testing the model in a new cohort of TNBC patients from the same clinical trial.
Future studies should examine other strategies to grouping variables, as discussed in Section 2.3, based on supervised algorithms as well as unsupervised ones.
Supplementary Materials: The following are available online at https://www.mdpi.com/2227-7 390/9/3/222/s1: Table S1: DAVID functional analysis of selected genes, the enriched pathways and their sources. Each pathway underwent a modified Fisher's exact test (EASE score) in order to determine if the sparse-group lasso model genes were over-represented in those gene sets. The Fold Enrichment and the PValue measure the magnitude of enrichment. In addition, Bonferroni, Benjamini, and FDR techniques are provided to globally correct enrichment P-values to control the family-wide false discovery rate. Some basic metrics regarding the number and percentage of genes in the studied pathways are shown in the Count and % columns.