Skewness-Kurtosis Model-Based Projection Pursuit with Application to Summarizing Gene Expression Data

: Non-normality is a usual fact when dealing with gene expression data. Thus, ﬂexible models are needed in order to account for the underlying asymmetry and heavy tails of multivariate gene expression measures. This paper addresses the issue by exploring the projection pursuit problem under a ﬂexible framework where the underlying model is assumed to follow a multivariate skew-t distribution. Under this assumption, projection pursuit with skewness and kurtosis indices is addressed as a natural approach for data reduction. The work examines its properties giving some theoretical insights and delving into the computational side in regards to the application to real gene expression data. The results of the theory are illustrated by means of a simulation study; the outputs of the simulation are used in combination with the theoretical insights to shed light on the usefulness of skewness-kurtosis projection pursuit for summarizing multivariate gene expression data. The application to gene expression measures of patients diagnosed with triple-negative breast cancer gives promising ﬁndings that may contribute to explain the heterogeneity of this type of tumors.


Introduction
The development of high-throughput technologies has provided the scenario to simultaneously monitor the expression levels of hundreds of genes in an attempt to obtain insights about the molecular mechanisms of human diseases. Genomic studies usually involve a vast number of measures quantifying the expression levels of genomic information from patients. An issue arising in this scenario is concerned with the data reduction through the construction of new genomic features that can summarize the expression levels of a set of genes sharing either a specific clinical characteristic or a well-established biological function. Standard methods for addressing the issue are based on first and second order moments indices using averages of expression levels and the first principal component conveying the largest variability respectively, the latter being an approach that accounts for gene dependencies provided that gene expression measures fit to the multivariate normal model. Since gene expression measures usually exhibit asymmetries and heavy tails, the normality assumption is not realistic [1][2][3][4] and dimension reduction methods based on first and second order moments entail obvious theoretical limitations. Thus, a dimension reduction approach based on higher moments is a better suited approach to capture the non-normality of this type of data. This is the motivation for exploring the skewness-kurtosis-based projection pursuit (PP) problem as a dimension reduction technique to summarize gene expression data.
This paper revisits the PP problem which in short is concerned with the search of "relevant" projections in multivariate data through the maximization of a non-normality index [5]. When the third (fourth) order moment is considered then skewness (kurtosis) is taken as projection index and the problem reduces to finding the direction that yields the maximal skewness (kurtosis) projection, an idea early proposed by [6] which has revived increasingly attention in an attempt to understand and interpret the derived projections under flexible parametric models for skewness [7][8][9][10][11][12][13][14] and kurtosis [9,[15][16][17][18] indices.
The multivariate skew-t (ST) distribution has become a widely used parametric model in multivariate data analysis, due to its tractability and appealing properties, which allows the handling of asymmetry and tail weight behavior simultaneously. In this paper we propose to study the model-based PP problem under the flexible class of multivariate ST distribution using the skewness-kurtosis projection indices in the context of summarizing multivariate gene expression measures. The rest of the paper is organized as follows: Section 2 gives a general theoretical overview about the ST family and presents some motivation for its use by assessing the multivariate normality of gene expression measures from breast cancer data. Section 3 discusses the skewness-kurtosis-based PP problem under the multivariate ST model; we examine the role of the shape vector, which accounts for the non-normality the model, in the derivation of the PP direction achieving the maximal skewness-kurtosis. The theoretical results are illustrated by means of a simulation experiment with synthetic data in Section 4; the simulation experiments reveal important findings with useful implications for the application to a real gene expression cancer data set; the application is discussed in Section 5. Finally, Section 6 summarizes the main findings.

The Skew-Normal and Skew-T Distributions
The multivariate skew-normal (SN) distribution has become an increasingly used model that regulates departures from normality by means of a shape vector dealing with the multivariate asymmetry; its study has originated fruitful research [19][20][21][22][23][24]. In this paper we adopt the formulation given by [19,24,25] to define the density function of a normalized SN vector Z by f (z; 0, Ω, α) = 2φ p (z; Ω)Φ(α z) : z ∈ R p , where Ω is a p × p correlation matrix, φ p (z; Ω) is the density function of a p-dimensional normal vector with zero mean and covariance matrix Ω, Φ is the distribution function of a standard N(0, 1) variable and α is a shape p × 1 vector.
To introduce location and scale parameters into this model, it is considered a location vector ξ = (ξ 1 , . . . , ξ p ) and a diagonal matrix ω = diag(ω 1 , . . . , ω p ) with non-negative entries which converts the correlation matrix into a scale matrix Ω = ωΩω; as a result, the vector X = ξ + ωZ has a SN distribution with density function f (x; ξ, Ω, α) = 2φ p (x − ξ; Ω)Φ(α ω −1 (x − ξ)) : x ∈ R p . ( The parameters in the density above are the location ξ, the scale matrix Ω and the shape vector α, or η = ω −1 α, which regulates the multivariate asymmetry of the model. We write X ∼ SN p (ξ, Ω, α) to denote that X follows a SN distribution with density (2); note that when α = 0 the multivariate normal density function is recovered.
As we know, the SN vector admits the stochastic representation: X = ξ + ωZ, with Z following a normalized skew-normal distribution with density (1). The multivariate ST distribution arises as a generalization of the SN when tail weight is injected into the model by incorporating a mixing variable S, independent of the vector Z, in the stochastic representation of the SN vector as follows: X = ξ + ωSZ. The mixing variable is given by [26]; as a result, it can be shown that the density function of X is with the quantity Q x above given by Q We will write X ∼ ST p (ξ, Ω, α, ν), or equivalently X ∼ ST p (ξ, Ω, η, ν), to indicate that X follows a p-dimensional ST distribution with density function (3). Please note that when ν → ∞ the ST reduces to the SN distribution, i.e., X ∼ SN p (ξ, Ω, α). In addition, when α = 0, the ST model becomes the p-dimensional t distribution for which the tail weight parameter ν controls the non-normality of the model.

Motivating Example
Cancer patients who are diagnosed with triple-negative breast cancer (TNBC) define a heterogeneous subtype of breast cancer with a worse prognosis than patients diagnosed by other cancer subtypes such as the estrogen-receptor positive (ER + ) or the HER2-positive. A data set containing gene expression measures for 494 TNBC patients was collected from GSE31519. An initial analysis allowed to reduce the original list with 13,146 genes to a new list containing only 1998 genes with the highest variability in their expression measures. This data set is used to illustrate the non-normality of multivariate gene expression measures.
To assess the normality assumption, we carry out multivariate normality tests for groups with p = 2, 5, 10 genes. For each dimension, a subset with p genes is selected at random and the multivariate normality is assessed by the p-value of the test; the experiment is repeated 10,000 times for each one of the following tests: Shapiro-Wilk's test [27,28], skewness and kurtosis tests implemented in the ICS R package [29], and Mardia's [30], Henze-Zirkler's [31], Doornik-Hansen's tests [32] implemented in the MVN R package [33].
The results appear in Table 1 which displays the number of rejections for each test at significance levels: α = 0.005, 0.01, 0.05. We can see that the rejection is higher for the larger dimensions; overall, we can observe a great deal of rejections, even for the smallest significance level, so that it can be concluded that the multivariate normal model does not fit the gene expression measures of TNBC patients. The non-normality issue has been tackled by previous works which pointed out the cautions and caveats regarding the use of statistical methods that rely on the normality assumption [1,3].   's  7716  8079  8878  9798  9853  9952  9998  9998  10,000  ICS skewness  5503  5944  7090  8583  8830  9363  9867  9904  9969  ICS kurtosis  8327  8546  9082  9934  9946  9982  10,000  10,000  10,000  Mardia skewness  5899  6368  7528  9418  9562  9812  9998  9999  10,000  Mardia kurtosis  8358  8615  9221  9958  9975  9995  10,000  10,000  10,000  Henze-Zirkler  4571  5232  6942  9200  9389  9717  9999  10,000  10,000  Doornik-Hansen  8189  8513  9135  9866  9907  9964 10,000 10,000 10,000 To illustrate the suitability of non-normal multivariate distributions such as the SN and ST for modeling gene expression measures, we take the following five illustrative genes: (GNG10, EEF1G, UQCR10, DLST, ZMYM3). The analysis comprises the computation of the p-value for the normality tests given in Table 1 and the fit of Normal, SN and ST distributions to their expression measures by maximum likelihood using the selm function of the sn R package [34]: the p-values are 0.00803 (Shapiro-Wilk's), 0.00104 (ICS skewness), 0 (ICS kurtosis), 0.02695 (Mardia skewness), 0 (Mardia kurtosis), 0.01596 (Henze-Zirkler) and 0.00088 (Doornik-Hansen); the PP-plots obtained from the fit are depicted in Figure 1. The p-values are mostly against normality and the PP-plots show the adequacy of SN and ST distributions to handle the underlying non-normality of gene expression measures.
The results of our analysis reveal the limitations of the normal distribution for modeling multivariate gene expression data. Thus, we advocate for the use of more flexible models such as the SN and ST.

Skewness-Kurtosis Based Projection Pursuit
In this section, we study the skewness and kurtosis model-based projection pursuit problems with a goal on exploring the directions that yield the maximal skewness and kurtosis projections for a ST input vector X. From now on, it is assumed that the underlying model for the multivariate gene expression measures is a ST distribution such that X ∼ ST p (ξ, Ω, η, ν) with a density function given by Equation (3). Let us denote by U = Σ −1/2 (X − ξ) its scaled version, with Σ denoting the covariance matrix of the input vector X. We study the problems for the maximization of skewness and kurtosis separately.

Skewness Maximization
Now, we consider the scaled version U of the ST input vector. First, we address the problem of finding the direction c for which the scalar variable Y = c U attains the maximum skewness, as defined by the standardized third moment measure: Since γ 1 is scale invariant, the search of the direction yielding the maximal skewness projection can be formulated by the following problem: where S p = {c ∈ R p : c c = 1}, or equivalently by where d = Σ −1/2 c and S * p = {d ∈ R p : d Σd = 1}. The vectors providing the maximal skewness in the previous equivalent problems are denoted by which satisfy that λ skew,X ∝ Σ −1/2 λ skew,U . The quantity γ D 1,p = max is a well-known measure for assessing multivariate asymmetry in a directional fashion [6]. Hence, the direction driven by the vector λ skew,X can be used as a principal skewness direction that would allow a summary of multivariate data.

Kurtosis Maximization
When the focus is on kurtosis maximization the formulation can be established in a similar way. Now, we must find the vector c (or d) for which the scalar variable Y = c U (or d X) attains the maximal kurtosis, where U is the scaled version of the input vector X. Here, the kurtosis (excess) is quantified by the standard fourth order moment measure As in the previous case, due to the scale invariance of γ 2 , the problem admits the following two equivalent formulations: where S p = {c ∈ R p : c c = 1}, or where d = Σ −1/2 c and S * p = {d ∈ R p : d Σd = 1}. If λ kurt,U and λ kurt,X denote the vectors where the maximal kurtosis in expressions (7) and (8) is attained, respectively, then they satisfy that λ kurt,X ∝ Σ −1/2 λ kurt,U and provide the maximal kurtosis directions.
In fact, the maximal kurtosis measure was already introduced in the past to account for the directional nature of kurtosis [6].
The main theoretical result of this work is provided by Theorem 1. It essentially states that under the flexible class of multivariate ST distributions, the vectors yielding the maximal skewness and kurtosis agree and have a simple analytical form related to the shape vector of the multivariate ST model. Theorem 1. Let X be a random vector such that X ∼ ST p (ξ, Ω, η, ν) with degrees of freedom ν > 4. Then the maximal skewness-kurtosis projections in (5) and (8) are attained at the direction of the vector: λ skew,X = λ kurt,X = η/ η Ση, with Σ the covariance matrix of the vector X.

Proof of Theorem 1. See the Appendix A.
Theorem 1 provides a revealing theoretical finding to summarize multivariate nonnormality through maximal skewness-kurtosis projections; it states that the maximal non-normality is attained at the direction of the shape vector η of the model since such direction not only maximizes skewness but kurtosis as well. This is also the case for vectors following a SN distribution, as stated by [9], who wondered about its validity for the ST distribution (see Section 6 in [9]). As a result, the shape vector may be interpreted as a parameter that accounts for the multivariate non-normality of the ST model in a directional way. The result also points out the parametric interpretation of the skewness-kurtosis-based PP problem under the ST distribution, a fact enhancing the inferential side of the problem which in turn poses computational implications when summarizing non-normal gene expression measures. The next section discusses in detail such computational issues giving two alternative methods for calculating maximal skewness-kurtosis projections from data.

Computational Issues
The first approach to compute the maximal non-normality direction comes from a non-parametric standpoint motivated by the representation of directional skewness as the maximum of an homogeneous third-order polynomial defined as follows [14]: where K 3 (U) is the p 2 × p third cumulant matrix of the scaled version U of the input vector X and the symbol ⊗ is used to denote the tensor product.
The problem above involves an iterative numerical algorithm that requires the choice of a proper initial direction in order to avoid local maxima. The use of the right dominant eigenvector of the empirical third cumulant matrix is suggested by the higher-order power method (HOPM) as a good starting direction [35], but without providing a theoretical justification. Interestingly, when the input vector X follows a ST distribution, it has been shown that the right dominant eigenvector of the third cumulant matrix appearing in (9) is proportional to Σ −1/2 γ, with γ = Ωη

+ η Ωη
, and also gives the direction achieving the maximal skewness projection for the scaled vector U [14]; therefore, the maximal skewness projection for X lies on the direction of η since Σ −1 γ is proportional to η -see Lemma 1 in [12]. This fact provides theoretical support for the HOPM algorithm and enhances its parametric interpretation. The previous argument also provides the theoretical support for a non-parametric method, based on the empirical third cumulant matrix, which serves to estimate η by resorting to the maximal skewness principle [14,36] as follows:

Method 1. Estimate the skewness-kurtosis-based PP direction by means ofη
, whereΣ is the sample estimate of Σ andû is the right dominant eigenvector of the empirical third cumulant matrix.
A second alternative method to address the skewness-kurtosis PP problem relies on the ST assumption for the underlying distribution. As Theorem 1 shows, this assumption brings the problem to the parametric field. Therefore, we can resort to maximum likelihood (ML) for estimating the shape parameter η using the functionalities of the sn R package [34]. Consequently, in order to compute the maximal non-normality projection we can use the ML method.

Method 2.
Estimate the skewness-kurtosis-based PP direction by means ofη 2 =η withη the ML estimation of the shape vector.
The next sections describe how both methods are applied. First, their performance is evaluated in a simulation study with artificial data drawn from scenarios which are designed by varying the characteristics of the underlying ST model and the sampling scheme. A real data application for the TNBC patients of the genomic experiment that motivated Section 2.2 is also provided to illustrate how they work to summarize multivariate gene expression measures.

Application to Synthetic Data
In this section, we carry out a simulation study to evaluate the accuracy of estimationŝ η 1 andη 2 . The experiments of the simulations are controlled by several sources that may affect the sampling behavior of the estimators. In addition to the sample size n and the dimension p of the input vector, some additional parameters ρ, τ, e and the degrees of freedom ν of the ST are involved in the design of each simulation scenario: The first one, ρ, is used to define the correlation matrix Ω with a Toeplitz structure as follows: Ω = (ω i,j ) 1≤i,j≤p , with ω i,j = ρ |i−j| : 1 ≤ i ≤ j ≤ p, so that the couple (ω, ρ) determines the scale matrix Ω of the model, where ω must be set in advance using a well-established criterion explained in a while. Given a direction defined by a unit length vector e, asymmetry is injected into the multivariate model across the direction e by an amount τ so that α = τe. It is worthwhile noting that the couple (τ, ν) are non-normality indices closely related to the asymmetry and tail weight behavior of the multivariate model; they account for the non-normal features of the multivariate ST model and also determine the position of the first principal component derived from its covariance matrix. Finally, for the sake of simplicity location is set at the null vector ξ = 0.
Each scenario is designed by setting specific values for the aforementioned parameters. Two thousand records for the estimations ofη 1 andη 2 are obtained by drawing samples of sizes n = 100, 200 from a ST distribution with the corresponding parameters. Finally, the mean square error (MSE) is calculated by comparing the unit length vectors obtained by both estimation methods with the theoretical unit norm shape vector. The simulation study is accomplished using several facilities of the sn and MaxSkew R packages [34,36].
The next sections provide an overview of the results for the bidimensional case and when the dimension is greater than two.

Simulation Study for the Bidimensional Case
Now we consider the bidimensional case; the simulation study is carried out for several scenarios defined by the following settings: ρ = 0.2, 0.7, ratio ω 2 /ω 1 = 1, 2 with ω = (ω 1 , ω 2 ), and values for the non-normality couple (τ, ν) equal to (1, 10), (1, 5), (5, 10), (5,5). The results about the accuracy of both estimation methods are shown by the MSEs appearing in Tables 2, 3 and 4. On the other hand, additional detailed visualizations are provided by the "clock-plots" depicted in Figure 2 which display the following: the maximal non-normality direction in black, the unit length vector e represented by the pendulum, the direction yielding the first principal component of Σ in gray, a cloud of points and finally the locations of the estimated directionsη 1 (outer locations of the clockplot) andη 2 (inner locations of the clock-plot), with the gray intensity representing in a visual way the density of directions.         When e is taken so that it lies on the direction of the first principal component of the scale matrix Ω, the performance of both estimations is summarized by the MSEs shown in Table 2. Overall, we can observe that the MSE increases with ρ and the ratio ω 2 /ω 1 . As expected, the smaller MSEs are observed for the larger sample size withη 2 giving more accurate estimations in nearly all the cases. A revealing phenomenon is that whereas the closer scenario to multivariate normality (τ, ν) = (1, 10) exhibits the higher errors, changes in the pair (τ, ν) towards non-normality give rise to remarkably higher error reductions forη 2 , mainly in scenarios corresponding to (τ, ν) = (5, 10) and (τ, ν) = (5, 5); taking into account this finding, the most accurate estimations arise for the aforementioned non-normality couples when ρ = 0.2 and ω 1 = ω 2 . However, the MSE ofη 1 deteriorates as we inject tail weight: we can see that forη 2 the MSE decreases when we departure from normality through changes in (τ, ν), although this is not the case forη 1 as shown by the peak of the MSE when (τ, ν) = (5, 5). In short, both estimation methods may exhibit remarkable differences as it is highlighted by the top left plot displayed in Figure 2.
For the second simulation experiment, a unit length vector e lying on the direction of the second principal component of Ω is considered. The resulting MSEs obtained from both estimation methods appear reported in Table 3, with the cells containing the not available (NA) cases corresponding to situations where the ML method has failed. The reported MSEs show a slight decreasing pattern of the error with the ratio ω 2 /ω 1 and an unclear pattern with respect to ρ. Anyway, the variability of the MSE is smaller than before with the most accurate estimations obtained for the cases ρ = 0.2 and ρ = 0.7 when ω 2 = 2ω 1 (details displayed by the top right plot of Figure 2). The other patterns we can observe for the MSE values agree qualitatively with those reported by Table 2.
Finally, if we consider the direction onto an arbitrary unit length vector given by e = (0.894, 0.447) we would obtain the results shown by Table 4. As previously, we can observe the decreasing behavior of the MSE with respect to the ratio ω 2 /ω 1 and its increasing behavior against ρ. Once again, the most accurate estimations arise for the scenarios (τ, ν) = (5, 10) and (τ, ν) = (5, 5) but now when ρ = 0.2 and ω 2 = 2ω 1 (see the bottom plot of Figure 2 for the detailed outcome of this simulation scenario). Other simulations, not reported here for the sake of space, have shown that the accuracy of the estimations improves as the ratio ω 2 /ω 1 increases; just as an illustrative reference, Table 4 reports in parenthesis the MSEs for ω 2 /ω 1 = 3 when (τ, ν) = (5, 5) and n = 200.
In summary, the simulations show that the ML method (η 2 ) is more accurate than the method based on the third cumulant matrix (η 1 ). Moreover, the most remarkable differences are observed as we depart from normality via asymmetry and tail weight deviations, as assessed by the parameters of the ST model.

Simulation Study for p > 2
In this section, we address experiments with dimensions p = 5 and p = 10. The study only considers the settings that led to the smaller MSEs in the previous bivariate case. Hence, we will analyze the sample size n = 200 and the non-normality couple (τ, ν) equal to (5, 10) and (5, 5); once again we will take ρ = 0.2, 0.7. In order to set the simulation framework, we take a first shape vector α lying on the direction of the first principal component of the scale matrix Ω and another shape vector whose components are chosen arbitrarily; on the other hand, the entries of the diagonal matrix ω are chosen either equal to 25 or unequal with values selected at random between the integers from 1 to 35. Therefore, four simulation scenarios are set as follows: • Scenario 1. The simulation experiments are determined by the following settings: p = 5, shape vector lying on the direction of the first principal component of the scale matrix Ω, and matrix ω such that either diag(ω) = (25,25,25,25,25) or diag(ω) = (3, 18,25,13,13), with the aforementioned values for (τ, ν) and ρ. • Scenario 2. It is determined by the settings from the previous scenario but with the shape vector lying on the direction α = (1, 1/2, 1, 1/2, 1). • Scenario 3. The simulation experiments are determined using the following settings: p = 10, shape vector lying on the direction of the first principal component of the scale matrix Ω, and either equal diagonal elements of the matrix ω given by diag(ω) = (25, 25, . . . , 25) or unequal diagonal elements given by diag(ω) = (17, 10,12,33,3,9,5,30,3,16), with the aforementioned values for (τ, ν) and ρ. • Scenario 4. It uses the same settings of scenario 3 but now the shape vector lies on the direction of α = (1, 1/2, 1, 1/2, 1, 1/2, 1, 1/2, 1, 1/2). Table 5 summarizes the accuracy of the estimationsη 1 andη 2 for the simulation experiments settled in the previous scenarios.
The errors in scenario 1 show a similar behavior as in the bivariate case for both estimation methods but now higher MSEs are obtained. Overall, the higher errors are observed for the larger ρ and when we take unequal ω i , with better MSE outcomes forη 2 ; additionally, for the heavier tail weight ν = 5 the MSEs ofη 1 estimation increase while the MSE values ofη 2 decrease slightly. The results from scenario 2, with an arbitrary shape vector, are similar to those obtained for the bidimensional case; once again, we can see a change in the behavior of the MSE with respect to the structure of the diagonal matrix ω (equal versus unequal ω i ). On the other hand, as expected, the results deteriorate when p = 10 as observed in scenarios 3 and 4: The most remarkable finding about the results in these scenarios is the high outcomes of the MSE when they are compared with the highest achievable value: MSE = 2. Once again, we come to a similar performance of both estimation methods as in the previous scenarios, but nowη 2 does not outperformη 1 in all the cases, perhaps due to the impact the higher dimension, p = 10, has on the maximum likelihood estimation. However, such differences are not so obvious in scenario 4 whose MSE outcomes still highlight the aforementioned general behavioral pattern.

Data Collection
In this section, we return to the genomic study introduced in Section 2. The study collected the expression measures of 13,146 genes corresponding to 579 individuals diagnosed with a TNBC tumor; the data set is available at the Gene Expression Omnibus (GEO) repository and can be accessed through GSE31519. An amount of 85 patients who received neoadjuvant chemotherapy is removed from the analysis so that we end up with a data set containing 13,146 gene expression measures for 494 TNBC tumors samples which, after data cleaning and the retention of genes with the highest variability, gets reduced to a data set with 1998 genes and 494 TNBC samples as described in Section 2.

Application of Skewness-Kurtosis Projection Pursuit
Recent works that aim to summarize the biological underpinning of associations in genomic data have proved the usefulness of probabilistic graphical modeling (PGM) to construct association networks that reveal insights about the underlying functional biological structure responsible for the observed gene expression levels [37,38]. When applied to this genomic study, PGM gave an association network unraveling the existence of 26 gene nodes which correspond to well-defined functional biological groups as described by gene ontology [38]. Interestingly, these functional nodes are related to 15 metagenes previously described by Rody [39]; this fact deserves the construction of metanodes by grouping similar nodes within the graphical model. Table 6 shows the correspondence between Rody's metagenes and the representative genes from the metanodes described in [38]; note that Hemoglobin and VEGF Rody's metagenes are excluded because they contain just a single gene from those described in [38] and here the focus is on multivariate gene expression measures. It is well known that when gene expression measures fit the multivariate normal model, then first and second order moments will suffice to handle data variability; hence, the first component of a principal component analysis (PCA) could be a natural choice to summarize the multivariate expression measures for the genes belonging to each functional group of Table 6. However, when their expression measures exhibit departures from normality, as occurs in this case, higher-order moments will capture the variability more properly; hence, we argue that the maximal non-normality projection, based on skewness-kurtosis maximization, may be a better approach to summarizing multivariate gene expression measures in such a case. The maximal non-normality projections for each functional group in Table 6 are computed using the estimationsη 1 andη 2 ; so we will be applying a kind of gene feature engineering.
Both approaches, using PCA and skewness-kurtosis PP, provide a list with new gene features that summarize the multivariate expression measures on the basis of the prior biological functional knowledge. The derived gene features can be used as inputs for additional exploratory analysis using methods such as multidimensional scaling (MDS) which serves to represent and visualize the multivariate expression measures in a 2D coordinate system; its output gives the representation displayed by Figure 3 which clearly shows differential TNBC patterns, with an outstanding shape when the MLE method is applied.

Discussion and Interpretation of Results
To elucidate whether there may exist hidden groups in data, which may throw clues and insights about the genetic heterogeneity of TNBC patients, Gaussian mixture modeling with the BIC criterion for model selection is carried out [40,41]. The BIC criterion led to a four group model for the skewness-kurtosis PP gene features, while it resulted in three groups when PCA is used to summarize the gene expression measures. For the sake of comparison, model-based clustering with three groups for the skewness-kurtosis gene features is finally considered, with a small acceptable loss in the BIC, as provided by the model selection capabilities of the mclust R package [41]. The underlying classes have been colored by the blue, red and green colors on the display of the previous MDS visualization plots (see Figure 4). It is worthwhile noting that the skewness-kurtosis MLE projection seems to highlight a better defined class structure in data as shown in the middle plot of Figure 4. Additional biological interpretation about the subgroups derived from the new MLE skewness-kurtosis PP gene features can be obtained using an exploratory classification tree approach to ascertain whether the resulting groups can be fully profiled through rules determined by different expression levels from the new skewness-kurtosis metagenes. The conditional inference tree method is a standard and widely used approach to achieve this goal [42,43]; an easy to use algorithm implementing the approach is provided by the partykit R package [44]. When applied to the MLE data projections, using the resulting groups as the class label for the output variable of the tree model, we obtain the tree structure displayed by Figure 5 which in turn provides a set of rules that characterize the underlying groups; it also contributes to their interpretation in terms of thresholds that highlight different over-expression conditions, shedding a flash of light in the study of the heterogeneity of TNBC patients.  It is worth noting the following revealing findings: there are two well-defined homogeneous subtypes; the first one corresponds the red group at terminal nodes 8 and 9 of the tree, the other one is the blue group which mostly appears at its terminal node 4. Thus, the first TNBC subtype would be characterized by an Apocrine over-expression as defined by the 8.407 cutoff of the MLE Apocrine metagene; whereas, the second subtype would be characterized by an absence of over-expression in the Apocrine, Claudin and T.Cell skewness-kurtosis MLE metagenes, which is determined by expression levels under the cutoffs 8.407, 3.074 and 2.465 respectively. Regarding the third TNBC subtype (green color), it can be observed that it appears mostly at the terminal nodes 5 and 6 of the tree; this finding is consistent with its heterogeneity as previously highlighted by the middle MDS plot of Figure 4. Please note that this TNBC subtype can be profiled by the absence of Apocrine over-expression and either a Claudin over-expression, associated with expression levels greater than the 3.074 cutoff, or a T.Cell over-expression, associated with expression levels greater than the 2.465 cutoff.

Concluding Remarks
This work has explored the projection pursuit problem within the framework of analyzing and summarizing gene expression data. The multivariate ST distribution arises as a flexible model for tackling the non-normality of this type of data since it can handle multivariate skewness and tail weight behavior simultaneously. In addition, projection pursuit has theoretical appealing implications when standard third and fourth moment skewness and kurtosis measures are employed as projection indices provided that the underlying model follows a multivariate ST distribution. Our theoretical findings have shown that the maximal skewness-kurtosis projection lies on the direction of the shape vector the ST distribution. As a result, two estimation methods, based on the empirical cumulant matrix (Method 1) and on the maximum likelihood approach (Method 2), have been proposed for computing such non-normality projection; their performance is evaluated through a simulation study whose outcomes show the superiority of the ML method, especially in a low-dimensional framework.
When applied to gene expression data from TNBC patients, the resulting projection pursuit directions define new gene features which contribute to reveal outstanding biological insights about the genomic heterogeneity of this type of breast cancer. More precisely, the maximal skewness-kurtosis projections help to unravel meaningful TNBC subtypes when the MLE estimation method is applied in combination with prior biological knowledge. The new skewness-kurtosis MLE gene features helped to identify three TNBC subtypes which are expected to guide pathologists, oncologists and biochemists to decipher the heterogeneity of TNBC tumors and to progress in the clinical practice accordingly.
A limitation of the skewness-kurtosis model-based projection pursuit approach is concerned with its poor performance as the dimension increases; this limitation would merit to investigate how sparse projection pursuit [45] or the graphical lasso approach to estimate the precision matrix [46][47][48] can be adapted and applied within this framework. Finally, from a theoretical standpoint, the extension of the results derived in this work to other flexible parametric families such as scale mixtures of skew-normal distributions [49] or generalized skew-normal distributions [8] may deserve further investigation; another problem for future research would lie in investigating whether it could be established a connection between previous work on multivariate skewness and kurtosis convex transform orderings [50][51][52] and the skewness-kurtosis PP problem.  γ Σ −1 γ = η Ωη/ 1 + η Ωη E(S 2 ) − 2 π E 2 (S) η Ωη