Cumulative Median Estimation for Sufficient Dimension Reduction

Abstract: In this paper, we present the Cumulative Median Estimation (CUMed) algorithm for robust sufficient dimension reduction. Compared with non-robust competitors, the algorithm performs better when outliers are present in the data and comparably when they are not. This is demonstrated in simulated and real-data experiments.


Introduction
Sufficient Dimension Reduction (SDR) is a framework of supervised dimension reduction algorithms that have been proposed mainly for regression and classification settings. The main objective is to reduce the dimension of the p-dimensional predictor vector X without losing information on the conditional distribution of Y|X. In other words, we can state the objective as the effort to estimate a p × d matrix β, with d < p, such that the following conditional independence model holds: Y ⊥ X | β^T X. The space spanned by the columns of β is called a Dimension Reduction Subspace (DRS). There are many β's that satisfy the conditional independence model above, and therefore there are many DRSs. The objective is to find the matrix β which achieves the minimum d. The space spanned by such a β is called the Central Dimension Reduction Subspace (CDRS), or simply the Central Space (CS). The CS does not always exist, but when it exists it is unique. The conditions for the existence of the CS are mild (see [1]), and we consider that these are met in the rest of this work. For a more comprehensive overview of the SDR literature, the interested reader is referred to [2]. The first approach in the SDR framework was Sliced Inverse Regression (SIR), introduced by [3], which used inverse means to achieve efficient feature extraction. A number of other methods build on the idea of inverse moments, like Sliced Average Variance Estimation (SAVE) by [4], which uses the inverse variance; Directional Regression (DR) by [5], which uses both the inverse mean and variance; and Sliced Inverse Mean Difference (SIMD) by [6], which uses the inverse mean difference. A common aspect of these methods is that one needs to define the number of slices as a tuning parameter. To avoid this, [7] proposed the Cumulative Mean Estimation (CUME), which uses the first moment and removes the need to tune the number of slices.
More recently, a number of methods were introduced for robust sufficient dimension reduction. [8] proposed Robust SIR, which uses the L1 median to achieve sufficient dimension reduction, and [9] proposed the use of the Tukey and Oja medians to robustify SIR. Similarly, ref. [10] proposed Sliced Inverse Median Difference (SIMeD), the robust version of SIMD based on the L1 median. The main reason for using the L1 median is the fact that it is uniquely defined for p ≥ 2. The Tukey and Oja medians, on the other hand, may not be uniquely defined, but they are affine equivariant. In this paper, we investigate the use of the L1 median to robustify CUME. The new method is called Cumulative Median Estimation (CUMed). As CUMed is the robust version of CUME, it retains all the advantages of CUME, with the additional advantage that it is robust to the presence of outliers. The rest of the paper is organised as follows. In Section 2 we discuss the previous methods in more detail, and we introduce the new method in Section 3. We present numerical studies in Section 4, and we close with a short discussion.

Literature Review
In this section, we discuss some of the existing methods in the SDR framework that most closely relate to our proposal. We discuss Sliced Inverse Regression (SIR), Cumulative Mean Estimation (CUME), and Sliced Inverse Median (SIME). We first introduce some standard notation. Let X be the p-dimensional predictor vector, Y be the response variable (which we assume to be univariate without loss of generality), Σ = var(X) be the covariance matrix, and Z = Σ^{-1/2}(X − E(X)) be the standardized predictors.

Sliced Inverse Regression (SIR)
SIR was introduced by [3] and is considered the first method in the SDR framework. SIR depends on the linearity assumption (or linear conditional mean assumption), which states that for any β ∈ R^p, the conditional expectation E(X|β^T X) is a linear function of β^T X. The author proposed to standardize the predictors and then use the inverse mean E(Z|Y). By performing an eigenvalue decomposition of the variance-covariance matrix var(E(Z|Y)), one finds the directions which span the CS, S_{Y|Z}. One can then use the invariance principle [11] to find the directions that span S_{Y|X}.
A key element in SIR is the use of inverse conditional means, that is, E(Z|Y). To compute these in a sample setting, one has to discretize Y, that is, to bin the observations into a number of intervals (denoted I_1, ..., I_H, where H is the number of intervals). Therefore, the inverse mean E(Z|Y) is, in practice, replaced by E(Z|Y ∈ I_i) for i = 1, ..., H, and thus it is the eigenvalue decomposition of var(E(Z|Y ∈ I_i)) which is used to find the directions which span the CS, S_{Y|Z}.
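As an illustration of the slicing idea, the following is a minimal sketch of the sample version of SIR, using the plain sample covariance for standardization and equal-frequency slices; the function and parameter names are our own, not from the original papers.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Sketch of Sliced Inverse Regression (SIR).

    Standardizes X, bins observations into slices of the sorted
    response, averages Z within each slice, and eigendecomposes the
    weighted covariance of the slice means.
    """
    n, p = X.shape
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ cov_inv_sqrt
    # Equal-frequency slices I_1, ..., I_H over the sorted response
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)               # inverse mean E(Z | Y in I_i)
        M += (len(idx) / n) * np.outer(m, m)  # var(E(Z|Y)), since E(Z) = 0
    _, V = np.linalg.eigh(M)
    # Map the leading eigenvectors back to the X scale (invariance principle)
    return cov_inv_sqrt @ V[:, ::-1][:, :d]
```

With a single linear direction, the leading estimated vector aligns with the true direction up to sign.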

Sliced Inverse Median (SIME)
A number of proposals exist in the literature to robustify SIR (see, for example, refs. [8,9]). We discuss here the work by [8], which uses the inverse L1 median instead of inverse means to achieve this. The authors preferred the L1 median due to its uniqueness in cases where p ≥ 2. In their work, they defined the inverse L1 median as follows. Definition 1. The inverse L1 median is m_X = arg min_{µ ∈ R^p} E(‖X − µ‖ | Y). Similarly, they denote m_Z = arg min_{µ ∈ R^p} E(‖Z − µ‖ | Y). To estimate the CS S_{Y|Z}, they performed an eigenvalue decomposition of the covariance matrix var(m_Z), in the same manner that SIR works with the decomposition of the covariance matrix of the inverse means.

Cumulative Mean Estimation (CUME)
As we described above, to use SIR one needs to define the number of slices. To avoid this, [7] proposed the Cumulative Mean Estimation (CUME) algorithm, which removes the need to tune the number of slices by using the cumulative mean idea. CUME uses the eigenvalue decomposition of the matrix var(E(Z I(Y ≤ y))), where I(·) is the indicator function, to find the directions which span the CS, S_{Y|Z}. To achieve this, one increases the value of y and recalculates the mean each time y becomes large enough for a new observation to be included in the set {Y ≤ y}.
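The cumulative idea can be sketched at the sample level as follows: running sums over the sorted response give the cumulative means at every observed y, so no slicing parameter H is needed. The names and the use of the plain sample covariance are our own simplifications.

```python
import numpy as np

def cume_directions(X, y, d=1):
    """Sketch of Cumulative Mean Estimation (CUME).

    Accumulates the sample version of E(Z I(Y <= y)) at each observed
    y-value and eigendecomposes the average of the outer products.
    """
    n, p = X.shape
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ cov_inv_sqrt
    csum = np.cumsum(Z[np.argsort(y)], axis=0)  # running sums over sorted Y
    M = np.zeros((p, p))
    for k in range(n):
        m = csum[k] / n                     # sample E(Z I(Y <= y_(k+1)))
        M += np.outer(m, m) / n
    _, V = np.linalg.eigh(M)
    return cov_inv_sqrt @ V[:, ::-1][:, :d]
```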
Like SIR, SIME needs the number of slices to be tuned as well. To our knowledge, there is no other proposal in the literature that removes the need to tune the number of slices while simultaneously being robust to outliers. Therefore, in the next section we present our proposal to robustify CUME using the cumulative L1 median.

Cumulative Median Estimation
In this section, we introduce our new robust method. Like [8] and [10], we use the L1 median as the basis of our proposed method, due to the fact that it is unique in higher dimensions. To be more precise, ref. [8] used the inverse L1 median as stated in Definition 1, and [10] used the inverse median difference. In this work, we will be using the cumulative L1 median, as it is defined below: the cumulative L1 median is given by m(y) = arg min_{µ ∈ R^p} E(‖X − µ‖ I(Y ≤ y)), where ‖·‖ is the Euclidean norm and I(·) is the indicator function.
Before we prove the basic theorem of this work, we introduce some notation. Let (Y_i, X_i), i = 1, ..., n, be the data pairs, where the response is considered univariate. Let also Z_i denote the standardized value of X_i. It is standard in the SDR literature to work with the regression of Y_i on Z_i and then use the invariance principle (see [11]) to find the CS, S_{Y|X}. In the following theorem, we demonstrate that one can use the cumulative L1 median to estimate the CS. Theorem 1. Let K_CUMed be the kernel matrix of our method, as defined in (4). Suppose Z has an elliptical distribution with p ≥ 2. Then, span(K_CUMed) ⊆ S_{Y|Z}.
Proof. First of all, note that, since the kernel matrix K_CUMed is constructed from the cumulative L1 medians m(y), it suffices to show that m(y) ∈ S_{Y|Z} for every y. Now suppose that S_{Y|Z} has an orthonormal basis A = {η_1, ..., η_d}, and let {η_{d+1}, ..., η_p} be an orthonormal basis for S_{Y|Z}^⊥, the orthogonal complement of S_{Y|Z}. Then any t ∈ R^p can be written as t = Σ_{j=1}^p (η_j^T t) η_j. For t ∈ R^p and i = d+1, ..., p, consider the linear transformation φ_i : R^p → R^p such that η_j^T φ_i(t) = η_j^T t for j ≠ i and η_i^T φ_i(t) = −η_i^T t. Using this notation, together with (2) and (3), which Dong et al. (2015) established for elliptical Z, one can show that φ_i(m(y)) is a minimizer of E(‖Z − µ‖ I(Y ≤ y)). By definition, m(y) is also a minimizer of E(‖Z − µ‖ I(Y ≤ y)). By the uniqueness of the L1 median, m(y) = φ_i(m(y)), and therefore η_i^T m(y) = −η_i^T m(y), which means η_i^T m(y) = 0 for i = d+1, ..., p. From the decomposition at the beginning of the proof, m(y) ∈ (S_{Y|Z}^⊥)^⊥ = S_{Y|Z}, and therefore span(K_CUMed) ⊆ S_{Y|Z}.

Estimation Procedure
In this section, we present the algorithm to estimate the CS using the cumulative L1 median.

1.
Standardise the data to find Ẑ = Σ̂^{-1/2}(X − μ̂), where μ̂ is the sample mean of the X_i and Σ̂ is an estimate of the covariance matrix of the X_i.

2.
Sort Y and then, for each value of y, find the cumulative L1 median m̂(y) = arg min_{µ ∈ R^p} Σ_{i=1}^n ‖Ẑ_i − µ‖ I(Y_i ≤ y).

3.
Using the m̂(y)'s from the previous step, estimate the candidate matrix by K̂_CUMed = Σ_{y ∈ Ω_Y} p̂_y m̂(y) m̂(y)^T, where Ω_Y is the range of Y and p̂_y is the proportion of points used to find m̂(y).

4.
Calculate the eigenvectors β̂_1, ..., β̂_d which correspond to the largest eigenvalues of K̂_CUMed, and estimate S_{Y|X} with span(Σ̂^{-1/2} β̂_1, ..., Σ̂^{-1/2} β̂_d). Note that, in the first step, it is more appropriate to use a robust estimator for Σ, and in this work we propose the use of the minimum covariance determinant (MCD) estimator, which is implemented in the function covMcd in the package robustbase in R (see [12]). Other alternative estimators, such as RFCH and RMVN ([13,14]) or the forward search estimator ([15]), could have been used. The same estimator is used in the last step as well to recover S_{Y|X}.
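The four steps above can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: it uses the plain sample covariance for standardization (the paper recommends a robust estimator such as MCD, available in R but not in NumPy), and the `step` parameter, which thins the grid of y-values for speed, is a convenience of this sketch.

```python
import numpy as np

def l1_median(X, tol=1e-7, max_iter=200):
    """L1 median of the rows of X (Weiszfeld's iteration)."""
    mu = X.mean(axis=0)
    for _ in range(max_iter):
        dist = np.maximum(np.linalg.norm(X - mu, axis=1), 1e-12)
        w = 1.0 / dist
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu_new

def cumed_directions(X, y, d=1, step=5):
    """Sketch of the CUMed estimation procedure (Steps 1-4)."""
    n, p = X.shape
    cov = np.cov(X, rowvar=False)                 # Step 1: standardise
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Zs = ((X - X.mean(axis=0)) @ cov_inv_sqrt)[np.argsort(y)]
    K = np.zeros((p, p))
    for k in range(2, n + 1, step):               # Step 2: sort Y, then take
        m = l1_median(Zs[:k])                     # the cumulative L1 median
        K += (k / n) * np.outer(m, m)             # Step 3: weight by p_y
    _, V = np.linalg.eigh(K)                      # Step 4: eigenvectors,
    return cov_inv_sqrt @ V[:, ::-1][:, :d]       # mapped back to X scale
```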

Numerical Results
In this section, we discuss the numerical performance of the algorithm in both simulated and real dataset settings.

Simulated Datasets
We compare the performance of our algorithm with algorithms based on inverse means, like SIR and CUME, as well as with SIME, the robust version of SIR proposed by [8], which, similarly to our method, uses the L1 median. We ran a number of simulation experiments using several models, including the following:
Model 5: Y = sin(X_1 + σ ε) for p = 5, 10, or 20 and ε ∼ N(0, 1). Note that σ = 0.2. For SIR and SIME we used 10 slices. Note also that the predictors X = (X_1, ..., X_p) come either from a multivariate standard normal distribution, which produces no outliers, or from a multivariate standard Cauchy distribution, which produces outliers. We denote them in the result tables as follows:
• (Outl a): X is from a multivariate standard normal.
• (Outl b): X is from a multivariate standard Cauchy.
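The two predictor scenarios for Model 5 can be generated as follows; the function name and seed handling are our own, for illustration.

```python
import numpy as np

def make_model5(n=200, p=10, outliers=False, seed=0):
    """Generate data from Model 5: Y = sin(X_1 + sigma * eps), sigma = 0.2.

    outliers=False draws X from a multivariate standard normal (Outl a);
    outliers=True draws X from a standard Cauchy (Outl b), whose heavy
    tails produce gross outliers.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_cauchy((n, p)) if outliers else rng.normal(size=(n, p))
    y = np.sin(X[:, 0] + 0.2 * rng.normal(size=n))
    return X, y
```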
We ran 100 simulations and report the average trace correlation and its standard error to compare performance. The trace correlation is calculated as r = (tr(P_B P_B̂)/d)^{1/2}, where P_B is the projection matrix onto the true subspace and P_B̂ the projection matrix onto the estimated subspace. It takes values between 0 and 1, and the closer to 1, the better the estimation.
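The trace correlation can be computed directly from two basis matrices; a short sketch, assuming the common definition r = (tr(P_B P_B̂)/d)^{1/2} and with a function name of our own:

```python
import numpy as np

def trace_correlation(B, B_hat):
    """Trace correlation between span(B) and span(B_hat).

    Computed as sqrt(tr(P_B P_Bhat) / d) with P_A = A (A'A)^{-1} A';
    equals 1 for identical subspaces and 0 for orthogonal ones.
    """
    def proj(A):
        return A @ np.linalg.solve(A.T @ A, A.T)
    d = B.shape[1]
    return float(np.sqrt(np.trace(proj(B) @ proj(B_hat)) / d))
```

Note that only the spans matter: rescaling a basis matrix leaves the value unchanged.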
The results in Table 1 demonstrate the advantages of our algorithm. Our method, CUMed, performs well when there are no outliers (comparable performance with CUME) and is better than CUME when there are outliers. Compared to SIME (the robust version of SIR), our method performs better, with the exception of Model 2, where SIME performs slightly better. We ran a more extensive simulation and observed that when d > 1, SIME tends to work better than CUMed.
Table 1. Mean and standard errors (in parentheses) of the trace correlation (4) for 100 replications, for all models and different values of p. We use n = 200 and H = 10 slices in SIR and SIME.
In the second experiment, we illustrate the importance of using the L1 median by comparing CUMed based on the L1 median with CUMed based on the Oja median. Here, we emphasize that the Oja median is not unique and is also computationally more expensive to calculate. We ran the experiment for n = 200 and p = 10 under both scenarios, with and without outliers. As we can see from Table 2, the L1 median behaves better for all models and both scenarios.

Real Data: Concrete Data
In this section, we compare the SIR, SIME, and CUME methods against our CUMed method on a real dataset. We use the Concrete strength dataset (see [16]), which has eight predictors (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age) and a response variable (Concrete Strength), with 1030 observations. To highlight our method's ability to estimate the central subspace in the presence of outliers, we ran 100 experiments in which we selected 30 data points at random and multiplied them by 10 (thereby creating about 3% outliers in the dataset). We then computed the trace correlation between the projection matrix estimated on the original data and the one estimated on the data containing the outliers. As shown in Figure 1, if we compare the median trace correlations, both robust methods, SIME and CUMed, tend to stay much closer to the original projection (the one before introducing outliers) than SIR and CUME. CUMed is, overall, the best method, as most of its trace correlations lie above 0.8, with only a few cases falling below that. For SIME, on the other hand, the boxplot shows that Q1 is at 0.35 (as opposed to CUMed, which has Q1 at 0.9).

Discussion
In this work, we discussed a robust SDR methodology which uses the L1 median to expand the scope of a previously proposed algorithm in the SDR literature, CUME, which uses first moments. The new method retains all the advantages of CUME, as it does not require tuning of the number of slices, which earlier SDR methods, like SIR, did. In addition, it is robust to the presence of outliers, which CUME is not, as we demonstrated in our numerical experiments.
A number of papers have been introduced recently in the SDR literature to robustify against outliers. We believe that this work adds to the existing literature and opens new roads to further discussion of robustness within the SDR framework. In addition to the methods already discussed, which are based on inverse moments, there is a literature on Support Vector Machine (SVM)-based SDR; see, for example, ref. [17], who used adaptively weighted schemes for SVM-based robust estimation, as well as the literature on iterative methods, like those proposed in [18,19].