Abstract
Classification using linear discriminant analysis (LDA) is challenging when the number of variables is large relative to the number of observations. Algorithms such as LDA require the computation of the precision matrix of the feature vector. In a high-dimensional setting, the sample covariance matrix is singular, so the maximum likelihood estimator of the precision matrix cannot be computed. In this paper, we employ the Stein-type shrinkage estimator of Ledoit and Wolf for high-dimensional data classification. The efficiency of the proposed approach is compared numerically to existing methods, including LDA, cross-validation, gLasso, and SVM, using the misclassification error criterion.
Keywords:
classification; linear discriminant analysis; high-dimensional data; Ledoit and Wolf shrinkage method; Stein-type shrinkage; misclassification error
MSC:
62H30; 68T09
1. Introduction
As one of the most widely used classification techniques, linear discriminant analysis (LDA) remains attractive because of its simplicity, stability, and prediction accuracy. Consider $\boldsymbol{X}$ as a $p$-dimensional predictor vector and the response $Y$ as the class label. In LDA, it is assumed that $\boldsymbol{X} \mid Y = k \sim \mathcal{N}_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}_k$ is the class mean vector and $\boldsymbol{\Sigma}$ is the common covariance matrix; consequently, the Bayes decision rule involves the precision matrix $\boldsymbol{\Sigma}^{-1}$. In the large-sample setting, when $n > p$, the sample covariance matrix $\boldsymbol{S}$ is an unbiased estimator of $\boldsymbol{\Sigma}$. However, in a high-dimensional setting, $p > n$ or $p \approx n$, $\boldsymbol{S}$ will be singular, and the likelihood-based estimator has many weaknesses such as inaccuracy (see [1,2]).
Many studies have been conducted using factor, sparse, graphical, and shrinkage methods for estimating $\boldsymbol{\Sigma}^{-1}$. Srivastava [3] examined multivariate theory in the high-dimensional setting and used the Moore–Penrose inverse of the sample covariance matrix to solve the singularity problem of $\boldsymbol{S}$. However, when some covariance matrix values are zero or close to zero, this idea does not work well.
The idea of estimating the precision matrix using a sparse method was first proposed by Dempster [4], and later, Meinshausen and Bühlmann [5] proposed the use of least absolute shrinkage and selection operator (Lasso) regression to identify the zeros of the inverse covariance matrix. Banerjee et al. [6] performed penalized maximum likelihood estimation with the lasso penalty for sparse estimation of the inverse of the covariance matrix. Friedman et al. [7] proposed the graphical Lasso method, under the sparsity assumption on $\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}$, by using coordinate descent for the lasso penalty, through the objective function
$$\hat{\boldsymbol{\Omega}} = \arg\max_{\boldsymbol{\Omega} \succ 0} \left\{ \log \det(\boldsymbol{\Omega}) - \mathrm{tr}(\boldsymbol{S}\boldsymbol{\Omega}) - \lambda \|\boldsymbol{\Omega}\|_1 \right\},$$
where $\lambda \ge 0$ is the penalty parameter, $\mathrm{tr}(\boldsymbol{A})$ denotes the trace of matrix $\boldsymbol{A}$, and $\|\cdot\|_1$ is the norm-one operator. Bickel and Levina [8] used the hard thresholding estimator for the sparse estimation of the covariance matrix $\boldsymbol{\Sigma}$. Furthermore, Cai and Zhang [9] developed an optimality theory for LDA in the high-dimensional setting by considering a different approach to the problem of LDA. Instead of estimating $\boldsymbol{\Sigma}^{-1}$ and the mean difference separately, they proposed a data-driven and tuning-free classification rule called AdaLDA by directly estimating the discriminant direction through solving an optimization problem. As the hard thresholding estimator in regression provides inflexible estimators, Rothman et al. [10] refined a generalized thresholding rule by combining the thresholding and shrinkage methods. Bien and Tibshirani [11] generalized the estimation of a sparse covariance matrix by simultaneously estimating the nonzero covariances and the graph structure (location of zeros). Refer to Fan et al. [12] for more related studies.
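For readers who wish to experiment with the sparse-precision approach described above, the following minimal sketch (an illustration, not the authors' implementation) fits the graphical lasso of Friedman et al. [7] with scikit-learn; the penalty value 0.1 and the synthetic data are arbitrary choices for the example.

```python
# Illustrative sketch: sparse precision estimation via the graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))       # n = 50 observations, p = 20 variables

gl = GraphicalLasso(alpha=0.1)      # alpha is the lasso penalty (illustrative value)
gl.fit(X)

Omega_hat = gl.precision_           # sparse estimate of the precision matrix
off_diag = Omega_hat[np.triu_indices(20, k=1)]
print("share of (near-)zero off-diagonal entries:",
      np.mean(np.isclose(off_diag, 0.0)))
```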
Apart from sparse covariance matrix estimation, a common approach to improving the estimation of $\boldsymbol{\Sigma}$ is the use of the class of shrinkage estimators, initially proposed by James and Stein [13], in which a deliberately biased estimator is used in order to reduce the variance (see [14,15,16] for extensive reviews). Di Pillo [17] and Campbell [18] improved the estimate of $\boldsymbol{\Sigma}$ using the ridge idea. Peck and Van Ness [2] proposed another type of shrinkage estimator of $\boldsymbol{\Sigma}$ that reduces Fisher's classification error. Mkhadri [19] used the cross-validation (CV) method to estimate the shrinkage parameter for the estimation of $\boldsymbol{\Sigma}$ in the classification rule. However, Choi et al. [20] demonstrated that the CV method may not lead to a positive definite estimate in the high-dimensional case $p > n$.
In the shrinkage, graphical, and factor models, additional information is needed in the estimation process (e.g., Bickel and Levina [21]; Khare and Rajaratnam [22]; Cai and Zhou [23]), whereas this extra knowledge is not always available (Maurya [24]). Therefore, Ledoit and Wolf [25] proposed an optimal linear shrinkage estimator with optimal asymptotic properties by analyzing the eigenvalues of the covariance matrix. For estimating the inverse covariance matrix (precision matrix) when $p > n$, other studies have been conducted, such as Wang et al. [26], Hong and Kim [27], and Le et al. [28].
This paper aims to classify high-dimensional observations using LDA, where the sample covariance matrix is singular and not invertible. We apply Ledoit and Wolf's shrinkage method to estimate $\boldsymbol{\Sigma}^{-1}$ and efficiently classify new observations in a high-dimensional regime. The plan for the rest of this paper is as follows. In Section 2, the proposed methodology, along with some theory, is given. Section 3 includes extensive numerical assessments for performance analysis and compares the proposed discriminant rule with other existing methods. We conclude with the significant results in Section 4; Appendix A is allocated for the proofs.
2. Materials and Methods
In discriminant analysis, observations are classified into predetermined categories using a function called the decision function or discriminant function. In other words, discriminant analysis seeks linear or nonlinear combinations of the independent variables that best separate the groups of observations via the discriminant rule.
Consider $g$ distinct populations $\pi_1, \ldots, \pi_g$ with density functions $f_k(\boldsymbol{x})$ and prior probabilities $p_k$, $k = 1, \ldots, g$. An observation $\boldsymbol{x}$ is classified into $\pi_k$ if
$$p_k f_k(\boldsymbol{x}) \ge p_j f_j(\boldsymbol{x}) \quad \text{for all } j \ne k.$$
In the simplest case, $g = 2$ with equal prior probabilities, the rule does not depend on $p_1$ and $p_2$. In LDA, it is also assumed that $f_k$ is the $\mathcal{N}_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$ density with a common covariance matrix $\boldsymbol{\Sigma}$. According to Equation (2), $\boldsymbol{x}$ is classified into $\pi_1$ if $D(\boldsymbol{x}) \ge 0$; so, the discriminant function is obtained as follows
$$D(\boldsymbol{x}) = \left(\boldsymbol{x} - \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\right)^{\top} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),$$
where $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$ are unknown, and we estimate them from the training sample by the group mean vectors $\bar{\boldsymbol{x}}_k$ and the pooled sample covariance matrix $\boldsymbol{S}$, respectively, where
$$\bar{\boldsymbol{x}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} \boldsymbol{x}_{ki}, \qquad k = 1, 2, \quad \text{and} \quad \boldsymbol{S} = \frac{(n_1 - 1)\boldsymbol{S}_1 + (n_2 - 1)\boldsymbol{S}_2}{n_1 + n_2 - 2},$$
with $\boldsymbol{S}_k$ the sample covariance matrix of the $k$th group.
Hence, an observation $\boldsymbol{x}$ is classified into $\pi_1$ if $\hat{D}(\boldsymbol{x}) \ge 0$, and it is classified into the population $\pi_2$ otherwise. Therefore, the probability of misclassification (PMC) depends on the sample values $\bar{\boldsymbol{x}}_1$, $\bar{\boldsymbol{x}}_2$, and $\boldsymbol{S}$. If the estimate of $\boldsymbol{\Sigma}^{-1}$ is poor, the PMC will not be minimized, and in high dimensions ($p > n$), $\boldsymbol{S}^{-1}$ either cannot be calculated or is not efficient. In this case, we use the shrinkage approach.
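The following minimal sketch illustrates the plug-in two-class LDA rule just described; the function and variable names are illustrative and not taken from the paper. The comment marks exactly where the high-dimensional problem arises.

```python
# Minimal sketch of the plug-in two-class LDA rule with a pooled covariance.
import numpy as np

def lda_rule(X1, X2):
    """Return a function D(x); classify x into population 1 when D(x) >= 0."""
    n1, n2 = X1.shape[0], X2.shape[0]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    # In high dimensions (p >= n1 + n2 - 2), this inverse fails or is unstable,
    # which is exactly the problem the shrinkage approach addresses.
    S_inv = np.linalg.inv(S_pooled)

    def D(x):
        return (x - 0.5 * (xbar1 + xbar2)) @ S_inv @ (xbar1 - xbar2)

    return D
```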
2.1. Ledoit and Wolf Shrinkage Estimators
The James and Stein shrinkage estimator [13] is a convex combination of the sample covariance matrix $\boldsymbol{S}$ and a target matrix $\boldsymbol{T}$ as follows
$$\hat{\boldsymbol{\Sigma}}_{\lambda} = \lambda \boldsymbol{T} + (1 - \lambda)\boldsymbol{S},$$
where $\lambda \in [0, 1]$ is the shrinkage parameter and $\boldsymbol{T}$ is positive definite. The target matrix should be chosen to have several properties: it must be structured, positive definite, and well-conditioned, and it should represent the true covariance matrix of the application at hand. The target may be biased; however, because of its well-defined structure, it has low variance. Given that the target matrix is predetermined, the choice of the shrinkage parameter $\lambda$ is important and should be made so that the variance of the shrinkage estimator is smaller than that of $\boldsymbol{S}$. As $\lambda \to 0$, the estimator behaves like the sample covariance matrix $\boldsymbol{S}$, whereas as $\lambda \to 1$, it inherits the structure and the low variance of the target $\boldsymbol{T}$. Therefore, the value of $\lambda$ has a significant effect on the degree of the misclassification error.
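A short sketch of this convex combination is given below; it is an illustration under assumed settings (a scaled-identity target and an arbitrary shrinkage weight of 0.3), showing how the combination repairs the rank deficiency of the sample covariance matrix when $p$ greatly exceeds $n$.

```python
# Sketch of the linear (Stein-type) shrinkage combination lambda*T + (1-lambda)*S.
import numpy as np

def linear_shrinkage(S, T, lam):
    """Return lam * T + (1 - lam) * S for a shrinkage weight lam in [0, 1]."""
    assert 0.0 <= lam <= 1.0
    return lam * T + (1.0 - lam) * S

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 50))                       # n = 10 << p = 50: S is singular
S = np.cov(X, rowvar=False)
T = np.trace(S) / S.shape[0] * np.eye(S.shape[0])   # scaled-identity target (assumed choice)

print(np.linalg.matrix_rank(S))                             # at most n - 1 = 9
print(np.linalg.matrix_rank(linear_shrinkage(S, T, 0.3)))   # full rank: invertible
```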
Within the category of Stein-type shrinkage estimators, Ledoit and Wolf [25] proposed estimating the shrinkage parameter using the following result when $p > n$.
Theorem 1.
Suppose $\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n$ is a random sample from $\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and define
$$\mu = \frac{1}{p}\,\mathrm{tr}(\boldsymbol{\Sigma}), \quad \alpha^2 = \frac{1}{p}\left\|\boldsymbol{\Sigma} - \mu \boldsymbol{I}_p\right\|_F^2, \quad \beta^2 = \frac{1}{p}\,\mathrm{E}\left\|\boldsymbol{S} - \boldsymbol{\Sigma}\right\|_F^2, \quad \delta^2 = \frac{1}{p}\,\mathrm{E}\left\|\boldsymbol{S} - \mu \boldsymbol{I}_p\right\|_F^2;$$
then,
1. $\alpha^2 + \beta^2 = \delta^2$;
2. assuming the target $\boldsymbol{T} = \mu \boldsymbol{I}_p$, the optimal shrinkage parameter that minimizes the risk of $\hat{\boldsymbol{\Sigma}}_{\lambda}$ is $\lambda^{*} = \beta^2/\delta^2$, in which the unknown quantities $\mu$, $\alpha^2$, $\beta^2$, and $\delta^2$ are replaced by consistent estimators based on Srivastava [29], yielding the estimated shrinkage parameter used in Equation (5).
See Ledoit and Wolf [25] for details.
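As a practical illustration of the result above (not the authors' own code), the Ledoit–Wolf shrinkage intensity and the resulting well-conditioned covariance and precision matrices can be obtained with scikit-learn; the synthetic data dimensions below are assumptions for the example.

```python
# Sketch: data-driven Ledoit-Wolf shrinkage intensity and precision matrix.
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 100))      # only n = 10 observations of p = 100 variables

lw = LedoitWolf().fit(X)
print("estimated shrinkage parameter:", lw.shrinkage_)  # data-driven lambda
Sigma_lw = lw.covariance_   # shrinkage of S toward a scaled identity target
Omega_lw = lw.precision_    # its inverse, usable in the discriminant rule
```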
2.2. Improved Linear Discriminant Rules
Since the shrinkage parameter in Ledoit and Wolf's approach is obtained by an optimization method, in contrast to the CV method used by Mkhadri [19], it always leads to a positive definite estimate of the covariance matrix. It also does not require additional information about the explanatory variables or their independence. As a result, it has an advantage over other shrinkage estimation methods (Ledoit and Wolf [30]); therefore, the proposed method for reducing the misclassification error in discriminant analysis in the high-dimensional case leads to the following classification rule
$$\hat{D}_{LW}(\boldsymbol{x}) = \left(\boldsymbol{x} - \tfrac{1}{2}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2)\right)^{\top} \hat{\boldsymbol{\Sigma}}_{LW}^{-1} (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2),$$
where $\hat{\boldsymbol{\Sigma}}_{LW}$ can be obtained from Equation (5), and $\boldsymbol{x}$ is classified into $\pi_1$ if $\hat{D}_{LW}(\boldsymbol{x}) \ge 0$ and into $\pi_2$ otherwise.
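A minimal sketch of this improved rule is given below: the pooled covariance is replaced by a Ledoit–Wolf shrinkage estimate before inversion. The pooling-by-centering step and the names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the improved (LW) discriminant rule for two populations.
import numpy as np
from sklearn.covariance import LedoitWolf

def lw_lda_rule(X1, X2):
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pool the two groups after centering each around its own mean.
    pooled_resid = np.vstack([X1 - xbar1, X2 - xbar2])
    lw = LedoitWolf(assume_centered=True).fit(pooled_resid)
    Omega = lw.precision_                   # well-conditioned precision estimate
    direction = Omega @ (xbar1 - xbar2)
    midpoint = 0.5 * (xbar1 + xbar2)

    def classify(x):
        return 1 if (x - midpoint) @ direction >= 0 else 2

    return classify
```

A new observation x would then be assigned with `classify(x)`, mirroring the rule stated above.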
2.3. Properties of the Improved Discriminant Rule
Given that, in the discriminant problem, $D(\boldsymbol{X})$ for $\boldsymbol{X} \in \pi_1$ has a normal distribution with mean
$$\tfrac{1}{2}\Delta^2, \qquad \Delta^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),$$
and variance
$$\Delta^2,$$
we have the following results.
Lemma 1.
Under the assumptions of Section 2.2, has a p-dimensional normal distribution with mean
and variance
Proof.
Refer to Appendix A. □
Theorem 2.
Proof.
Refer to Appendix A. □
3. Numerical Studies
To assess the performance of the estimator (5) in classification, we conducted a simulation study and analyzed some real data.
3.1. Simulation Study
Data were generated from two populations $\pi_1: \mathcal{N}_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ and $\pi_2: \mathcal{N}_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$, in which $\boldsymbol{\mu}_1$ is the $p$-dimensional zero vector, $\boldsymbol{\mu}_2$ is a $p$-dimensional vector to be specified, and $\boldsymbol{\Sigma}$ is a $p \times p$ covariance matrix. When the covariance matrix is almost singular, discriminant analysis is likely to be sensitive to the choice of the mean vector; so, in this paper, two different forms for the mean vector $\boldsymbol{\mu}_2$ were selected (named Mod1 and Mod2, respectively). The values of $\boldsymbol{\mu}_2$ and $m$ were chosen so that the Mahalanobis distance, $\Delta$, was the same for each case, and several values of $\Delta$ were considered. The covariance matrix was taken as $\boldsymbol{\Sigma} = (1 - \rho)\boldsymbol{I}_p + \rho \boldsymbol{J}_p$, where $0 \le \rho < 1$, $\boldsymbol{I}_p$ is the identity matrix, and $\boldsymbol{J}_p$ is the unit matrix of dimension $p$. In order to examine the role of correlation among the explanatory variables, two values of $\rho$ were considered. For each population and different combinations of $\Delta$, $\rho$, and $p$, we generated 10 $p$-dimensional training and 50 $p$-dimensional test data vectors. We chose the target matrix as a multiple of the identity, so the sample covariance matrix shrank toward the identity matrix; this target adds no variance to the shrinkage estimator. The simulation was repeated 1000 times, and the performance of the proposed methodology (LW) was compared with linear discriminant analysis (LDA), the Mkhadri method [19] (CV), the graphical lasso method (gLasso), and the support vector machine (SVM). A sketch of one such replication is given below.
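The following sketch reproduces one replication of a simulation in the spirit of this design. The exact mean structures (Mod1/Mod2) and the two $\rho$ values are not fully specified here, so the concrete choices in the code (a constant mean shift, $\rho = 0.5$) are illustrative assumptions.

```python
# Sketch of one simulation replication: compound-symmetric covariance,
# small training samples, LW-based discriminant rule, test misclassification rate.
import numpy as np
from sklearn.covariance import LedoitWolf

def misclassification_rate(p=50, n_train=10, n_test=50, rho=0.5, shift=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # compound symmetry
    mu1, mu2 = np.zeros(p), np.full(p, shift)               # assumed mean structure
    X1 = rng.multivariate_normal(mu1, Sigma, size=n_train)
    X2 = rng.multivariate_normal(mu2, Sigma, size=n_train)
    T1 = rng.multivariate_normal(mu1, Sigma, size=n_test)
    T2 = rng.multivariate_normal(mu2, Sigma, size=n_test)

    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    lw = LedoitWolf(assume_centered=True).fit(np.vstack([X1 - xbar1, X2 - xbar2]))
    w = lw.precision_ @ (xbar1 - xbar2)
    mid = 0.5 * (xbar1 + xbar2)

    err1 = np.mean((T1 - mid) @ w < 0)     # population-1 test points misclassified
    err2 = np.mean((T2 - mid) @ w >= 0)    # population-2 test points misclassified
    return 0.5 * (err1 + err2)

print(misclassification_rate())
```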
The results for each case, i.e., Mod1 and Mod2, are summarized in Table 1 and Table 2. The column 'Test' shows the average misclassification error on the test data for each parameter setting. In addition, these tables report the mean value of the shrinkage parameter for each discriminant rule. The quantities in parentheses are the standard deviations of the respective means. Bold values are the smallest among all, indicating the best method.
Table 1.
Misclassification error and shrinkage parameter values for Mod1.
Table 2.
Misclassification error and shrinkage parameter values for Mod2.
The improvement due to the shrinkage algorithm strongly depends on the Mahalanobis distance between the two populations. When the Euclidean distance between the means is small, the estimation error caused by a poor estimate of $\boldsymbol{\Sigma}^{-1}$ is very damaging to the classification. As the Euclidean distance increases, the means move further apart, and the quality of this estimate has a smaller relative effect on the classification.
According to Table 1, as the Mahalanobis distance increased, the misclassification error decreased. For each setting, the shrinkage method LW had a lower classification error than the LDA, CV, gLasso, and SVM methods. This means that the Ledoit and Wolf method performed better in assigning new observations to the populations.
On the other hand, according to Table 2, under the Mod2 strategy, the conclusions drawn for Mod1 remained valid; thus, even when all the entries of the mean vector were changed, the proposed method retained better efficiency and performance than the competing methods. Figure 1 displays the results reported in Table 1 and Table 2.
Figure 1.
Misclassification error by increasing the Mahalanobis distance. (a) Mod1, ; (b) Mod1, ; (c) Mod2, ; (d) Mod2, .
Figure 2 depicts the misclassification error for varying sample sizes. Not surprisingly, as n increased, the misclassification error decreased for all the methods in this study; however, none of the methods achieved a smaller error than the LW method at any sample size. Moreover, as the Mahalanobis distance increased, it became easier to identify which population a new observation belonged to, which further improved the classification of the proposed method.
Figure 2.
Misclassification error by increasing sample size and . (a) Mod1, ; (b) Mod1, ; (c) Mod1, .
To evaluate the efficiency of the LW method for larger dimensions, additional simulations were performed with increasing values of $p$. These results are summarized in Table 3. As the dimension increased, the classical linear discriminant analysis and the cross-validation method of Mkhadri [19] could not be used due to the singularity of the sample covariance matrix. As the Mahalanobis distance increased, the misclassification error decreased, and for each setting, the LW shrinkage method was significantly better than the other methods.
Table 3.
Misclassification error and shrinkage parameter values for larger values of p in the Mod1 setting.
3.2. Real Data Analyses
In this section, we assess the performance of the five methods in classification for the datasets in Table 4.
Table 4.
Datasets (Accessed on 30 November 2022).
As shown in Table 5, the LW shrinkage method was comparatively superior for classification (marked in bold). For Data 4, the execution time of the gLasso method on our system was more than 10 h, whereas the LW method required less than 5 min, a sign of the rapidity of the shrinkage method in the classification of high-dimensional data.
Table 5.
Misclassification error for the real datasets in Table 4.
Figure 3 depicts the average misclassification errors for the five methods discussed in the paper.
Figure 3.
Misclassification error for the four real datasets.
4. Discussion and Conclusions
In the present paper, we aimed to improve the efficiency of the LDA classification method in high-dimensional situations, where the singularity of the sample covariance matrix prevents direct estimation of the precision matrix. We proposed estimating the precision matrix with the shrinkage approach of Ledoit and Wolf (LW) [25], in which the sample covariance matrix and a target matrix are linearly combined through a penalty factor in the LDA classifier.
The implementation of the suggested method on simulated data showed that, across different scenarios, the proposed approach was superior to two powerful competitors, gLasso and SVM, in terms of misclassification error. By increasing the Mahalanobis distance and considering both mean structures (Mod1 and Mod2), we became more confident that the LW method can be considered an alternative to some well-known methods for better classification.
In the real data analyses, three results were noteworthy. First, the analysis of Data1 and Data2 showed that LW was as good as, or much better than, the other competitors. Second, the misclassification results for Data3 showed that our method was more reliable than the others in high-dimensional regimes. Third, the LW method had the same misclassification error as the others for Data4; however, its execution time was much shorter. In addition to high accuracy, the LW method is therefore strongly recommended from the computational burden point of view, especially in a high-dimensional setting, where computing the precision matrix estimate is challenging.
Lastly, if the covariance matrices of the populations are not the same ($\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$), quadratic discriminant analysis (QDA) can be used instead of LDA. Friedman [31] proposed regularized discriminant analysis (RDA), in which the sample covariance matrix is shrunk twice: first toward the pooled covariance matrix and then, via the trace estimator, toward a scaled identity. Since the trace estimator pools only the diagonal elements of the sample covariance matrices and ignores the off-diagonal elements, Wu et al. [32] introduced the ppQDA estimator, which pools all elements of the covariance matrix and does not need sparsity assumptions. Although the ppQDA method has good asymptotic properties, it may not perform well for data classification. Therefore, considering (5) together with Friedman's [31] regularization
$$\hat{\boldsymbol{\Sigma}}_k(\lambda, \gamma) = (1 - \gamma)\,\hat{\boldsymbol{\Sigma}}_k(\lambda) + \gamma\,\frac{\mathrm{tr}\left(\hat{\boldsymbol{\Sigma}}_k(\lambda)\right)}{p}\,\boldsymbol{I}_p,$$
where $\lambda, \gamma \in [0, 1]$ are tuning parameters and $\hat{\boldsymbol{\Sigma}}_k(\lambda)$ is the class-$k$ covariance matrix shrunk toward the pooled covariance, it seems that by using the shrinkage method presented in this article, the estimation of $\boldsymbol{\Sigma}_k$ can be improved, making it possible to reduce the misclassification error of high-dimensional data in the quadratic discriminant analysis mode as well; a sketch of this idea is given below.
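The sketch below writes Friedman-style double shrinkage as a plain function; the exact weighting of the first shrinkage step and the idea of choosing the second tuning parameter in a Ledoit–Wolf fashion are assumptions of this illustration, not results established in the paper.

```python
# Sketch of an RDA-type double shrinkage of the class-k covariance for QDA.
import numpy as np

def rda_covariance(S_k, S_pooled, n_k, n, p, lam, gam):
    """Regularized class-k covariance with tuning parameters lam, gam in [0, 1]."""
    # First shrinkage: pull the class covariance toward the pooled covariance.
    S_lam = ((1 - lam) * (n_k - 1) * S_k + lam * (n - 2) * S_pooled) / (
        (1 - lam) * (n_k - 1) + lam * (n - 2))
    # Second shrinkage: pull toward a scaled identity based on the trace.
    return (1 - gam) * S_lam + gam * (np.trace(S_lam) / p) * np.eye(p)
```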
Author Contributions
Conceptualization, R.L., D.S. and M.A.; Funding acquisition, M.A.; Methodology, R.L., D.S. and M.A.; Software, R.L.; Supervision, D.S. and M.A.; Visualization, R.L., D.S. and M.A.; Formal analysis, R.L., D.S. and M.A.; Writing—original draft preparation, R.L.; Writing—review and editing, R.L., D.S. and M.A. All authors have read and agreed to the published version of the manuscript.
Funding
This work was based upon research supported in part by the National Research Foundation (NRF) of South Africa, SARChI Research Chair UID: 71199, the South African DST-NRF-MRC SARChI Research Chair in Biostatistics (Grant No. 114613), and STATOMET at the Department of Statistics at the University of Pretoria, South Africa. The third author’s research (M. Arashi) is supported by a grant from Ferdowsi University of Mashhad (N.2/58266). The opinions expressed and conclusions arrived at are those of the authors and are not necessarily to be attributed to the NRF.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data are publicly available.
Acknowledgments
The authors would like to sincerely thank the anonymous reviewers for their constructive comments, which helped to improve the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Here, we give the sketch of the proofs of Lemma 1 and Theorem 2.
Proof of Lemma 1.
According to (3), replacing $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}^{-1}$ with $\bar{\boldsymbol{x}}_k$ and $\hat{\boldsymbol{\Sigma}}_{LW}^{-1}$, the decision function is given by
$$\hat{D}_{LW}(\boldsymbol{x}) = \left(\boldsymbol{x} - \tfrac{1}{2}(\bar{\boldsymbol{x}}_1 + \bar{\boldsymbol{x}}_2)\right)^{\top} \hat{\boldsymbol{\Sigma}}_{LW}^{-1} (\bar{\boldsymbol{x}}_1 - \bar{\boldsymbol{x}}_2).$$
On the other hand, since the new observation is independent of $\bar{\boldsymbol{x}}_1$, $\bar{\boldsymbol{x}}_2$, and $\hat{\boldsymbol{\Sigma}}_{LW}$, the conditional distribution of the decision function given these quantities has the stated mean and variance.
The proof is complete. □
Proof of Theorem 2.
Considering and in Equations (8) and (9), we have
Let , where is a nonsingular matrix, and b are considered in such a way that ; , ; ; and ; we also define the variables and as follows
Using the Taylor’s series expansion, we have
and
Using Ledoit and Wolf [25] and , we obtain
where .
Moreover,
in which . Since , we obtain , and the proof is complete. □
References
- Clemmensen, L.; Hastie, T.; Witten, D.; Ersbøll, B. Sparse discriminant analysis. Technometrics 2011, 53, 406–413.
- Peck, R.; Van Ness, J. The use of shrinkage estimators in linear discriminant analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1982, 5, 530–537.
- Srivastava, M.S. Multivariate theory for analyzing high dimensional data. J. Jpn. Stat. Soc. 2007, 37, 53–86.
- Dempster, A.P. Covariance selection. Biometrics 1972, 28, 157–175.
- Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 2006, 34, 1436–1462.
- Banerjee, O.; El Ghaoui, L.; d’Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 2008, 9, 485–516.
- Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441.
- Bickel, P.J.; Levina, E. Covariance regularization by thresholding. Ann. Stat. 2008, 36, 2577–2604.
- Cai, T.T.; Zhang, L. High dimensional linear discriminant analysis: Optimality, adaptive algorithm and missing data. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2019, 89, 675–705.
- Rothman, A.J.; Levina, E.; Zhu, J. Generalized thresholding of large covariance matrices. J. Am. Stat. Assoc. 2009, 104, 177–186.
- Bien, J.; Tibshirani, R. Sparse estimation of a covariance matrix. Biometrika 2011, 98, 807–820.
- Fan, J.; Liao, Y.; Liu, H. An overview of the estimation of large covariance and precision matrices. Econom. J. 2016, 19, C1–C32.
- Stein, C.; James, W. Estimation with quadratic loss. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20–30 June 1961; Volume 1, pp. 361–379.
- Efron, B. Biased versus unbiased estimation. Adv. Math. 1975, 16, 259–277.
- Efron, B.; Morris, C. Data analysis using Stein’s estimator and its generalizations. J. Am. Stat. Assoc. 1975, 70, 311–319.
- Efron, B.; Morris, C. Multivariate empirical Bayes and estimation of covariance matrices. Ann. Stat. 1976, 4, 22–32.
- Di Pillo, P.J. The application of bias to discriminant analysis. Commun. Stat. Theory Methods 1976, 5, 843–854.
- Campbell, N.A. Shrunken estimators in discriminant and canonical variate analysis. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1980, 29, 5–14.
- Mkhadri, A. Shrinkage parameter for the modified linear discriminant analysis. Pattern Recognit. Lett. 1995, 16, 267–275.
- Choi, Y.-G.; Lim, J.; Roy, A.; Park, J. Fixed support positive-definite modification of covariance matrix estimators via linear shrinkage. J. Multivar. Anal. 2019, 171, 234–249.
- Bickel, P.J.; Levina, E. Regularized estimation of large covariance matrices. Ann. Stat. 2008, 36, 199–227.
- Khare, K.; Rajaratnam, B. Wishart distributions for decomposable covariance graph models. Ann. Stat. 2011, 39, 514–555.
- Cai, T.; Zhou, H. Minimax estimation of large covariance matrices under ℓ1-norm. Stat. Sin. 2012, 22, 1319–1349.
- Maurya, A. A well-conditioned and sparse estimation of covariance and inverse covariance matrices using a joint penalty. J. Mach. Learn. Res. 2016, 17, 4457–4484.
- Ledoit, O.; Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 2004, 88, 365–411.
- Wang, C.; Pan, G.; Tong, T.; Zhu, L. Shrinkage estimation of large dimensional precision matrix using random matrix theory. Stat. Sin. 2015, 25, 993–1008.
- Hong, Y.; Kim, C. Recent developments in high dimensional covariance estimation and its related issues, a review. J. Korean Stat. Soc. 2018, 47, 239–247.
- Le, K.T.; Chaux, C.; Richard, F.; Guedj, E. An adapted linear discriminant analysis with variable selection for the classification in high-dimension, and an application to medical data. Comput. Stat. Data Anal. 2020, 152, 107031.
- Srivastava, M.S. Some tests concerning the covariance matrix in high dimensional data. J. Jpn. Stat. Soc. 2005, 35, 251–272.
- Ledoit, O.; Wolf, M. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Ann. Stat. 2012, 40, 1024–1060.
- Friedman, J.H. Regularized discriminant analysis. J. Am. Stat. Assoc. 1989, 88, 165–175.
- Wu, Y.; Qin, Y.; Zhu, M. Quadratic discriminant analysis for high-dimensional data. Stat. Sin. 2019, 29, 939–960.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).