A Trimmed Clustering-Based l1-Principal Component Analysis Model for Image Classification and Clustering Problems with Outliers

Different versions of principal component analysis (PCA) have been widely used to extract important information for image recognition and image clustering problems. However, owing to the presence of outliers, this remains challenging. This paper proposes a new PCA methodology based on a novel discovery that the widely used l1-PCA is equivalent to a two-group k-means clustering model. The projection vector of the l1-PCA is the vector difference between the two cluster centers estimated by the clustering model. In theory, this vector difference provides inter-cluster information, which is beneficial for distinguishing data objects from different classes. However, the performance of the l1-PCA is not comparable with that of the state-of-the-art methods. This is because the l1-PCA can be sensitive to outliers, as the equivalent clustering model is not robust to outliers. To overcome this limitation, we introduce a trimming function to the clustering model and propose a trimmed-clustering based l1-PCA (TC-PCA). With this trimming set formulation, the TC-PCA is not sensitive to outliers. In addition, we mathematically prove the convergence of the proposed algorithm. Experimental results on image classification and clustering indicate that our proposed method outperforms the current state-of-the-art methods.


Introduction
Image classification and clustering problems are topics fundamental to various areas of machine learning [1][2][3], including image recognition and image clustering. Principal component analysis (PCA) has been widely used to perform dimensionality reduction and extract useful information for these problems [4][5][6][7][8][9][10][11]. One of the common objectives of dimensionality reduction is to retain the most important information that is beneficial to data processing tasks and, meanwhile, filter out corrupted and noisy information from the dataset. One example is the face recognition problem. A face database may contain face images with occlusions such as scarves, sunglasses and so forth. Obviously, the key information that can recognize a person lies in the facial features, not the scarves or the sunglasses. A good dimension reduction method can effectively extract the key facial features and, at the same time, ignore the occluded information. Another example is the image clustering problem. The purpose of clustering is to partition the data into different clusters and group similar data objects together. However, similar to the above face recognition problem, corrupted and noisy information such as scarves can make two facial images very different even if they are taken from the same person. This makes the clustering task very challenging. Again, a good dimension reduction method can filter out the scarves, retain the key facial features and give a more accurate clustering result.
The major challenge of PCA for supervised classification and unsupervised clustering problems is how to extract important information that can distinguish the characteristics of different classes/clusters from a corrupted and noisy dataset. For a face recognition problem, the facial features of different persons are the key to distinguishing them. However, corrupted and noisy information can also be a factor that causes differences among facial images. Figure 1 shows two frontal face images and two facial images with scarves. Obviously, facial features such as the eyes of these persons are a bit different and can be used to distinguish these persons. However, the face with a scarf and the face without a scarf are also different, and the two scarves differ from each other too. If the dimension reduction method wrongly identifies the scarves as the key features that cause the difference, the accuracy of the trained recognition system may heavily depend on this occluded information and lead to a poor recognition result. In image recognition and clustering problems, this kind of data object is known as an outlier, meaning that it carries corrupted and noisy information.
Many different methods have been proposed to solve the above outlier problem. However, these techniques still have evident defects when applied to real-world classification and clustering problems. The earliest method used to extract the important information of the data is to apply eigenvalue decomposition to the covariance matrix of the data [12]. The eigenvectors with small eigenvalues are discarded and the eigenvectors with large eigenvalues are retained for classification or clustering. This method works well only if the data do not contain any outliers. This is because an eigenvector of a covariance matrix represents the variance of the data in a specific direction; however, the occlusion of a facial image can cause large variation as well. The eigenvectors with large eigenvalues can then wrongly identify the occluded information as important information. To address this problem, different lp-norm based methods have been proposed. One example is the l1-PCA, which adopts the l1 norm to measure the variation of features of the data [13,14]. This approach replaces the squared l2 norm adopted by the classical PCA with an l1 norm. The l1-norm measure usually gives a much lower weight to the data objects that are far away from the majority. Although this approach has been proved to be more robust to outliers and more effective than the classical PCA, little work has been devoted to studying how it extracts information that can distinguish different classes or clusters for classification or clustering problems. Another popular approach is the rank-based PCA method. Its idea is to decompose the data matrix into a single low-rank matrix and a sparse component [15][16][17]. The low-rank matrix approximates the common features that are linearly correlated in the data while the sparse component
approximates less frequently appearing features, which are assumed to be the corrupted and noisy information. For face recognition or clustering problems, the low-rank matrix represents the common facial features of different images that are highly correlated with each other. The scarves and sunglasses are features that appear less frequently in the database; they are relatively sparse information and belong to the sparse component. Although this approach can effectively identify the occluded information, it may discard some information that can distinguish the characteristics of different classes or clusters. In the low-rank formulation, the common facial features of different data objects are not strictly linearly correlated with each other. Some people may have larger eyes while some may have more attractive lips. These are key characteristics for distinguishing different people. However, because they are less frequently appearing features, they may be discarded and treated as sparse information.
To alleviate the aforementioned issues, we introduce a modified version of the l1-PCA, which incorporates clustering with the l1-PCA. We find that the superior performance of the l1-PCA over the classical PCA is due not only to the l1-norm formulation but also to its ability to extract information that is beneficial to image classification and clustering problems. We mathematically prove that the l1-PCA can be expressed as a special two-group k-means clustering problem. The projection vector of the l1-PCA is the vector difference between the two cluster centers obtained by the k-means clustering problem. If the two clusters are two different classes of the data, the corresponding vector difference represents the inter-class direction that groups data objects of a similar nature together. This is beneficial for distinguishing the two classes. Figure 2 illustrates this situation. The figure shows two classes of data: the left cluster forms one class while the right cluster forms another. The red stars are the cluster centers of the two classes. The vector difference between these two centers is in the horizontal direction and, apparently, the horizontal direction provides the key information to distinguish the two classes. Although the l1-PCA possesses this important property, its performance is not as good as that of the state-of-the-art methods. The reason is that the k-means clustering algorithm adopts the squared l2 norm in its formulation and is not robust to outliers. In other words, the l1-PCA is not robust to outliers. To overcome this limitation, we propose a new method, namely, the trimmed-clustering based l1-PCA (TC-PCA), which replaces the squared l2 norm with a trimming function in the special k-means clustering algorithm. The contributions of this paper are as follows.

•
We prove that the l1-PCA is equivalent to a two-group k-means clustering model. The projection vector estimated by the l1-PCA is the vector difference between the two cluster centers obtained by the k-means clustering algorithm. In other words, the projection vector of the l1-PCA represents the inter-cluster direction, which is beneficial for distinguishing data objects from different classes.

•
We propose a novel TC-PCA model by integrating a trimming set into the two-group k-means clustering model, which makes the proposed method insensitive to outliers.

•
We mathematically prove that the estimator of the TC-PCA model is insensitive to outliers, which shows the robustness of the proposed method. In addition, we mathematically prove the convergence of the proposed algorithm.
This paper is organized as follows. In the next section, we briefly review related work. Section 3 presents the proposed TC-PCA model and the overall implementation; in addition, we discuss the mathematical properties and performance of the proposed method. Experimental results are shown in Section 4 and Section 5 concludes the paper.

Related Work
The current PCA methodologies typically adopt one of the following two ways to obtain projection vectors: the lp-norm based estimator and the rank-based estimator.
(i) lp-norm based estimator: it models the projection vectors by a function of an lp norm. In contrast with the classical squared l2 norm, the lp norm gives much smaller function values to the data objects that are far away from the majority. This can diminish the negative effect caused by outliers. One of the best known and most influential approaches is the l1-PCA, which produces projection vectors by maximizing the l1 norm of the projected data. However, owing to the singularity of the l1 norm, many different solvers have been proposed. In Reference [13], the l1-PCA was reformulated as a linear programming problem, which was then solved by the simplex method. The authors in Reference [18] adopted a relaxation approach that turns the binary constraint into a semi-definite constraint and applied the interior point method to obtain the optimal solution. Kwak [14] reformulated the l1-PCA and expressed the optimization problem through composite constraints of l1 and l2 norms. This method finds one projection vector at a time; an orthogonalization procedure is incorporated to find the rest of the projection vectors. Later, the PCA with non-greedy l1-norm maximization [19] was proposed. In contrast with the l1-PCA, the non-greedy approach can find all required projection vectors at once. Zhou et al. [20] directly incorporated an l1-norm penalty into the original PCA formulation, which performs feature selection for image classification and clustering problems. To find an optimal solution of the l1-PCA in non-greedy form, Markopoulos et al. [21] proposed a polar expression of the data matrix and expressed the problem in a dual form. Their method can effectively search through all possibilities and output a very high-quality solution. Later, Markopoulos et al. [22] proposed a different solver based on bit-flipping iterations. Kwak [23] generalized the l1-PCA by replacing the l1 norm with a general lp norm. More recently, Luo et al.
[24,25] proposed a novel l2,1-norm based PCA that does not require data centralization for dimensionality reduction. Variants of the l1-PCA have been proposed that are formulated as the minimization of a residual function. In References [26] and [27], the authors introduced an l1-norm residual function and adopted alternating convex optimization and a divide-and-conquer approach, respectively, to find the projection vectors. Other than the l1 norm, Nie et al. [28] proposed an l2-norm residual function with an optimal-mean strategy to automatically obtain the center of the dataset. He et al. [29] proposed a squared l2-norm residual function for the PCA algorithm based on maximum correntropy and solved it with a half-quadratic optimization method. More recently, Nie et al. [30] proposed a more general lp-norm based residual function, named l2,p-norm PCA, for image recognition problems. Beyond the above PCA problems, lp-norm based estimators are also widely used in other types of dimensionality reduction approaches, including 2DPCA, LDA, 2DLDA, tensor methods and so forth. Li et al. [31] applied the l1-norm formulation to 2DPCA [1]. Ju et al. [32] introduced a probabilistic formulation for the l1-norm based 2DPCA with the aid of variational inference. Other than 2DPCA, the lp norm is also widely applied in LDA. Zhong et al. [33] and Liu et al. [34] proposed the use of the l1 norm in the LDA formulation; they replaced the squared l2 norm for the between-class and within-class distances with the l1 norm. Wang et al. [35] introduced a non-greedy l1-norm technique for the 2DLDA.
(ii) Rank-based estimator: its idea is to decompose the data matrix into a sparse matrix and a low-rank matrix, where the latter is a polished version of the original dataset with a low-rank structure. This approach generally assumes that all data objects are sparsely corrupted and that the key characteristics of the data can be represented by a low-rank matrix. In References [15][16][17], the authors proposed a robust PCA (RPCA) to recover the low-rank matrices via convex optimization. The idea of RPCA is to modify the entries of the data matrix in the sense of the l1 norm so that the rank of the matrix is minimized. The RPCA is widely used in various applications such as video surveillance [36], dictionary learning [37,38], compressed hyperspectral imaging and face recognition [39]. Zhang et al. [40] improved the RPCA by introducing another l1-norm penalty that further regularizes the low-rank matrix. Sun et al. [32] enhanced the RPCA for the video surveillance problem by introducing a local regularization term into the model that better preserves the local structure of the image data. Wang et al. [41] proposed a probabilistic robust matrix factorization that formulates the sparse component with a Laplace prior and the two matrices of the low-rank component with a Gaussian prior. Wang et al. [42] improved the probabilistic robust matrix factorization model and proposed a new framework based on a Bayesian formulation; they imposed conjugate priors (a multivariate normal distribution and a Wishart distribution) on the low-rank component and a generalized inverse Gaussian distribution on the sparse errors. Zhao et al. [43] modified the aforementioned Bayesian framework and introduced a two-level generative Gaussian approach to model complex noise. Xue et al. [44] modified the RPCA by introducing total variation and rank-1 constraints into the model.

A Trimmed-Clustering Based l1-PCA Model
In this section, we show that the l1-PCA is equivalent to the two-group k-means clustering model and mathematically prove the equivalence of the two models. After that, we present the proposed TC-PCA model and the algorithm to obtain a set of orthonormal projection vectors. Mathematical properties and the performance of the proposed method are also discussed.

Relating l1-PCA to Two-Group k-Means Clustering
Consider the following l1-PCA model:

max_{u} Σ_{i=1}^{n} |(x_i − c)^T u|  subject to  ||u||_2 = 1,  (1)

where u is the projection vector and c is a centroid of the dataset. The centroid c is usually set as the mean or median of the whole data. In our proposed model, the projection vector u can be obtained without the need for a data centralization procedure; this will be explained later. Now, we show that the l1-PCA can be expressed as the two-group k-means clustering model below:

min_{û, l} Σ_{i=1}^{n} [ l_{i,1} ||x_i − (c + û)||_2^2 + l_{i,2} ||x_i − (c − û)||_2^2 ],  (2)

where ||·||_2 is the l2-norm distance and l = (l_{i,1}, l_{i,2}) is the indicator variable with l_{i,k} = 1, k = 1, 2, if x_i belongs to the kth cluster and 0 otherwise, so that l_{i,2} = 1 − l_{i,1}. This clustering model partitions the data into two disjoint parts, represented by the two centroids û + c and −û + c. The data points that are closer to û + c form the first cluster and the rest form the second cluster. Now, we examine this property in the l1-PCA model. The absolute term of this model can be expressed as

|(x_i − c)^T u| = sign((x_i − c)^T u) · (x_i − c)^T u.

This expression introduces another interpretation of the l1-PCA. The sign function virtually assigns {−1, 1} to each data point, and each data point belongs to one of the two regions sign((x_i − c)^T u) = 1 or sign((x_i − c)^T u) = −1. This assignment means that if the point x_i − c lies in the same half-plane as u, it is assigned to the first cluster (i.e., sign((x_i − c)^T u) = 1); otherwise, it belongs to the second cluster (i.e., sign((x_i − c)^T u) = −1).
In other words, a point x_i that is closer to u + c belongs to the first cluster; if it is closer to −u + c, it belongs to the second cluster. This is the same as the two-group clustering model. Figure 3 gives a visual illustration of this relationship. Figure 3a shows the first projection vector obtained by the l1-PCA on a 2-dimensional normal dataset with zero mean and variances 1 and 10, whereas Figure 3b shows the same dataset with the cluster centers estimated by the two-group k-means clustering model. In Figure 3a, the data points x_i − c that are in the same half-plane as the vector u ((x_i − c)^T u > 0, the upper half of the data) belong to the first cluster (shaded region). Otherwise, they are in the opposite half-plane ((x_i − c)^T u < 0, the lower half of the data) and belong to the second cluster (non-shaded region). In Figure 3b, the data are partitioned into two parts.
The data points closer to the upper centroid belong to the first cluster (shaded region) while the rest belong to the second cluster (non-shaded region). With this connection, we can see that the vector difference between the two cluster centers (i.e., û + c − (−û + c) = 2û) shown in Figure 3b is parallel to the projection vector u shown in Figure 3a. This will be mathematically justified next.

The following gives a mathematical justification of the equivalence between the l1-PCA model and the special two-group k-means clustering model. By expressing each absolute term as a binary variable s_i, Equation (1) becomes

max_{||u||_2 = 1} Σ_i |(x_i − c)^T u| = max_{||u||_2 = 1} max_{s ∈ {−1,1}^n} Σ_i s_i (x_i − c)^T u = max_{s ∈ {−1,1}^n} ||X_c s||_2.  (3)

The first equality holds because |x| = max_{s ∈ {−1,1}} sx. The second equality holds because max_{||u||_2 = 1} (X_c s)^T u = ||X_c s||_2. It is noted that the maximum point of the l2-norm function is the same as that of the squared l2-norm function. That is,

argmax_{s ∈ {−1,1}^n} ||X_c s||_2 = argmax_{s ∈ {−1,1}^n} ||X_c s||_2^2.  (4)

We will see that the variable s_i in the l1-PCA model is the difference of the two indicator variables of the k-means clustering algorithm; that is, s_i = l_{i,1} − l_{i,2}. The above squared l2-norm function leads to the following binary quadratic programming problem:

max_{s ∈ {−1,1}^n} s^T X_c^T X_c s,  (5)

where X_c = [x_1 − c, x_2 − c, ..., x_n − c] and s = [s_1, s_2, ..., s_n]^T ∈ R^n. Let s_i = l_{i,1} − l_{i,2} ∈ {−1, 1} and û = λ·u with ||u||_2 = 1, where λ is a constant. The theoretical justification is complete if Equation (2) can be expressed as the above binary quadratic programming problem. It is noted that Equation (2) can be rewritten as

Σ_i Σ_k l_{i,k} ||x_i − v_k||_2^2 = Σ_i ||x_i − c − s_i û||_2^2 = Σ_i ||x_i − c||_2^2 − 2 Σ_i s_i (x_i − c)^T û + n λ^2  (6)
= Σ_i ||x_i − c||_2^2 − 2λ Σ_i s_i (x_i − c)^T u + n λ^2.  (7)

The second and third equalities hold because ||û||_2^2 = λ^2 and û = λ·u. Thus, to minimize (7) with respect to λ, we need the following claim.

Claim:
The global minimum value of the function f(x) = nx^2 − γx is −γ^2/(4n), attained at x = γ/(2n). Here, γ and n are constants and n is a positive number.

Proof: By taking γ = 2 Σ_i s_i ((x_i − c)^T u) and x = λ in the above claim, the minimum of Equation (7) over λ is written as

min_λ Σ_i ||x_i − c||_2^2 − 2λ Σ_i s_i (x_i − c)^T u + n λ^2 = Σ_i ||x_i − c||_2^2 − (1/n) ( Σ_i s_i (x_i − c)^T u )^2,

so minimizing over s and u is equivalent to maximizing ( Σ_i s_i (x_i − c)^T u )^2, which is equal to (5). The third equality holds because minimizing the negated term is the same as maximizing the term itself. In other words, the l1-PCA is essentially equivalent to the two-group k-means clustering model. This equivalence has three major implications. First, the projection vector u estimated by the l1-PCA is the normalized vector difference between the two cluster centers, û + c and −û + c, obtained by the clustering algorithm; that is, u = û/||û||_2. Second, the binary variable s_i introduced in Equation (3) is the difference of the two indicator variables of the two-group k-means clustering model. Third, the projection vector u can be obtained without a data centralization procedure.
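The equivalence can be checked numerically. The sketch below is our own minimal NumPy illustration (not the authors' implementation): it alternates the sign assignment s_i = sign((x_i − c)^T û) with the symmetric-center update û = (1/n) Σ_i s_i (x_i − c), which follows from minimizing the clustering objective for fixed assignments, and then normalizes û to obtain the projection vector u.

```python
import numpy as np

def l1_pca_direction(X, n_iter=200, seed=0):
    """One l1-PCA projection vector via the equivalent symmetric two-group
    clustering with centers c + u_hat and c - u_hat (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    c = X.mean(axis=0)                    # centroid c of the dataset
    Xc = X - c
    u_hat = Xc[rng.integers(len(Xc))] + 1e-12   # random non-zero init
    for _ in range(n_iter):
        s = np.where(Xc @ u_hat >= 0, 1.0, -1.0)  # s_i = l_i1 - l_i2
        u_new = (s[:, None] * Xc).mean(axis=0)    # u_hat = (1/n) sum s_i (x_i - c)
        if np.allclose(u_new, u_hat):             # assignments stabilized
            break
        u_hat = u_new
    return u_hat / np.linalg.norm(u_hat)          # u = u_hat / ||u_hat||_2

# Anisotropic Gaussian as in Figure 3: variance 10 along x, 1 along y.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2)) * np.array([np.sqrt(10.0), 1.0])
u = l1_pca_direction(X)
print(abs(u[0]))   # close to 1: u aligns with the high-variance axis
```

On this elongated dataset the recovered direction agrees with the dominant axis, matching the cluster-center difference interpretation.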

The Proposed Model
Based on the two-group k-means clustering model in (2), we propose the following trimmed-clustering based l1-PCA (TC-PCA) model:

min_{v, l} J_obj(v, l, Ω(p)) = Σ_{x_i ∈ Ω(p)} Σ_{k=1}^{2} l_{i,k} ||x_i − v_k||_2^2,  (8)

where v = {v_1, v_2} is a set of cluster centers, l = (l_{i,1}, l_{i,2}) is the indicator variable with l_{i,2} = 1 − l_{i,1}, and

Ω(p) = { x_i : min_k ||x_i − v_k||_2^2 is among the ⌊n·p%⌋ smallest }  (9)

is a trimming set containing the samples that are closest to the cluster centers. When p = 100, all sample data are considered, whereas the trimming set is empty if p = 0. The trimming set has been proved to be a robust M-estimator with a high breakdown point property [45][46][47]. Following the parameter setting used in References [48] and [49], in this paper we only consider 75% (i.e., p = 75) of the data to estimate the projection vectors. Note that (8) is a special case of (2) obtained by incorporating the trimming set into the two-group k-means clustering model and setting c + û = v_1 and c − û = v_2. Solving these two equations leads to

û = (v_1 − v_2)/2  and  c = (v_1 + v_2)/2.
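The trimmed objective can be evaluated directly. The following is a minimal sketch (our illustration; the function name `tc_pca_objective` is ours), assuming the trimming set keeps the ⌊n·p%⌋ samples with the smallest distance to their nearest center:

```python
import numpy as np

def tc_pca_objective(X, v1, v2, p=75):
    """Trimmed two-group objective: sum of squared distances to the nearest
    of the two centers, over the p% of samples closest to the centers."""
    d1 = np.sum((X - v1) ** 2, axis=1)    # ||x_i - v_1||_2^2
    d2 = np.sum((X - v2) ** 2, axis=1)    # ||x_i - v_2||_2^2
    d = np.minimum(d1, d2)                # distance to the closer center
    m = int(np.floor(len(X) * p / 100))   # |Omega(p)| = floor(n * p%)
    return np.sort(d)[:m].sum()           # sum over the trimming set only

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0], [1000.0, 0.0]])
v1, v2 = np.array([0.5, 0.0]), np.array([10.5, 0.0])
print(tc_pca_objective(X, v1, v2, p=80))   # -> 1.0: the point at x = 1000 is trimmed
```

With p = 80, the four well-behaved points each contribute 0.25 while the far-away point is excluded; with p = 100 the outlier would dominate the objective.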

The Overall Implementation
We shall provide the optimality conditions for (8) and present the overall implementation of the algorithm to estimate a collection of projection vectors.
Taking the first derivative of (8) with respect to v_k and setting it to zero, we obtain the optimality condition for v_k:

0 = ∂J_obj(v, l, Ω(p)) / ∂v_k = −2 Σ_{x_i ∈ Ω(p)} l_{i,k} (x_i − v_k).

That is, we have

v_k = ( Σ_{x_i ∈ Ω(p)} l_{i,k} x_i ) / ( Σ_{x_i ∈ Ω(p)} l_{i,k} ),  (11)

which is the average of the sample data that are in both the trimming set and the kth cluster. The optimality conditions for the indicator variables are

l_{i,1} = 1 if ||x_i − v_1||_2^2 ≤ ||x_i − v_2||_2^2 and 0 otherwise, with l_{i,2} = 1 − l_{i,1}.  (12)

The condition (12) simply sets l_{i,k} = 1, k = 1, 2, if x_i is closer to the kth cluster center. The optimization of (8) with the trimming set (9) is performed by alternating minimization between v and l together with the update of Ω(p). In what follows, we present an algorithm for extracting a single projection vector and the overall algorithm for estimating a set of orthonormal projection vectors.
Algorithm 1: Single Projection Vector Extraction Algorithm (SPVEA)
Step 1. Input p and t_max. Set J = ∞ and t = 0.
Step 2. Randomly choose two samples from the dataset as initial cluster centers.
Step 3. Update the trimming set Ω(p) via (9).
Step 4. Update the cluster centers v_k via (11).
Step 5. Update the indicator functions l_{i,k} via (12).
Step 6. If there is a change in the indicator functions l_{i,k}, go to Step 3. Otherwise, compute J_obj via (8).
Step 7. If J_obj < J, set J = J_obj and record the current solution. Set t = t + 1.
Step 8. If t < t_max, go to Step 2; otherwise, output the recorded solution.
The implementation of the Single Projection Vector Extraction Algorithm (SPVEA), Algorithm 1, is summarized as follows. There are two input parameters (Step 1): the percentage of data used for estimating the projection vector (p) and the maximum number of re-initializations of the algorithm (t_max). We then randomly select two samples from the dataset as initial cluster centers, followed by updates of the trimming set, cluster centers and indicator functions (Steps 2-5). If there is no change in the assignment l_{i,k} of the sample data to the cluster centers, the objective value of TC-PCA is computed (Step 6). The above process is repeated t_max times and the optimal solution corresponds to the smallest value of J_obj (Steps 7-8). In our experiments, we set t_max = 10. That means we use 10 different sets of initial guesses for cluster center initialization and output the one with the smallest objective function value.
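The SPVEA loop can be sketched as follows. This is an illustrative NumPy re-implementation under our reading of the steps above, not the authors' code; the function name `spvea` and the toy data are ours.

```python
import numpy as np

def spvea(X, p=75, t_max=10, max_iter=100, seed=0):
    """Trimmed two-group k-means with t_max random restarts; returns the
    normalized center difference (the projection vector), the centers and J."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(np.floor(n * p / 100))          # size of the trimming set
    best_J, best_v = np.inf, None
    for _ in range(t_max):
        v = X[rng.choice(n, size=2, replace=False)].astype(float)  # Step 2
        labels = np.full(n, -1)
        for _ in range(max_iter):
            d = np.stack([((X - vk) ** 2).sum(axis=1) for vk in v])  # 2 x n
            keep = np.argsort(d.min(axis=0))[:m]      # trimming set Omega(p)
            new_labels = d.argmin(axis=0)             # indicator update (12)
            for k in (0, 1):                          # center update (11)
                members = keep[new_labels[keep] == k]
                if members.size:
                    v[k] = X[members].mean(axis=0)
            if np.array_equal(new_labels, labels):    # no assignment change
                break
            labels = new_labels
        d = np.stack([((X - vk) ** 2).sum(axis=1) for vk in v])
        keep = np.argsort(d.min(axis=0))[:m]
        J = d.min(axis=0)[keep].sum()                 # objective (8)
        if J < best_J:                                # keep the best restart
            best_J, best_v = J, v.copy()
    u_hat = (best_v[0] - best_v[1]) / 2               # u_hat = (v1 - v2) / 2
    return u_hat / np.linalg.norm(u_hat), best_v, best_J

# Two blobs on the x-axis plus two gross outliers far away.
rng = np.random.default_rng(42)
blob1 = rng.normal([0.0, 0.0], 0.5, size=(20, 2))
blob2 = rng.normal([10.0, 0.0], 0.5, size=(20, 2))
outliers = np.array([[500.0, 500.0], [480.0, 520.0]])
X = np.vstack([blob1, blob2, outliers])
u, v, J = spvea(X, p=90)
```

Because the outliers never enter the trimming set of the best restart, the recovered direction u points along the inter-blob (horizontal) axis rather than toward the outliers.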
To construct an entire set of orthonormal projection vectors U = [u_1, u_2, ..., u_D] ∈ R^{d×D}, we follow [14], which estimates the projection vectors one by one. After obtaining the τth projection vector u_τ using the SPVEA, the dataset is projected onto the orthogonal subspace in order to construct the next projection vector via

x_i^{(τ+1)} = (I − u_τ u_τ^T) x_i^{(τ)},

where I is an identity matrix and x_i^{(τ+1)} is the projected dataset for the estimation of u_{τ+1}, with x_i^{(1)} = x_i. The complete set of orthonormal projection vectors is obtained via the proposed TC-PCA algorithm, which iteratively applies the SPVEA with the updated subspace. The TC-PCA algorithm stops when D projection vectors have been obtained. This is Algorithm 2.
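The deflation step can be sketched in a few lines. In this illustration `leading_direction` is a stand-in extractor (the dominant right singular vector via SVD) for the SPVEA of Algorithm 1; any single-direction estimator could be plugged in, and the deflation alone is what guarantees orthonormality of the collected directions.

```python
import numpy as np

def deflate(X, u):
    """Project every row x_i onto the complement of u: x_i <- (I - u u^T) x_i."""
    u = u / np.linalg.norm(u)
    return X - np.outer(X @ u, u)

def leading_direction(X):
    """Stand-in for the SPVEA: dominant right singular vector of centered X."""
    return np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
U, Xt = [], X.copy()
for _ in range(3):                 # extract D = 3 directions one by one
    u = leading_direction(Xt)
    U.append(u)
    Xt = deflate(Xt, u)            # x^(tau+1) = (I - u u^T) x^(tau)
U = np.array(U)
print(np.round(U @ U.T, 6))        # identity matrix: directions are orthonormal
```

Each extracted direction lies in the complement of all previous ones, so U^T U is the identity up to floating-point error.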
If τ ≤ D, go to Step 2; otherwise, output U and stop.
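The deflation loop of Algorithm 2 can be sketched as follows. The `top_pc` routine is a stand-in extractor (ordinary PCA) used only to make the sketch self-contained; the paper would use the SPVEA in its place.

```python
import numpy as np

def tc_pca(X, D, extract):
    """Sketch of Algorithm 2: build D orthonormal projection vectors one by
    one. `extract` is any routine returning a single unit-norm projection
    vector; after each extraction the data are deflated by x <- (I - u u^T) x,
    so the next vector is estimated in the orthogonal complement."""
    Xp = np.asarray(X, dtype=float).copy()
    U = []
    for _ in range(D):
        u = extract(Xp)
        U.append(u)
        Xp -= np.outer(Xp @ u, u)   # each row becomes (I - u u^T) x_i
    return np.column_stack(U)

def top_pc(A):
    """Stand-in extractor for the sketch: the leading eigenvector of the
    centered scatter matrix (ordinary PCA), not the paper's SPVEA."""
    Ac = A - A.mean(axis=0)
    _, V = np.linalg.eigh(Ac.T @ Ac)
    return V[:, -1]
```

Because every extraction happens in the orthogonal complement of the previous vectors, the returned matrix U satisfies U^T U ≈ I by construction.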

Mathematical Properties
In this subsection, we prove, via a breakdown point analysis, that the estimator of the proposed TC-PCA model is insensitive to outliers. Here, to simplify the proof, we assume that the outliers form a group of points far away from the majority of the data. We also prove that the proposed TC-PCA algorithm converges.
In robust statistics, the breakdown point measures the ability of a statistic to resist the outliers contained in the dataset [50].The higher the breakdown point of an estimator, the more robust it is.
Theorem 1. The estimator of the TC-PCA model has a breakdown point of 1 − p%, where p is the parameter of the trimming set (9).
Proof. Suppose that X = {x_1, x_2, . . ., x_n} is a sample of size n. Without loss of generality, we assume that the first ⌊n·p%⌋ samples are fixed and x_i → ∞ for ⌊n·p%⌋ + 1 ≤ i ≤ n, where ⌊·⌋ is the floor function. By the definition of Ω(p) in (9), the last n − ⌊n·p%⌋ samples will be discarded and thus Ω(p) = {x_i : i = 1, 2, . . ., ⌊n·p%⌋}. Therefore, the breakdown point of the estimator of the TC-PCA model is ⌊n·(1 − p%)⌋/n ≈ 1 − p%.
Theorem 1 reveals that the estimator of the TC-PCA model is insensitive to outliers and remains reliable as long as less than 1 − p% of the data are outliers. Using techniques similar to those in Theorem 1, it can be shown that the breakdown point of l_1-PCA is zero, which implies that any outlier in the dataset will affect the estimation of a projection vector.
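The effect described in Theorem 1 can be illustrated with a one-dimensional analogue: a trimmed average ignores gross corruption as long as the corrupted fraction stays below 1 − p. The trimming rule below (keep the points closest to the median) is our own simplification of the trimming set Ω(p), not the paper's cluster-based definition.

```python
import numpy as np

def trimmed_mean(x, p=0.75):
    """Illustrative one-dimensional analogue of the trimming set (9):
    keep the floor(n*p) samples closest to the median and average them.
    Samples sent toward infinity are discarded whenever fewer than a
    (1 - p) fraction of the data are corrupted."""
    x = np.asarray(x, dtype=float)
    m = int(np.floor(len(x) * p))
    keep = np.argsort(np.abs(x - np.median(x)))[:m]
    return float(x[keep].mean())
```

Corrupting 10% of a sample leaves the trimmed mean essentially unchanged, whereas the ordinary mean (breakdown point zero) is dragged arbitrarily far away.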
In addition to the breakdown property of the proposed method, the convergence of the TC-PCA algorithm in obtaining a collection of orthonormal projection vectors is of paramount importance. Since the main component of the proposed TC-PCA algorithm is the SPVEA, which is driven by (8), it suffices to prove that the objective function of TC-PCA is decreasing and bounded below.
Let v^t, l^t, Ω(p)^t and J_obj(v^t, l^t, Ω(p)^t) be the cluster center, indicator function, trimming set and the function value of TC-PCA at the tth iteration, respectively.
Consider the above variables at the (t + 1)th iteration. The algorithm starts by updating the trimming set, which selects the samples in the first pth percentile of the distances to their assigned cluster centers. Since Ω(p)^{t+1} collects the samples with the smallest distances, we have

J_obj(v^t, l^t, Ω(p)^{t+1}) ≤ J_obj(v^t, l^t, Ω(p)^t). (14)

Next, we consider the update of the cluster center. The Hessian of J_obj with respect to v_k is positive definite. This means that, at the (t + 1)th iteration, given l^t and Ω(p)^{t+1}, the function g(v) = J_obj(v, l^t, Ω(p)^{t+1}) is convex. As the update (11) satisfies the first-order optimality condition of g(v), the update leads to a global minimum of g(v). That is,

J_obj(v^{t+1}, l^t, Ω(p)^{t+1}) ≤ J_obj(v^t, l^t, Ω(p)^{t+1}). (15)

Finally, each indicator variable selects the smaller of the two distance terms, and thus the update of the indicator variables leads to a smaller objective function value:

J_obj(v^{t+1}, l^{t+1}, Ω(p)^{t+1}) ≤ J_obj(v^{t+1}, l^t, Ω(p)^{t+1}). (16)

Combining (14)-(16) shows that the objective function is decreasing and bounded below:

J_obj(v^{t+1}, l^{t+1}, Ω(p)^{t+1}) ≤ J_obj(v^t, l^t, Ω(p)^t), with J_obj ≥ 0,

and hence the proposed TC-PCA algorithm converges.
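The three descent inequalities can be checked numerically on a toy one-dimensional instance. The sketch below uses a squared-error objective as a stand-in for (8) and records the objective value after each sub-update (trimming set, centers, assignments); every recorded value should be no larger than its predecessor.

```python
import numpy as np

def check_descent(X, p=0.75, steps=15, seed=1):
    """Numerically trace the trimmed two-means objective over the three
    sub-updates per iteration, mirroring inequalities (14)-(16).
    Returns the list of recorded objective values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, m = len(X), int(np.floor(len(X) * p))

    def J(centers, labels, keep):
        return float(((X[keep] - centers[labels[keep]]) ** 2).sum())

    centers = X[rng.choice(n, size=2, replace=False)].copy()
    labels = np.abs(X[:, None] - centers[None]).argmin(axis=1)
    keep = np.argsort(np.abs(X - centers[labels]))[:m]
    Js = [J(centers, labels, keep)]
    for _ in range(steps):
        # (14): trimming update keeps the m samples nearest their center
        keep = np.argsort(np.abs(X - centers[labels]))[:m]
        Js.append(J(centers, labels, keep))
        # (15): center update is the minimizer of the convex sub-problem
        for k in (0, 1):
            members = keep[labels[keep] == k]
            if len(members):
                centers[k] = X[members].mean()
        Js.append(J(centers, labels, keep))
        # (16): each sample picks the nearer of the two centers
        labels = np.abs(X[:, None] - centers[None]).argmin(axis=1)
        Js.append(J(centers, labels, keep))
    return Js
```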

Synthetic Analysis
We demonstrate the robustness of the proposed TC-PCA on data with outliers. Figure 4 shows the first projection vectors estimated by the l_1-PCA and the proposed TC-PCA in a noisy environment. The dataset with outliers is obtained by adding 50 samples of 2-dimensional Gaussian noise to the dataset used in Figure 3. With 4.8% (50/(1000 + 50)) outliers in the dataset, the projection vector obtained by the l_1-PCA is influenced by the outliers (see Figure 4a), whereas the projection vector estimated by the TC-PCA is not (see Figure 4b). This is mainly because the TC-PCA adopts the trimming set, which retains only the p% of samples that are close to the cluster centers for the estimation of the projection vector, while the remaining 1 − p% of the data, which include the outliers, are discarded. It is important to note that the percentage of data discarded by the trimming set should be greater than the percentage of outliers in the dataset so that an accurate projection vector can be obtained (see Theorem 1).
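The qualitative behavior in Figure 4 can be reproduced with a small simulation. The trimming rule below (keep the p-fraction of points closest to the coordinate-wise median before running plain PCA) is a crude stand-in for the paper's cluster-based trimming set; the data dimensions, scales and outlier placement are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Clean data: 1000 points stretched along the x-axis (true first PC ~ [1, 0]).
clean = rng.normal(size=(1000, 2)) * np.array([5.0, 0.5])
# 50 Gaussian outliers far off in the y-direction (~4.8% of the data).
outliers = rng.normal(loc=[0.0, 40.0], scale=1.0, size=(50, 2))
X = np.vstack([clean, outliers])

def first_pc(A):
    """First principal component: leading eigenvector of the scatter matrix."""
    Ac = A - A.mean(axis=0)
    _, V = np.linalg.eigh(Ac.T @ Ac)
    return V[:, -1]

def trimmed_first_pc(A, p=0.75):
    """Keep the p-fraction of points closest to the coordinate-wise median,
    then run plain PCA -- a simple proxy for the trimming set Omega(p)."""
    d = np.linalg.norm(A - np.median(A, axis=0), axis=1)
    keep = np.argsort(d)[: int(np.floor(len(A) * p))]
    return first_pc(A[keep])

u_plain = first_pc(X)        # dragged toward the outlier direction (y-axis)
u_trim = trimmed_first_pc(X)  # stays aligned with the clean data (x-axis)
```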


Experiments
Experiments
The proposed method is applied to image classification and clustering with various configurations. We shall compare the performance of TC-PCA with the existing methods, including PCA [12,51,52], half-quadratic PCA (HQ-PCA) [29], l_1-PCA [14], robust PCA (RPCA) [15], optimal mean RPCA (OM-RPCA) [28] and avoid mean PCA (AM-PCA) [24,25], using the Japanese Female Facial Expression Database (JAFFE) [53], Yale [54] and A.M. Martinez and R. Benavente (AR) [55] face databases and the Columbia University Image Library (COIL)-20 image database [56,57]. In this paper, we shall use all images from the JAFFE and Yale face databases, whereas 5 men and 5 women are randomly selected from the AR face database for our experiments. For the COIL-20 database, we randomly selected 10 objects with 6 views that are near-frontal views of the objects. We artificially impose outliers on two of the objects by placing random bars in the middle part of the images. Figure 5 shows some examples of images with various configurations of the four databases.


Image Classification
To evaluate the classification performance of the various methodologies on outlier problems, we first divide the images of each database into two parts, namely, non-standard and standard. The former is defined as the group of images taken under different lighting conditions or with occlusions, whereas the latter represents normal images with various forms or facial expressions. Let N_i, i = 1, 2, . . ., q, be the number of images of the ith individual in the standard part of a particular database, where q is the number of individuals in the respective database, and let r be the number of images randomly selected from these N_i images. In the training phase, N_i − r images in the standard part and half of the images in the non-standard part are randomly selected to form the training set, whereas the remaining images (i.e., r images of each individual in the standard part and the other half of the images in the non-standard part) are used for testing. Let Φ be any image in the training set and Ψ be a testing image. Both Φ and Ψ are projected onto u to obtain the weight vectors w_Φ = u^T(Φ − m) and w_Ψ = u^T(Ψ − m) [58], where m is the mean of the training set. The testing image is then assigned to a class by the 1-nearest neighbor classifier, which has been extensively used in image classification problems [24,28]. To quantify the classification performance of each methodology, we adopt the averaged maximum classification rate (AMCR) and the averaged maximum F1-score (AMF1-score) with respect to the optimal number of orthonormal projection vectors. The classification rate for class c is defined as

CR(c) = (TP(c) + TN(c)) / (TP(c) + FP(c) + FN(c) + TN(c)), (17)

where TP(c), FP(c), FN(c) and TN(c) are the true positives, false positives, false negatives and true negatives for class c, respectively. The maximum classification rate is the classification rate attained with the optimal number of projection vectors. The F1-score is defined as

F1(c) = 2 · precision(c) · recall(c) / (precision(c) + recall(c)), (18)

where precision and recall are calculated as

precision(c) = TP(c) / (TP(c) + FP(c)), (19)

recall(c) = TP(c) / (TP(c) + FN(c)). (20)

Similar to the classification rate, the
maximum F1-score is the F1-score with the optimal number of projection vectors.Now, we explain the procedure to obtain the optimal number of projection vectors.
For each r and its associated random training set, the optimal number of projection vectors is selected by the holdout set method [59]. First, a pair of holdout sets is created by splitting the above training set into two: 70% of the samples are randomly selected to form the holdout set for training, while the rest form the holdout set for testing. The optimal number of projection vectors is then the number of dimensions that produces the best classification rate (i.e., CR) with the 1-nearest neighbor classifier applied to the holdout set for testing 10 times. The averaged maximum classification rate (AMCR), averaged maximum F1-score (AMF1-score) and averaged optimal number of projection vectors (Avg. Dim.) are then computed by repeating all the above procedures on the training and testing sets 10 times.
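The per-class quantities in (17)-(20) can be computed as follows. This is a minimal sketch; the function name is ours, and the classification rate is taken to be the usual per-class accuracy built from the four counts.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, c):
    """Per-class counts and metrics as in (17)-(20): classification rate
    (accuracy over all samples for class c) and F1-score, computed from the
    true/false positive and negative counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    tn = np.sum((y_pred != c) & (y_true != c))
    cr = (tp + tn) / (tp + fp + fn + tn)                    # (17)
    precision = tp / (tp + fp) if tp + fp else 0.0          # (19)
    recall = tp / (tp + fn) if tp + fn else 0.0             # (20)
    f1 = (2 * precision * recall / (precision + recall)     # (18)
          if precision + recall else 0.0)
    return cr, f1
```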
Table 1 shows the comparative classification performance, AMCR, Avg. Dim. (in parentheses) and AMF1-score (in square brackets), with respect to the number of testing images selected (r = 1 to 9) for each individual in the standard part using the JAFFE face database. Note that both the training and testing sets contain only standard images and thus, we shall evaluate the performance of the various methods without non-standard images. As can be seen from Table 1, the proposed method generally performs better than the other methods. The TC-PCA usually achieves the highest AMCR and AMF1-score regardless of the number of testing images. Moreover, the TC-PCA provides a more effective way of performing dimensionality reduction than the other PCA methods, as it usually performs better than the classifier with all dimensions; this may not be the case for the other PCA methods. Table 2 summarizes the AMCR and AMF1-score with respect to r using the Yale face database. We remark that the training and testing sets consist of both standard (i.e., normal face images) and non-standard images (i.e., face images under different lighting conditions). The percentages of non-standard images in the training set when r = 1, 2, 3 and 4 are 17.97%, 20.35%, 23.47% and 27.71%, respectively. In addition to reporting the overall AMCR and AMF1-score, we shall assess the performance of the various methodologies in classifying non-standard and standard images separately. The following are the main points we observe from Table 2.

1.
The overall AMCR and AMF1-score of TC-PCA are usually the highest among the PCA methods for any number of testing images, exceeding the other PCA methods by up to 4.6% in both AMCR and AMF1-score. When the number of testing images is one (i.e., r = 1), the proposed method performs better than the second best method, OM-RPCA, by 2.4% in AMCR and 1.4% in AMF1-score. Besides, the overall AMCR and AMF1-score of every PCA method increase with the number of testing images and converge to approximately 81%. As the number of standard images increases (i.e., more testing images), the recognition problem becomes easier for most methods, which leads to higher overall AMCRs and AMF1-scores. For a smaller number of testing images, the portion of non-standard images in the testing set is higher and a method must learn more precise features to correctly classify the images. This shows that the proposed method learns more effective features than the other PCA methods.

2.
The TC-PCA outperforms the other methods in classifying non-standard images by up to 8.26% in AMCR and 7.2% in AMF1-score. These results experimentally confirm that the proposed method can successfully discard the information introduced by non-standard images, such as face images under various lighting conditions. On the other hand, the performance of all methods in classifying standard images is similar (around 97%-100%), which is consistent with the findings for the JAFFE face database shown in Table 1.

3.
The overall classification accuracy and F1-score vary with the number of testing images for the Yale face database, whereas the classification performance is not sensitive to the number of testing images for the JAFFE face database. Such a difference is mainly due to the ability of each methodology to recognize non-standard images. Our experimental results reveal that the proposed method has a superior performance in classifying standard images; moreover, the use of the trimming set in the TC-PCA model makes the proposed method insensitive to outliers, which improves the non-standard image classification performance. These results indicate that the overall classification performance of all methods is comparable for the face databases with only standard images, while the proposed method outperforms all other methods for the face databases consisting of both standard and non-standard images.

4.

The proposed method needs 9 to 21 fewer projection vectors to achieve high classification rates and F1-scores compared with the other PCA methods. Experiments also show that the TC-PCA method usually performs better than the classifier with all dimensions (i.e., the size of an image, which is 81 × 61 = 4941) and only needs 18 to 35 dimensions to achieve good results. This may not be the case for other dimensionality reduction techniques. This shows that the TC-PCA method can effectively discard the noisy information caused by the curse of dimensionality and better represent the key characteristics of the data. Next, we compare the classification performance of all methods using the AR face database. In this database, we perform three experiments. The first experiment (Normal + Lighting) uses all normal face images and the images under lighting conditions. Similarly, the second and third experiments use all normal face images together with the images with sunglasses (Normal + Sunglasses) and scarves (Normal + Scarves), respectively. Note that the percentages of non-standard images in the training set when r = 1, 2, 3 and 4 are 17.65%, 20.00%, 23.08% and 27.27%, respectively, and the classification results are shown in Tables 3-5. Similar to the experimental results for the JAFFE and Yale face databases, the proposed method performs well and its overall AMCR and AMF1-score are usually the highest. It usually outperforms the other methods in classifying non-standard images in all three experiments. We remark that the proposed method performs significantly better than the other methods in recognizing non-standard images, by 36%-42.7%, in the "Normal + Sunglasses" experiment with r = 2.
Besides, the overall AMCRs and AMF1-scores indicate that the TC-PCA again performs better than the classifier with all dimensions for any number of testing images. This may not be achieved by the other PCA methods. However, the AMCRs for non-standard image classification of all methods are less than 20% in the "Normal + Scarves" experiment. This unsatisfactory performance suggests that more advanced facial features are necessary to correctly classify some non-standard images. Now, we show that the projection vectors estimated by the proposed TC-PCA can provide inter-cluster information that is beneficial to classification and clustering problems. Figure 6 shows the projected data with the projection vectors estimated by the l_1-PCA and the proposed TC-PCA in three different cases (AR (Normal + Lighting), AR (Normal + Sunglasses) and AR (Normal + Scarves)). Here, we select AR (Normal + Lighting) as an example. It is obvious that the projected data consist of two compact clusters with some outliers. The proposed TC-PCA can ignore the negative impact of the non-standard images and successfully identify the two clusters, in which the data objects in the same cluster belong to the same classes. No data objects from the same classes are assigned to two different clusters. Moreover, the vector difference between the two cluster centers, which is the projection vector estimated by the TC-PCA, provides the inter-class information. This is important for classification and clustering problems. However, the l_1-PCA is greatly affected by the non-standard images and its first projection vectors are driven towards those images. This explains why the performance of TC-PCA is better than the other methods by 6% in AR (Normal + Scarves) and by 10%-21% in AR (Normal + Lighting) and AR (Normal + Sunglasses) for r = 1. We also compare the performance of the different PCA methods on an object classification problem using the COIL image database. In this
experiment, the percentages of non-standard images in the training set when r = 1, 2, 3 are 16.67%, 20% and 25%, respectively. The classification rates are shown in Table 6. The proposed method usually performs well and its overall AMCR and AMF1-score are usually the highest. Similar to the face databases, the proposed method performs better than the other PCA methods in recognizing the non-standard images. Moreover, the proposed method usually uses fewer projection vectors to give good results. It is also remarkable that the performance of the proposed method is as good as the all-dimensional case while using a much lower dimension to obtain the same results. This shows that the proposed method can extract useful information that effectively represents the data.

Image Clustering
In addition to face image classification, experimental results [60,61] showed that PCA can be used as a pre-processing step to enhance the accuracy of k-means clustering. In this experiment, we shall show that the clustering performance in the TC-PCA subspace is better than that in the other PCA subspaces. The experimental setting is as follows: the database is projected onto U and the clustering performance is evaluated in the respective subspace. The clustering rate is obtained from the known class labels via

clustering rate = (1/n) Σ_i |C_i ∩ T_{j(i)}|, (21)

where C_i and T_{j(i)} are the ith cluster and its corresponding set of training labels, respectively. In the above calculation, each cluster C_i is assigned to one training label j(i) only; no two or more clusters are assigned to the same training label. Since estimating the optimal number of dimensions for an unsupervised learning problem is still a challenging research problem, we use the normalized area under the curve (NAUC) of the clustering rates over the number of orthonormal projection vectors, D, to quantify the results. Note that for each value of D, we repeat the k-means clustering 30 times with the same set of starting vectors and keep the clustering result with the smallest objective value.
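The clustering rate (21), with its one-to-one matching constraint between clusters and class labels, can be sketched as follows. Brute force over label permutations is used here, which is adequate for a small number of clusters; for many clusters an assignment solver would be preferable.

```python
import numpy as np
from itertools import permutations

def clustering_rate(y_true, y_pred):
    """Clustering rate as in (21): each cluster is matched to a distinct
    class label (one-to-one) and the best such matching is scored as the
    fraction of samples falling into a matched cluster/label pair."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0
    for perm in permutations(classes, len(clusters)):
        hits = sum(int(np.sum((y_pred == cl) & (y_true == perm[i])))
                   for i, cl in enumerate(clusters))
        best = max(best, hits)
    return best / len(y_true)
```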
To evaluate the robustness of the PCA methods on a data clustering problem in the presence of outliers and noisy dimensions, we consider the following synthetic dataset. The dataset has 103 dimensions. The first three dimensions contain 10 clusters with 100 outliers around them. Each cluster follows a normal distribution with 100 samples; thus, the dataset has 1100 samples in total. The remaining dimensions are generated by a mixture of normal and uniform distributions and do not carry any information about the 10 clusters. To compute the clustering rate of Equation (21) for the different PCA methods, we only consider the class labels of the 10 clusters and ignore the outliers in the calculation. In other words, we apply the k-means clustering algorithm to all 1100 samples but evaluate only on the 1000 samples belonging to the 10 clusters. Table 7 shows the clustering results after applying the different PCA methods. The TC-PCA method performs the best. It is 40% better than the clustering result obtained without applying any PCA method. Besides, it performs better than other PCA methods such as RPCA by nearly 35%. This implies that the proposed method is not confused by the noisy dimensions and is able to cluster the data more effectively. For the synthetic data, the range of D is from 1 to 3.
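A dataset with this structure can be generated along the following lines. The exact cluster means, scales and outlier placement are not specified in the text, so the values below are illustrative assumptions.

```python
import numpy as np

def make_synthetic(seed=0):
    """Sketch of the synthetic set: 10 Gaussian clusters (100 samples each)
    living in the first 3 dimensions, 100 surrounding outliers, plus 100
    uninformative dimensions drawn from a normal/uniform mixture."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-10, 10, size=(10, 3))
    informative = np.vstack([rng.normal(c, 0.5, size=(100, 3)) for c in centers])
    labels = np.repeat(np.arange(10), 100)          # labels of the 1000 cluster points
    outliers = rng.uniform(-15, 15, size=(100, 3))  # scattered around the clusters
    first3 = np.vstack([informative, outliers])
    n = len(first3)                                  # 1100 samples in total
    mask = rng.random((n, 100)) < 0.5                # normal/uniform mixture
    noise = np.where(mask,
                     rng.normal(size=(n, 100)),
                     rng.uniform(-1, 1, size=(n, 100)))
    return np.hstack([first3, noise]), labels

X, labels = make_synthetic()
```

Only the 1000 labeled samples enter the clustering-rate evaluation; the 100 outliers are clustered but not scored, matching the protocol above.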
Table 7 shows the comparative clustering performance for the databases used in the face image classification experiments. Generally speaking, the proposed method performs better than HQ-PCA and RPCA by around 0.5%-4.80% and better than the rest by around 1.78%-5.03%. The results indicate that the TC-PCA subspace is better suited for clustering than the other PCA subspaces. Moreover, the NAUC of the proposed method is higher than the clustering result with all dimensions in all five experiments, which may not be achieved by the other PCA methods.

Parameter Study
In this section, we study the effectiveness of the two parameters of the SPVEA. First, we study the effectiveness of the trimming parameter p, which is set to 75 in all the experiments. Then, we study the robustness of the cluster center initialization of the SPVEA.

Effectiveness of the Trimming Parameter
In this subsection, we study the effect of the parameters of TC-PCA, namely the trimming parameter p and the number of testing images selected for each individual in the standard part, r, on the classification performance, as shown in Table 8. Here, we choose p = 75, 80, 85, 90, 95 and r = 1, 2, 3, 4. There are two important points that can be observed from Table 8. First, the classification performance is not sensitive to r for the JAFFE face database, whereas the AMCR increases with r for the Yale and AR face databases at the same value of p, which is consistent with our previous results in Section 4.1 (see also the discussions in that section). Second, the classification performance for the JAFFE face database is stable for any value of p, whereas the AMCR improves from p = 95 to p = 75 for the Yale and AR face databases at the same value of r. These results further justify that the AMCR for the databases with both standard and non-standard images increases gradually as a portion of the outlier images in the training set is removed, which in turn reveals the effectiveness of the trimming set in the TC-PCA model. It is important to note that we vary the parameter D from 1 to 70 for image classification and clustering, but our experimental results show that increasing the value of D beyond 70 does not affect the AMCR and NAUC. Beyond the above analysis, we also find that the parameter setting p = 75 is the best for these face databases. We compare the performance of this setting with the parameter estimated using the holdout method described in Section 4.1, which was used to estimate the optimal number of projection vectors. The result is shown in Table 9. We can see that, except for r = 2 in AR (Normal + Sunglasses) and r = 4 in AR (Normal + Scarves), the fixed setting p = 75 performs better than the automatic parameterization.

Robustness of the Cluster Center Initialization

In this subsection, we study the robustness of the cluster center initialization procedure of the SPVEA. We apply the SPVEA 30 times with different sets of initial guesses to each of the four face databases with r = 1 and obtain the corresponding projection matrices. We measure the difference among these projection matrices by the following formula:

Diff(D_V) = std(‖V(i)^T V(j)‖_1 / D_V), (22)

where V(i) is the projection matrix obtained by the SPVEA with the ith set of initial guesses and D_V is the number of projection vectors of the projection matrix. Diff(D_V) computes the standard deviation of the normalized inner products between any two projection matrices obtained with different sets of initial guesses. If all projection matrices V(i) are similar to each other, ‖V(i)^T V(j)‖_1 / D_V will be close to one and the standard deviation will therefore be very small. Table 10 shows the values of Diff(D_V) for the four face databases under three settings: D_V = 1, D_V = 10 and D_V = 20. We can see that all values are very small and close to zero. This shows that the projection vectors generated by the SPVEA are robust to the cluster center initialization.

Performance with More Than 50% Outliers

In this subsection, we investigate the case when the number of outliers exceeds 50%. We set r = 8 for the four face databases, which means there are around 60% non-standard images in each situation. The results are shown in Table 11. We can observe that the performance of any PCA method is usually not as good as that of the classifier without applying any PCA method. The reason may be that the current PCA methods attempt to find the common components that best represent the data; given the scattered nature of the outliers, this can confuse most PCA methods. Although the proposed method does not perform as well as the classifier without applying any PCA method, it achieves the best, 2nd best and 3rd best performance among the PCA methods on Yale, AR (Normal + Lighting), AR (Normal + Sunglasses) and AR (Normal + Scarves), respectively.
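Formula (22) can be computed as follows. We interpret ‖·‖_1 as the entrywise sum of absolute values, so that two identical orthonormal matrices give ‖V(i)^T V(j)‖_1 / D_V = 1; this interpretation is our assumption, since the text does not pin down the matrix norm.

```python
import numpy as np
from itertools import combinations

def diff_measure(mats):
    """Diff(D_V) of (22): the standard deviation, over all pairs of runs,
    of ||V(i)^T V(j)||_1 / D_V, with ||.||_1 taken entrywise. Identical
    orthonormal matrices across all runs yield a value of zero."""
    d_v = mats[0].shape[1]
    vals = [np.abs(a.T @ b).sum() / d_v for a, b in combinations(mats, 2)]
    return float(np.std(vals))
```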

Figure 1 .
Figure 1.Example frontal faces and faces with scarves.



Figure 2 .
Figure 2. Illustration of the clustering property of the principal component analysis (l 1 -PCA).

The second equality holds because the maximum of the l_1-norm is attained at the same point as the maximum of the squared l_1-norm. That is, arg max_{v∈{v_1,v_2}} ‖u^T(x_i − v)‖_1 = arg max_{v∈{v_1,v_2}} ‖u^T(x_i − v)‖_1^2.

Figure 3 .
Figure 3. (a) First projection vector obtained by l 1 -PCA.(b) Cluster centers obtained by two-group k-means clustering model.

Figure 4 .
Figure 4. First projection vectors obtained by (a) l1-PCA and (b) trimmed clustering principal component analysis (TC-PCA) for the dataset with outliers.

To show the effectiveness of dimensionality reduction, the performance of the data with all dimensions (All Dim.) is also included. The JAFFE face database contains 213 images of 7 facial expressions posed by 10 Japanese female models. Each image has 256 gray levels per pixel and is resized to 90 × 90 pixels. The Yale face database consists of 165 8-bit grayscale images of 15 individuals. Each image is resized to 81 × 61 pixels, aligned by the positions of the two eyes, and each individual has 11 images, with 8 normal faces and 3 face images under different lighting conditions. The AR face database consists of 126 individuals with various facial expressions, illumination conditions and occlusions. The face portion of each image was cropped and resized to 99 × 72 pixels. Each individual has 8 normal faces, 6 faces under various lighting conditions, 6 faces with sunglasses and 6 faces with scarves. The COIL-20 (Columbia University Image Library) database [56,57] contains 1440 images separated into twenty categories, each with 72 images. Each image is resized to 90 × 90.


Figure 6 .
Figure 6.The 1st and 2nd projection vectors obtained by l 1 -PCA and TC-PCA for AR Face Database with (a) Normal + Lighting, (b) Normal + Sunglasses and (c) Normal + Scarves.

Table 1 .
Comparative Classification Performance for JAFFE Face Database.

Table 2 .
Comparative Classification Performance for Yale Face Database.

Table 3 .
Comparative Classification Performance for AR Face Database (Normal + Lighting).

Table 4 .
Comparative Classification Performance for AR Face Database (Normal + Sunglasses).

Table 5 .
Comparative Classification Performance for AR Face Database (Normal + Scarves).

Table 6 .
Comparative Classification Performance for Coil.

Table 8 .
Classification Performance (averaged maximum classification rate (AMCR)) of TC-PCA for Various Parameters.

Table 9 .
Classification Performance (AMCR) of TC-PCA with Holdout Estimation for the Trimming Parameter.

Table 11 .
Comparative Classification Performance of Different PCA Methods with Around 60% Outliers.