Hierarchical Discriminant Analysis

The Internet of Things (IoT) generates lots of high-dimensional sensor intelligent data. The processing of high-dimensional data (e.g., data visualization and data classification) is very difficult, so it requires excellent subspace learning algorithms to learn a latent subspace to preserve the intrinsic structure of the high-dimensional data, and abandon the least useful information in the subsequent processing. In this context, many subspace learning algorithms have been presented. However, in the process of transforming the high-dimensional data into the low-dimensional space, the huge difference between the sum of inter-class distance and the sum of intra-class distance for distinct data may cause a bias problem. That means that the impact of intra-class distance is overwhelmed. To address this problem, we propose a novel algorithm called Hierarchical Discriminant Analysis (HDA). It minimizes the sum of intra-class distance first, and then maximizes the sum of inter-class distance. This proposed method balances the bias from the inter-class and that from the intra-class to achieve better performance. Extensive experiments are conducted on several benchmark face datasets. The results reveal that HDA obtains better performance than other dimensionality reduction algorithms.


Introduction
In recent years, the high penetration rate of Internet of Things (IoT) in all activities of everyday life is fostering the belief that for any kind of high-dimensional IoT data there is always a solution able to successfully deal with it. The proposed solution is the dimensionality reduction algorithm. It is well known that the high dimensionality of feature vectors is a critical problem in practical pattern recognition. Naturally, dimensionality reduction technology has been shown to be of great importance to data preprocessing, such as face recognition [1] and image retrieval [2]. It reduces the computational complexity through reducing the dimension and improving the performance at the same time. A general framework [3] for dimensionality reduction defines a general process according to describing the different purposes of the dimensionality reduction algorithm that was proposed. Among this general framework, the dimensionality reduction is divided into unsupervised algorithms and supervised algorithms.
The unsupervised algorithm PCA [4] adopts linear transformation to achieve dimensional reduction commendably. However, PCA shows a noneffective performance in handling manifold data. Then, LLE [14] is proposed, which firstly uses linear coefficients to represent the local geometry of a given point, then explores a low-dimensional embedding for reconstruction in the subspace. However, LLE only defines mappings on the training data which means it is ill-defined in defining mappings on the testing data. To remedy this problem, NPE [6], orthogonal neighborhood preserving projection (ONPP) [15] and LPP are put on the table. They figure out the problem of LLE while preserving the original structure. NPE and ONPP solve the generalized eigenvalue problem with different constraint conditions. LPP [5] reserves the local structure by preferring an embedding algorithm. This algorithm extends to new points easily. However, as an unsupervised dimensionality reduction algorithm, LPP only shows good performance in datasets without labels.
To address the problem that unsupervised algorithms cannot work well in classification tasks, many supervised algorithms are proposed, such as linear discriminant analysis (LDA) [16], which maximizes the inter-class scatter, minimizes the intra-class scatter simultaneously and finds appropriate project directions for classification tasks. However, LDA still has some limitations. For instance, LDA only considers the global Euclidean structure. Marginal Fisher analysis [10] is proposed as an improved algorithm to surmount this problem. MFA constructs the penalty graph and the intrinsic graph to hold the local structure thereby solving the limitation problem of LDA. Both LDA and MFA can discover the projection directions that simultaneously maximize the inter-class scatter and minimize the intra-class scatter. DAG-DNE compensates for the small inter-class scatter in the subspace of DNE [11]. DAG-DNE constructs a homogeneous neighbor graph and heterogeneous neighbor graph to maintain the original structure perfectly. It maximizes the margin between the inter-class scatter and intra-class scatter so that points in the same class are compact and points in the different classes become separable in the subspace at the same time. However, DAG-DNE and all above dimensionality reduction algorithms, optimize intra-class and inter-class simultaneously. The inter-class scatter is larger than the intra-class scatter, thus it will aim at inter-class but ignore the intra-class. Thus, the huge difference between the sum of inter-class distance and the sum of intra-class distance for distinct datasets may cause a bias problem, which means the impact of intra-class distance is tiny in optimization tasks. Naturally, the separation optimization idea comes to mind.
In this paper, we proposed a novel supervised subspace learning algorithm to further increase the performance, called hierarchical discriminant analysis. More concretely, we construct two adjacency graphs by distinguishing homogeneous and heterogeneous neighbors in HDA. Then HDA optimizes inter-class distance and intra-class distance independently. We minimize the intra-class distance first, then maximize the inter-class distance. Because of the hierarchical work, the optimization of intra-class distance and inter-class distance are detached. The process of optimization does not have to be biased to the inter-class scatter. Both of them get the best results. Thus, the influence between classes can be eliminated and we can find a good projection matrix.
The rest of this paper is structured as follows. In Section 2 we introduce the background on MFA, LDNE and DAG-DNE.I. Section 3, we dedicate to introducing HDA and revealing its connections with MFA, LDNE and DAG-DNE. In Section 4, several simulation experiments applying our algorithm are presented to verify its effectiveness. This is followed by the conclusions in Section 5.

Backgroud
In this section we emphatically introduce three classical algorithms, MFA [10], LDNE [12] and DAG-DNE [13], which are related to our research point.

Marginal Fisher Analysis
MFA proposes a new criteria. It can conquer the limitation of LDA by characterizing the intra-class compactness and inter-class separability. MFA algorithm follows the steps below: (1) Construct the adjacency matrices. There are intrinsic graph characterizing the intra-class compactness and penalty graph characterizing the inter-class separability. For each sample x i , the intra-class matrix S w is defined as: where N K 1 (x i ) represents the index set of the K 1 nearest neighbors of node x i in the same class.
For each sample x i , the inter-class matrix S b is defined as: where P K 2 (x i ) expresses the set of the K 2 nearest neighbors of point x i that are in the different classes. (2) Marginal Fisher Analysis Criterion. Find the optimal projection direction by minimizing the intra-class compactness and maximizing the inter-class separability at the same time.
and P is composed of the optimal r projection vectors, X is the set of samples.

Locality-Based Discriminant Neighborhood Embedding
LDNE uses a weight function instead of 1 or 0 to adopt the adjacency graph. It maximizes the difference between the inter-class scatter and the intra-class scatter to capture the top projection matrix. The LDNE algorithm is done by the following steps: (1) Use K-nearest neighbors to construct the adjacency. The adjacency weight matrix S is declared as: where S ω (x i ) denotes the intra-class neighbors of each sample point x i , S b (x i ) denotes the inter-class neighbors of x i , and the parameter β is a regulator. (2) Feature mapping: Optimize the following objective function: where H = D − S, and D is a diagonal matrix with D ii = ∑ j S ij . Function (5) can be considered as a generalized eigen-decomposition problem: The optimal projection P consists of r eigenvectors corresponding to the r largest eigenvalues.

Double Adjacency Graphs-Based Discriminant Neighborhood Embedding
DAG-DNE is a linear manifold learning algorithm based on DNE, which constructs homogeneous and heterogeneous neighbor adjacency graphs. This algorithm follows the steps below: (1) Construct two adjacency graphs. Let A w and A b be the intra-class and inter-class adjacency matrices. x i is the sample point. The intra-class adjacency matrix A w is defined as The inter-class adjacency matrix A b is defined as (2) Optimize the following objective function by finding a projection P.
The projection matrix P can be solved through the generalized eigenvalue problem as follows: The optimal projection P consists of r eigenvectors corresponding to the r largest eigenvalues.
From optimizing the equation of the related work we can see that the optimization work is achieved through maximizing the distance of inter-class minus the distance of the intra-class or maximizing the distance inter-class divided by the intra-class. Thus, the inter-class and intra-class are simultaneously optimized. Our hierarchical discriminant analysis separates the inter-class and the intra-class. This method avoids the interference between the inter-class and the intra-class. The detail of our algorithm will be presented in the next part.

Hierarchical Discriminant Analysis
This section discusses a novel dimensional reduction algorithm called hierarchical discriminant analysis. In MFA and DAG-DNE, two adjacency matrices for a node's homogenous neighbors and heterogeneous neighbors are optimized simultaneously. The huge difference between the sum of inter-class distance and the sum of intra-class distance for distinct data may cause a bias problem, which is too weak for intra-class scatter to play a fundamental role. HDA considers the adjacency graphs respectively, which optimizes the sum of distances between each node and its neighbors of the same class firstly, and then optimizes the sum of distances between the nodes of different classes.

HDA
is the set of training points, where N is the number of training points, x i ∈ R D , y i ∈ {1, 2, . . . , c}, y i is the class label of x i , d is the dimensionality of points, and c is classes' number. For these unclassified training sets, we come up with a potential manifold subspace. We use matrix transformation to project the original high-dimensional data into a low-dimensional space. By doing so, homogeneous samples are compacted and the heterogeneous samples are scattered in low-dimensional data. Consequently, the computational complexity and classification performance are improved.
Similar to DAG-DNE [13], HDA constructs two adjacency matrices respectively, the intra-class and inter-class adjacency matrices. Given a sample x i , we suppose the set of its k homogenous and heterogeneous neighbors are π + k (x i ) and π − k (x i ). We construct the intra-class adjacency matrix F w and the inter-class adjacency matrix F b .
And the local intra-class scatter is defined as: where D w is a diagonal matrix and its entries are column sum of F w , i.e., D w ii = ∑ j F w ij . First, we minimize the intra-class scatter, which means: min Φ (P 1 ) s.t.P 1 T P 1 = I The objective function can be rewritten by some algebraic steps: where S =D w −F w . Therefore, the form of trace optimization function can be rewritten as While the local inter-class scatter is defined as: where D b is a diagonal matrix and its entries are the column sum of F b , i.e., D b ii = ∑ j F b ij . Now, we maximize the inter-class scatter: max Ψ (P) s.t.P T P = I (18) It is worth noting that the training set has changed since first minimization, and we rewrite this function as: where X new = XP 1 .
So that the optimization problem as shown below: Since S and M are real symmetric matrices, the optimization scatters (17) and (20) are same as the eigen-decomposition problem of the matrices XSX T and XMX T . The two projection matrices P are composed of the egienvetors that corresponding to the egienvalues of XSX T and XMX T . The optimal solutions P 1 and P have the form P = [p 1 , . . . , p r ].
Therefore, the image of any point x i can be represented as v i = P T x i . The details about HDA are given below (Algorithm 1).

Algorithm 1 Hierarchical Discriminant Analysis
, and the dimensionality of discriminant subspace r. Output: Projection matrix P. 1: Compute the intra-class adjacency graph F w F w ij = +1, x i ∈ π + k x j or x j ∈ π + k (x i )0 otherwise and the inter-class adjacency matrix F b .
Minimize the intra-class distance by decomposing the matrix XSX T , where S =D w −F w .

Comparisons with MFA, LDNE and DAG-DNE
HDA, LDNE and DAG-DNE are all dimensionality reduction algorithms. In this section, we will probe into the relationships between HDA and the other three algorithms.

HDA vs. MFA
These two algorithms both build an intrinsic(intra-class) graph and penalty(inter-class) graph to keep the original structure of the given data. MFA designs two graphs to find the projection matrix which characterizes the intra-class closely and inter-class separately. However, MFA chooses the dimensionality of the discriminant subspace by experience, which causes some important information to be lost. HDA maximizes the distance of the inter-class and minimizes the distance of the intra-class hierarchically. It uses this method to find the projection matrix to estimate the dimension of the discriminant subspace.

HDA vs. LDNE
Both HDA and LDNE are supervised subspace learning algorithms and build two adjacency graphs to keep the original structure of the given data. LDNE assigns different weights to intra-class neighbors and inter-class neighbors of a given point. It follows the 'locality' idea of LPP, and produces balanced links through the set up connection between each point and its heterogeneous neighbors. By analyzing the experimental result, we get an effectiveness result on the subspace.

HDA vs. DAG-DNE
DAG-DNE algorithm maintains balanced links by constructing two adjacency graphs. Through this method, samples in the same class are compact and samples in different classes are separable in the discriminant subspace. However, DAG-DNE optimizes the heterogeneous neighbors and homogeneous neighbors simultaneously. By doing this, the distance of inter-class scatter is not wide enough in the subspace. The HDA algorithm also constructs two graphs, and separately minimizes the within-class distance first, then maximizes the inter-class distance. Because of the hierarchical work, the optimization of intra-class distance and inter-class distance are detached. The process of optimization is not biased to the inter-class scatter. As a consequence, the intra-class is compact while the inter-class is separated in the subspace.

Experiments
We illustrate a set of experiments to confirm the performance of HDA on image classification in this part. Several benchmark datasets are used, i.e., Yale, Olivetti Research Lab (ORL) and UMIST. The samples from three datasets are shown in Figures 1-3. We compare HDA with other representative dimensional reduction algorithms, including LPP, MFA and DAG-DNE.W. Then we choose the nearest neighbor parameter K for those algorithms when constructing the adjacency graphs. We exhibited the results in the form of mean recognition rate.

Yale Dataset
The Yale datset is constructed by the Yale Center for Computational Vision and Control. It contains 165 gray scale images of 15 individuals. These images display variations in lighting condition and facial expression (normal, happy, sad, sleepy, surprised and wink). Each image is 32 × 32 dimension.
We choose 60% from the dataset as training samples, and the rest of the samples for testing. The nearest neighbor parameter K can be taken as 1, 3 and 5. Figure 4 shows the average accuracy rate of K = 1 after running the four algorithms 10 times. Figure 5 with K = 3 and Figure 6 with K = 5 also obtain similar results like Figure 4 after being run 10 times. From these three figures, we can see that HDA has a higher accuracy rate than other three algorithms and we can gain the best performance.

UMIST Dataset
The UMIST dataset consists of 564 images of 20 individuals. The dataset takes race, sex and appearance into account. Each individual takes several poses from profile to frontal views. The original size of each image is 112 × 92 pixels. In our experiments, the whole dataset is resized to 32 × 32 pixels.
Similar to applying the Yale dataset, we use UMIST dataset to test the recognition rate of the proposed algorithm and the other three algorithms. We select 20% from the UMIST dataset as training samples and use others as testing samples. Figures 7-9 show the average accuracy rate of different K after running the four algorithms 10 times. As shown in these figures, our algorithm reaches the top and presents the best recognition accuracy compare to the other three algorithms.

ORL Dataset
The Olivetti Research Lab dataset is composed of 40 distinct objects. Each of them contains 10 different images. The ORL dataset contains images in different conditions, such us light conditions, facial expressions, etc. All of the images were taken under a dark homogenous background in an upright and frontal position. The size of each image is 112 × 92 pixels, with 256 gray levels per pixel. Similarly, we resize those image to 32 × 32.
Similarly, we choose the nearest neighbor parameter K be 1, 3 and 5. Take 60% training samples from the original dataset, and the other 40% as testing samples. As shown in Figures 10-12, HDA presents the best accuracy rate compared to the other algorithms after 10 runs and HDA is almost the best one under a different number of projection vectors.    Table 1 shows the comparison of average recognition accuracy for different algorithms and different K values. HDA has higher recognition accuracy than the other three algorithms. It improves the classification performance through a dimensional reduction method.

Conclusions
The collection of a huge amount of high-dimensional sensor intelligent data from the Internet of Things (IoT) causes an information overload problem, so the importance of this research is even more prominent. In this paper, a novel algorithm named hierarchical discriminant analysis is proposed, which aims to find a good latent subspace and preserves the intrinsic structure of the high-dimensional intelligent data. Our proposed algorithm constructs two adjacency graphs to preserve the local structure and deal with optimization problems separately. The experimental results show that HDA is more effective than MFA, LDNE and DAG-DNE on three real image datasets. In other words, hierarchical discriminant analysis can generate a good discriminant subspace. However, HDA is still a linear algorithm, so future work will focus on extending it to be nonlinear to improve the classification performance in cloud computing environments [17][18][19].