Locality Preserving and Label-Aware Constraint-Based Hybrid Dictionary Learning for Image Classification

Abstract: Dictionary learning has played an important role in the success of data representation. As a complete view of data representation, hybrid dictionary learning (HDL) is still in its infancy. In previous HDL approaches, the scheme of how to learn an effective hybrid dictionary for image classification has not been well addressed. In this paper, we propose a locality preserving and label-aware constraint-based hybrid dictionary learning (LPLC-HDL) method, and apply it to image classification effectively. More specifically, the locality information of the data is preserved by using a graph Laplacian matrix based on the shared dictionary for learning the commonality representation, and a label-aware constraint with group regularization is imposed on the coding coefficients corresponding to the class-specific dictionary for learning the particularity representation. Moreover, all the constraints introduced in the proposed LPLC-HDL method are based on the l2-norm regularization, so they can be solved efficiently via an alternative optimization strategy. Extensive experiments on benchmark image datasets demonstrate that our method improves over previous competing methods on both hand-crafted and deep features.


Introduction
Due to the insufficiency of raw data representation, Dictionary Learning (DL) has aroused considerable interest in the past decade and achieved much success in various applications, such as image denoising [1,2], person re-identification [3,4] and vision recognition [5][6][7][8]. Generally speaking, DL methods are developed based on a basic hypothesis: a test signal can be well approximated by a linear combination of some atoms in a dictionary. Thus the dictionary usually plays an important role in the success of these applications. Traditionally, DL methods can be roughly divided into two categories: the unsupervised DL methods and the supervised DL methods [9,10].
In the unsupervised DL methods, a dictionary is optimized to reconstruct all the training samples without any label assignment; hence, there is no class information in the learned dictionary. By further integrating the label information into dictionary learning, the supervised DL methods can achieve better classification performance than the unsupervised ones for image classification. Supervised dictionary learning encodes the input signals using the learned dictionary, then utilizes the representation coefficients or the residuals for classification. Thus the discriminative ability of the dictionary and the representative ability of the coding coefficients play the key roles in this kind of approach. According to the types of the dictionary, the supervised DL methods can be further divided into three categories: The class-shared DL methods, the class-specific DL methods and the hybrid DL methods [11].
The class-shared DL methods generally force the coding coefficients to be discriminative via learning a single dictionary shared by all the classes. Based on the K-SVD algorithm, Zhang and Li [5] proposed the Discriminative K-SVD (D-KSVD) method to construct a classification error term for learning a linear classifier. Jiang et al. [6] further proposed the Label Consistent K-SVD (LC-KSVD) method, which encourages the coding coefficients from the same class to be as similar as possible. Considering the characteristics of atoms, Song et al. [12] designed an indicator function to regularize the class-shared dictionary and improve the discriminative ability of the coding coefficients. In general, the test stage of the class-shared DL methods is computationally very efficient, but it is hard to improve the discriminability of the coefficients for better classification performance, as a single class-shared dictionary is insufficient for fitting complex data.
In class-specific dictionary learning, each sub-dictionary is assigned to a single class and the sub-dictionaries of different classes are encouraged to be as independent as possible. As a representative class-specific DL method, Fisher Discrimination Dictionary Learning (FDDL) [13] employs the Fisher discrimination criterion on the coding coefficients, then utilizes the representation residual of each class to establish the discriminative term. Using an incoherence constraint, Ramirez et al. [14] proposed a structured dictionary learning scheme to promote the discriminative ability of the class-specific sub-dictionaries. Akhtar et al. [15] developed a Joint discriminative Bayesian Dictionary and Classifier learning (JBDC) model to associate the dictionary atoms with the class labels via Bernoulli distributions. The class-specific DL methods usually associate a dictionary atom with a single class directly; hence, the reconstruction error with respect to each class can be used for classification. However, the test stage of this category often requires the coefficient computation of test data over many sub-dictionaries.
In the hybrid dictionary learning, a dictionary is designed to have a set of class-shared atoms in addition to the class-specific sub-dictionaries. Wang and Kong [16] proposed a hybrid dictionary dubbed DL-COPAR to explicitly separate the common and particular features of the data, which also encourages the class-specific sub-dictionaries to be incoherent. Vu et al. [17] developed a Low-Rank Shared Dictionary Learning (LRSDL) method to preserve the common features of samples. Gao et al. [18] developed a Category-specific and Shared Dictionary Learning (CSDL) approach for fine-grained image classification. Wang et al. [19] designed a structured dictionary consisting of label-particular atoms corresponding to some class and shared atoms commonly used by all the classes, and introduced a Cross-Label Suppression for Dictionary Learning (CLSDL) to generate approximate sparse coding vectors for classification. To some extent, the hybrid dictionary is very effective at preserving the complex structure of the visual data. However, it is nontrivial to design the class-specific and shared dictionaries with the proper number of atoms, which often has a severe effect on classification performance.
In addition to utilizing the class label information, more and more supervised DL approaches have been proposed to incorporate the locality information of the data into the learned dictionary. By calculating the distances between the bases (atoms) and the training samples, Wang et al. [20] developed a Locality-constrained Linear Coding (LLC) model to select the k-nearest neighbor bases for coding, and set the coding coefficients of other atoms to zero. Wei et al. [21] proposed locality-sensitive dictionary learning to enhance the power of discrimination for sparse coding. Song et al. [22] integrated the locality constraints into the multi-layer discriminative dictionary to avoid the appearance of over-fitting. By coupling the locality reconstruction and the label reconstruction, the LCLE-DL method [7] ensures that the locality-based and label-based coding coefficients are as approximate to each other as possible. It is noted that the locality constraint in LCLE-DL may cause the dictionary atoms from the different classes to be similar, which weakens the discriminative ability of the learned dictionary.
It is observed that real-world object categories not only exhibit marked differences, but are also strongly correlated in terms of visual properties; e.g., faces of different persons often share similar illumination and pose variations, and objects in the Caltech 101 dataset [23] have correlated backgrounds. These correlations are not very helpful for distinguishing the different categories, but without them the data with common features cannot be well represented. Thus, a dictionary learning approach should learn the distinctive features with the class-specific dictionary, and simultaneously exploit the common features of the correlated classes by learning a commonality dictionary. To this end, we propose the locality preserving and label-aware constraint-based hybrid dictionary learning (LPLC-HDL) method for image classification, which is composed of a label-aware constraint, a group regularization and a locality constraint. The main contributions are summarized as follows.
(1) The proposed LPLC-HDL method learns the hybrid dictionary by fully exploiting the locality information and the label information of the data. In this way, the learned hybrid dictionary can not only preserve the complex structural information of the data, but also have strong discriminability for image classification. (2) In LPLC-HDL, a locality constraint is constructed to encourage samples from different classes with similar features to have similar commonality representations; then, a label-aware constraint is integrated to make the class-specific dictionary sparsely represent the samples from the same class, so that a robust particularity-commonality representation can be obtained by the proposed LPLC-HDL. (3) In a departure from the competing methods, which impose the l0-norm or l1-norm on the coefficients, LPLC-HDL consists of l2-norm constraints that can be calculated efficiently. The objective function is solved elegantly by employing an alternative optimization technique.
The rest of this paper is outlined as follows. Section 2 reviews the related work on our LPLC-HDL method. Then Section 3 presents the details of LPLC-HDL and an effective optimization is introduced in Section 4. To verify the efficiency of our method for image classification, the experiments are conducted in Section 5. Finally, the conclusion is summarized in Section 6.

Notation and Background
In this section, we first provide the notation used in this paper, then review the LCLE-DL algorithm and the objective function of hybrid dictionary learning (DL), which can be taken as the theoretical background of our LPLC-HDL method.

Notation
Let X = [X_1, · · · , X_C] ∈ R^(m×N) be a set of N training samples of dimension m with class labels Y_i ∈ [1, · · · , C]; here, C is the number of classes and X_i is a matrix consisting of the N_i training samples of the ith class. Suppose D_0 ∈ R^(m×k_0) and D_p = [D_1, · · · , D_C] ∈ R^(m×K_p) make up the hybrid dictionary D = [D_0, D_p] learned from the training samples X, where k_0 is the atom number of the shared dictionary, K_p = ∑_{i=1}^C k_i denotes the atom number of the class-specific dictionary and k_i is the atom number of the ith class sub-dictionary. Let Z = [Z_1, · · · , Z_C] ∈ R^((k_0+K_p)×N) be the coding coefficients of the training samples X over the hybrid dictionary D; then Z_i^0 ∈ R^(k_0×N_i) and Z_i^i ∈ R^(k_i×N_i) represent the coding coefficients of X_i over the shared dictionary and the ith class sub-dictionary, respectively.
According to [24], a row vector of the coefficient matrix Z can be defined as the profile of the corresponding dictionary atom. Therefore, we can define the vector z^r = [z^r_1; z^r_2; · · · ; z^r_C] ∈ R^(N×1) (r = 1, · · · , K) as the profile of atom d_r for all the training samples, where the sub-vector z^r_c = [z^r_1, z^r_2, · · · , z^r_{N_c}]^T ∈ R^(N_c×1) is the sub-profile for the training samples of the cth class.

The LCLE-DL Algorithm
To improve the classification performance, the LCLE-DL algorithm [7] takes both the locality and label information of dictionary atoms into account in the learning process. This algorithm firstly constructs the locality constraint to ensure that similar profiles have similar atoms, then establishes the label embedding constraint to encourage the atoms of the same class to have similar profiles. The objective function of LCLE-DL is defined as follows.
min_{D_p, Z_p, V_p} ||X − D_p Z_p||_F^2 + α tr(Z_p^T L Z_p) + ||X − D_p V_p||_F^2 + β ||V_p − U V_p||_F^2 + γ ||Z_p − V_p||_F^2,  s.t. ||d_k||_2^2 = 1, (1)

where Z_p ∈ R^(K_p×N) and V_p ∈ R^(K_p×N) denote the locality-based and the label-based coding coefficients, L ∈ R^(K_p×K_p) is the graph Laplacian matrix constructed from the atom similarities of the dictionary D_p ∈ R^(m×K_p), and U ∈ R^(K_p×K_p) is the scaled label matrix constructed from the label matrix of the dictionary D_p. ||X − D_p Z_p||_F^2 combined with the second term encodes the reconstruction under the locality constraint; ||X − D_p V_p||_F^2 combined with the fourth term encodes the reconstruction under the label embedding; ||Z_p − V_p||_F^2 is used to transfer the label constraint to the locality constraint. α, β and γ are the regularization parameters, and the constraint on the atoms avoids the scaling issue.
The LCLE-DL algorithm first exploits the K-SVD algorithm to learn sub-dictionaries D_i (i = 1, · · · , C) using the training samples X_i. Similar to the label matrix Y of the training samples, the label matrix of the dictionary D_p can be obtained as B = [b_1, · · · , b_{K_p}] ∈ R^(K_p×C). Then a weighted label matrix is constructed as G = B(B^T B)^(−1/2) ∈ R^(K_p×C). Next, the label embedding of the atoms is defined as

||V_p − U V_p||_F^2, (2)

where U = GG^T ∈ R^(K_p×K_p) is the scaled label matrix of the dictionary D_p; this term makes the coding coefficients V_p have a block-diagonal structure with strong discriminative information.
The learned dictionary inherits the manifold structure of the data via the derived graph Laplacian matrix L, and the optimal representation of the samples can be obtained with the label embedding of the dictionary atoms. By combining the two reconstructions, LCLE-DL ensures that the label-based and the locality-based coding coefficients are as close to each other as possible. However, the locality constraint in LCLE-DL is imposed on the class-specific dictionary, which may cause dictionary atoms from different classes to be similar and thus weakens the discriminative ability of the dictionary.

The Objective Function of Hybrid DL
In recent years, hybrid DL [16,19,25,26] has attracted more and more attention for classification problems. The hybrid dictionary has been shown to perform better than the other types of dictionaries, as it can preserve both the class-specific and the common information of the data. To learn such a dictionary, we can define the objective function of hybrid DL as follows.
min_{D, Z} ∑_{i=1}^C ||X_i − D_0 Z_i^0 − D_i Z_i^i||_F^2 + ψ(D) + φ(Z), (3)

where D = [D_0, D_1, · · · , D_C] ∈ R^(m×K), D_0 is the dictionary shared by all the classes, D_i (i = 1, · · · , C) is the ith class sub-dictionary, Z_i^0 and Z_i^i are the coding coefficients of the samples from the ith class over D_0 and D_i, and ψ(D) and φ(Z) denote regularization functions on the hybrid dictionary and the coding coefficients, respectively. The constraint φ(Z) typically adopts the l1-norm [26] or the l2,1-norm [25,27] for sparse coding.
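To make the shared/class-specific decomposition concrete, the data-fidelity term of this objective can be evaluated with a small numpy sketch (ψ and φ are left abstract; all function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def hybrid_fidelity(X_list, D0, D_subs, Z0_list, Zi_list):
    """Data-fidelity term of hybrid DL:
       sum_i || X_i - D0 Z_i^0 - D_i Z_i^i ||_F^2
    X_list  : per-class sample matrices X_i (m x N_i)
    D0      : shared dictionary (m x k0)
    D_subs  : list of per-class sub-dictionaries D_i (m x k_i)
    Z0_list : per-class codes over the shared dictionary D0
    Zi_list : per-class codes over the own sub-dictionary D_i
    """
    return sum(np.sum((Xi - D0 @ Zi0 - Di @ Zii) ** 2)
               for Xi, Di, Zi0, Zii in zip(X_list, D_subs, Z0_list, Zi_list))
```

If the samples of every class are generated exactly from the shared atoms plus their own sub-dictionary, this term is zero, which is the decomposition the hybrid dictionary aims for.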

The Proposed Method
By learning the shared dictionary D_0, the previous hybrid DL algorithms can capture the common features of the data, but they do not consider the correlation among these features, which reduces the robustness of the learned dictionary. In this section, we first utilize the locality information of the atoms in D_0 to construct a locality constraint and impose it on the coefficients Z_0, so that the correlation of the common features is captured explicitly and the learned dictionary D_0 is robust for the commonality representation. Moreover, once this correlation is excluded, the classification of a query sample will be dominated by the class-specific sub-dictionary corresponding to the correct class, which minimizes the data fidelity.
To endow the class-specific dictionary with discriminative ability, we further introduce a label-aware constraint as well as a group regularization on the distinctive coding coefficients for the particularity representation. Since this constraint is integrated with the locality constraint to reconstruct the input data, the two reinforce each other in the learning process, which results in a discriminative hybrid dictionary for image classification.
Accordingly, the objective function of the proposed LPLC-HDL method can be formulated as follows.
min_{D_0, D_p, Z} ∑_{i=1}^C ( ||X_i − D_0 Z_i^0 − D_p Z_i^p||_F^2 + λ ||P_i Z_i||_F^2 + γ tr(Z_i^p L_i (Z_i^p)^T) ) + η tr(Z_0^T L_0 Z_0),  s.t. ||d_k||_2^2 = 1, (4)

where λ, γ and η are the regularization parameters, which adjust the weights of the label-aware constraint, the group regularization and the locality constraint, respectively.
Here we set the Euclidean length of all the atoms in the shared dictionary D_0 and the class-specific dictionary D_p to 1, which avoids the scaling issue.

The Locality Constraint for Commonality Representation
Locality information of the data has played an important role in many real applications. By incorporating locality information into dictionary learning, we can ensure that samples with common features tend to have similar coding coefficients [7]. Furthermore, the dictionary atoms, which are more robust to noise and outliers than the original samples, can be used to measure the similarity among the common features. Hence, we use the atoms of the shared dictionary to capture the correlation among the common features and construct a locality constraint.
Based on the shared dictionary D_0 ∈ R^(m×k_0), we can construct a nearest neighbor graph M_0 ∈ R^(k_0×k_0) as follows.
M_0^{r,s} = exp(−||d_r − d_s||_2^2 / δ), if d_r ∈ kNN(d_s) or d_s ∈ kNN(d_r); M_0^{r,s} = 0, otherwise, (5)

where δ is a parameter controlling the exponential function, kNN(d_s) denotes the k nearest neighbors of atom d_s, and M_0^{r,s} indicates the similarity between the atoms d_r and d_s. For convenience of calculation, we invariably set the parameters δ = 4 and k = 1, as these values are stable in the experiments.
Once M_0 is calculated, we construct the graph Laplacian matrix L_0 as follows.

L_0 = Q_0 − M_0, (6)

where Q_0 is a diagonal degree matrix with Q_0^{r,r} = ∑_s M_0^{r,s}.
Since L_0 is constructed based on the dictionary D_0, it will be updated in coordination with D_0 in the learning process.
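A minimal numpy sketch of this graph construction, under our reading of Equations (5) and (6) (the Gaussian kernel form and the symmetrization of the k-NN graph are assumptions):

```python
import numpy as np

def knn_similarity_graph(D0, k=1, delta=4.0):
    """Build the symmetric k-NN similarity graph M0 over the atoms of D0.

    D0 : (m, k0) shared dictionary, one atom per column.
    k=1 and delta=4 follow the defaults stated in the paper; the exact
    kernel form exp(-||d_r - d_s||^2 / delta) is our reading of Eq. (5).
    """
    k0 = D0.shape[1]
    # pairwise squared Euclidean distances between atoms
    sq = np.sum((D0[:, :, None] - D0[:, None, :]) ** 2, axis=0)
    M0 = np.zeros((k0, k0))
    for s in range(k0):
        # indices of the k nearest neighbours of atom d_s (excluding itself)
        order = np.argsort(sq[:, s])
        nn = [r for r in order if r != s][:k]
        for r in nn:
            w = np.exp(-sq[r, s] / delta)
            M0[r, s] = M0[s, r] = w  # keep the graph symmetric
    return M0

def graph_laplacian(M0):
    """L0 = Q0 - M0 with Q0 the diagonal degree matrix of M0 (Eq. (6))."""
    return np.diag(M0.sum(axis=1)) - M0
```

By construction every row of L_0 sums to zero, the defining property of a graph Laplacian.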
We can now obtain a locality constraint term based on the graph Laplacian matrix L_0 as follows.

(1/2) ∑_{r,s} M_0^{r,s} ||z_0^r − z_0^s||_2^2 = tr(Z_0^T L_0 Z_0), (7)
Because the profiles z^r, z^s and the atoms d_r, d_s have a one-to-one correspondence, the above equation ensures that similar atoms encourage similar profiles [7]. Hence, the correlated information of the common features can be inherited by the coefficient matrix Z_0 and the graph Laplacian matrix L_0.
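As a quick numerical sanity check (a numpy sketch of the standard Laplacian identity, not part of the paper), the weighted pairwise penalty over atom profiles coincides with the quadratic form tr(Z_0^T L_0 Z_0):

```python
import numpy as np

rng = np.random.default_rng(1)
k0, N = 6, 10
# a symmetric similarity matrix with empty diagonal, and its graph Laplacian
M0 = rng.random((k0, k0)); M0 = (M0 + M0.T) / 2; np.fill_diagonal(M0, 0)
L0 = np.diag(M0.sum(axis=1)) - M0
Z0 = rng.standard_normal((k0, N))  # rows are atom profiles z^r

# (1/2) * sum_{r,s} M0[r,s] * ||z^r - z^s||^2
pairwise = 0.5 * sum(M0[r, s] * np.sum((Z0[r] - Z0[s]) ** 2)
                     for r in range(k0) for s in range(k0))
quad = np.trace(Z0.T @ L0 @ Z0)  # tr(Z0^T L0 Z0)
assert np.isclose(pairwise, quad)
```

Minimizing the trace term therefore pulls the profiles of strongly connected (similar) atoms toward each other, which is exactly the behavior Equation (7) encodes.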

The Constraints for Particularity Representation
To obtain the particularity representation for classification, we assign labels to the atoms of the class-specific dictionary, as presented in [7,19]. If an atom d_t ∈ D_i (i = 1, · · · , C), the label i is assigned to the atom d_t and kept invariant across the iterations. We take S_i as the index set of the atoms of the ith class sub-dictionary D_i, and S as the index set of all the atoms of the class-specific dictionary. For the particularity representation of the samples X_i, it is desirable that the large coefficients mainly occur on the atoms in S_i. In other words, the sub-profiles associated with the atoms in S\S_i need to be suppressed to some extent.
For the ith class samples, we construct a matrix P_i ∈ R^(K×K) to pick up the sub-profiles of the representation Z_i that are located at the atoms in S\S_i rather than the atoms in S_i (i = 1, · · · , C), so that we can define a label-aware constraint term as follows.

||P_i Z_i||_F^2, (8)
and the matrix P_i can be written as

P_i(t_1, t_2) = 1, if t_1 = t_2 and t_1 ∈ S\S_i; P_i(t_1, t_2) = 0, otherwise, (9)

where P_i(t_1, t_2) is the (t_1, t_2)th entry of the matrix P_i. For the particularity representation Z_i^p, minimizing the label-aware constraint suppresses the large values in the sub-profiles associated with the atoms in S\S_i, and encourages large values in the sub-profiles associated with the ith class atoms. Therefore, it is expected that this constraint with a proper scalar can make the particularity representation approximately sparse. Besides, once the matrices P_i (i = 1, · · · , C) are constructed, they remain unchanged across the iterations. Thus coding over the class-specific dictionary is very efficient.
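A small numpy sketch of the selector matrices P_i and the resulting penalty, assuming P_i is the diagonal 0/1 mask described above (leaving shared atoms unpenalized is our assumption for this sketch):

```python
import numpy as np

def make_Pi(atom_labels, i):
    """Diagonal 0/1 selector P_i: picks the rows of Z whose atoms belong to
    *other* classes (atoms of S \\ S_i), so that ||P_i Z||_F^2 penalises
    coefficients placed on the wrong classes' atoms.

    atom_labels : length-K integer array, label of each atom; shared atoms
                  are marked with -1 and left unpenalised (our convention).
    """
    atom_labels = np.asarray(atom_labels)
    mask = ((atom_labels != i) & (atom_labels >= 0)).astype(float)
    return np.diag(mask)

def label_aware_penalty(Z_i, atom_labels, i):
    """The term ||P_i Z_i||_F^2 of the objective (the weight lambda is
    applied by the caller)."""
    Pi = make_Pi(atom_labels, i)
    return np.sum((Pi @ Z_i) ** 2)
```

Because P_i is diagonal and fixed, applying it is a cheap row masking, which matches the remark that coding over the class-specific dictionary stays efficient.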

The Group Regularization
Furthermore, to promote the consistency of the particularity representations within the same class, we introduce a group regularization on Z_i^p. In light of the label information of the training samples, assume each sample corresponds to one vertex; the vertices corresponding to same-class samples are connected to each other, so each class forms a densely connected sub-graph. Considering the training samples X_i and their coding coefficients Z_i^p, we first define k_i graph maps, each mapping the graph to a line consisting of N_i points, as follows.

f_k = [z_i^1(k), z_i^2(k), · · · , z_i^{N_i}(k)]^T  (k = 1, · · · , k_i), (10)

Here z_i^n(k) (n = 1, 2, · · · , N_i) is the kth component of z_i^n, which corresponds to the kth atom in the ith class sub-dictionary D_i.
Then, we can calculate the variation of these k_i graph maps as follows.

∑_{k=1}^{k_i} f_k^T L_i f_k, (11)
where L_i ∈ R^(N_i×N_i) denotes the normalized Laplacian of the overall graph for the ith class, which can be derived as

L_i = I_i − (1/(N_i − 1)) (1_{N_i} 1_{N_i}^T − I_i), (12)

with I_i ∈ R^(N_i×N_i) the identity matrix and 1_{N_i} the all-ones vector, since every vertex of the fully connected class graph has degree N_i − 1. For different classes, the vertices related to their samples are not connected; thus, the graphs of the C classes are isolated from each other. Therefore, we can obtain the total variation over the K_p graph maps of all the C classes as ∑_{i=1}^C tr(Z_i^p L_i (Z_i^p)^T). Keeping this group regularization small promotes the consistency of the representations of same-class samples. Moreover, by combining it with the label-aware constraint, the coding coefficients of different classes become remarkably distinct, with the large coefficients located in different areas, which is very favorable for the classification task.
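The group regularization can be sketched in numpy as follows; the fully connected per-class graph and the resulting normalized Laplacian follow our reading of this section (names are illustrative):

```python
import numpy as np

def class_graph_laplacian(Ni):
    """Normalized Laplacian of a fully connected graph on the N_i samples
    of one class: every pair of same-class vertices is connected, so each
    vertex has degree N_i - 1 (our reading of the construction above)."""
    A = np.ones((Ni, Ni)) - np.eye(Ni)   # adjacency of the complete graph
    return np.eye(Ni) - A / (Ni - 1)     # I - D^{-1/2} A D^{-1/2}

def group_regularization(Zp_i, Li):
    """Total variation tr(Z_i^p L_i (Z_i^p)^T): it is zero when all
    same-class samples share identical particularity codes, and grows as
    their codes diverge."""
    return np.trace(Zp_i @ Li @ Zp_i.T)
```

The Laplacian is positive semi-definite, so the penalty is always non-negative and vanishes exactly on class-wise constant codes, which is the consistency behavior described above.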

Optimization Strategy of LPLC-HDL
As the objective function in Equation (4) is not jointly convex in the variables (D_0, L_0, Z_0, D_p, Z_p), it is divided into two sub-problems by learning the shared and class-specific dictionaries alternately, that is, updating the variables (D_0, L_0, Z_0) with (D_p, Z_p) fixed, then updating (D_p, Z_p) with (D_0, L_0, Z_0) fixed.
We first use the k-means algorithm to initialize the shared dictionary D_0 ∈ R^(m×k_0) using all the training samples, then initialize the ith class sub-dictionary D_i ∈ R^(m×k_i) using the training samples X_i and concatenate all these sub-dictionaries as the class-specific dictionary D_p ∈ R^(m×K_p). Next, we compute the initial coefficients Z_0 and Z_p with the Multi-Variate Ridge Regression (MVRR) algorithm, and obtain the initial matrix L_0 by Equation (6). In line with the corresponding sub-dictionaries, the series of matrices P_i and L_i (i = 1, 2, · · · , C) are constructed with Equations (9) and (12), respectively. After finishing the initialization, we can optimize the objective function with the following steps.
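A minimal sketch of this initialization, with a plain Lloyd-style k-means and the ridge-regression closed form Z = (D^T D + τI)^{-1} D^T X for MVRR (the regularizer τ is an assumed value, not stated in the paper):

```python
import numpy as np

def init_dictionary(X, k, iters=20, seed=0):
    """Initialise a dictionary with k atoms by plain k-means on the columns
    of X (m x N), then normalise each atom to unit length, as the paper
    requires for all atoms."""
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(X.shape[1], k, replace=False)].copy()
    for _ in range(iters):
        # assign every sample to its nearest centroid
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)  # (k, N)
        assign = np.argmin(d2, axis=0)
        for j in range(k):
            pts = X[:, assign == j]
            if pts.shape[1] > 0:
                C[:, j] = pts.mean(axis=1)
    return C / np.linalg.norm(C, axis=0, keepdims=True)

def mvrr_code(D, X, tau=1e-2):
    """Multi-variate ridge regression codes: Z = (D^T D + tau I)^{-1} D^T X.
    (tau is an assumed small regulariser.)"""
    K = D.shape[1]
    return np.linalg.solve(D.T @ D + tau * np.eye(K), D.T @ X)
```

The same two helpers cover all the initial quantities: D_0 and each D_i come from `init_dictionary`, and the initial Z_0 and Z_p come from `mvrr_code` over the corresponding dictionaries.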

Shared Dictionary Learning
By fixing the variables (D_p, Z_p), the objective function of our LPLC-HDL for learning the shared dictionary becomes

min_{D_0, Z_0} ||X̄ − D_0 Z_0||_F^2 + η tr(Z_0^T L_0 Z_0),  s.t. ||d_k||_2^2 = 1, (13)

where X̄ = [X̄_1, · · · , X̄_C] with X̄_i = X_i − D_p Z_i^p. To update the variables (D_0, L_0, Z_0), we turn to an iterative scheme, i.e., updating D_0 by fixing (L_0, Z_0), constructing L_0 based on D_0, and updating Z_0 by fixing (D_0, L_0). The detailed steps are elaborated below.
(a) Update the shared dictionary D_0 and the matrix L_0. Without loss of generality, we first optimize the shared dictionary D_0 by fixing the variables (L_0, Z_0). The function (13) with respect to D_0 becomes

min_{D_0} ||X̄ − D_0 Z_0||_F^2,  s.t. ||d_k||_2^2 ≤ 1, (14)

where the quadratic constraint is introduced on each atom of D_0 to avoid the scaling issue. The above minimization can be solved by adopting the Lagrange dual algorithm presented in [28].
Based on the obtained dictionary D_0, the graph Laplacian matrix L_0 can be constructed using (5) and (6).
(b) Update the commonality representation Z_0. By fixing the variables (D_0, L_0), the optimization of the commonality representation Z_0 in Equation (13) can be formulated as

min_{Z_0} ||X̄ − D_0 Z_0||_F^2 + η tr(Z_0^T L_0 Z_0), (15)

which has a closed-form solution for Z_0, as all the terms are based on the l2-norm regularization.
By setting the derivative of (15) to zero, the optimal representation Z_0 can be obtained as

Z_0 = (D_0^T D_0 + η L_0)^{−1} D_0^T X̄. (16)
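In numpy, this closed-form update is a single linear solve. The sketch below follows our reading of the derivation: the first-order condition of the two quadratic terms gives the regularized normal equations (D_0^T D_0 + η L_0) Z_0 = D_0^T X̄:

```python
import numpy as np

def update_Z0(D0, Xbar, L0, eta):
    """Closed-form commonality codes from the first-order condition of
       ||Xbar - D0 Z0||_F^2 + eta * tr(Z0^T L0 Z0):
         (D0^T D0 + eta * L0) Z0 = D0^T Xbar
    D0   : (m, k0) shared dictionary
    Xbar : (m, N) residual data X - Dp Zp
    L0   : (k0, k0) graph Laplacian of the shared atoms
    """
    return np.linalg.solve(D0.T @ D0 + eta * L0, D0.T @ Xbar)
```

Since L_0 is symmetric positive semi-definite, the system matrix is well conditioned whenever D_0 has full column rank, and the whole update costs one factorization per iteration.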

Class-Specific Dictionary Learning
Assuming the variables (D_0, L_0, Z_0) are fixed, the objective function (4) for learning the class-specific dictionary can be formulated as follows.

min_{D_p, Z_p} ∑_{i=1}^C ( ||X̂_i − D_p Z_i^p||_F^2 + λ ||P_i Z_i||_F^2 + γ tr(Z_i^p L_i (Z_i^p)^T) ),  s.t. ||d_k||_2^2 = 1, (17)

where X̂_i = X_i − D_0 Z_i^0. To update the variables (D_p, Z_p), we follow the iterative scheme presented in Section 4.1.1, i.e., updating D_p by fixing Z_p, then updating Z_p by fixing D_p. The updating steps are elaborated below.
(c) Update the class-specific dictionary D_p. By fixing the variable Z_p, the optimization of the dictionary D_p can be simplified to the optimization of each atom d_p^k (k = 1, 2, · · · , K_p). Provided k ∈ S_i and the other atoms in D_p are fixed, the atom d_p^k can be updated by solving the following problem.

min_{d_p^k} ||Ẽ − d_p^k z̃_p^k||_F^2,  s.t. ||d_p^k||_2^2 = 1, (18)

where Ẽ denotes the residual after removing the contributions of all the other atoms and z̃_p^k is the profile of atom d_p^k. The solution of atom d_p^k can be easily derived as

d̂_p^k = Ẽ (z̃_p^k)^T / (z̃_p^k (z̃_p^k)^T). (19)

Considering the energy of each atom being constrained as in (18), the solution can be further normalized as

d_p^k = d̂_p^k / ||d̂_p^k||_2. (20)

Because the atoms with indices outside S_i are fixed when updating the atoms with indices in S_i, we can compute E = X − ∑_{t∉S_i} d_p^t z_p^t in advance and use it in the calculation of Ẽ to accelerate the update. Likewise, we successively update the atoms corresponding to S_i (i = 1, 2, · · · , C) and obtain the overall class-specific dictionary D_p.

(d) Update the particularity representation Z_p. For the particularity representation of the ith class Z_i^p, the optimization problem depending on it becomes

min_{Z_i^p} ||X̂_i − D_p Z_i^p||_F^2 + λ ||P_i Z_i||_F^2 + γ tr(Z_i^p L_i (Z_i^p)^T). (21)

The solution of the above problem can be obtained by computing each code in Z_i^p, where x_i(n) (n = 1, 2, · · · , N_i) is the nth training sample of the ith class, z_i^p(n) is the code of x_i(n) and L_i(n_1, n_2) denotes the (n_1, n_2)th entry of the matrix L_i. Since all the terms in Equation (21) are in quadratic form, the solution of z_i^p(n) can be easily obtained by setting its derivative to zero.
By using (12) and denoting L̄_i = L_i − I_i with the identity matrix I_i ∈ R^(N_i×N_i), we can further obtain a matrix version of the Z_i^p update as

Z_i^p = (D_p^T D_p + λ P̃_i + γ I)^{−1} (D_p^T X̂_i − γ Z_i^p L̄_i), (22)

where P̃_i denotes the block of P_i corresponding to the class-specific atoms. By repeating steps (a)-(d), we can iteratively obtain all the optimal variables (D_0, L_0, Z_0, D_p, Z_p). The overall optimization of our LPLC-HDL is summarized in Algorithm 1. Moreover, the objective function value of Equation (4) decreases as the number of iterations increases on a given dataset. For example, the convergence curve of LPLC-HDL on the Yale face dataset [29] is illustrated in Figure 1, from which we can see that the proposed algorithm converges quickly, within 20 iterations. In Algorithm 1, the time complexity of our LPLC-HDL mainly comes from the following parts: O(k_0^3 mN) for updating the shared dictionary D_0, O(k_0^2) for updating the graph Laplacian matrix L_0, O(k_0^3) for updating the commonality representation Z_0, O(K_p mN) for updating the class-specific dictionary D_p and O(K_p^3) for updating the particularity representation Z_p. Thus the overall time complexity is O((k_0^3 + K_p)mN + K_p^3) per iteration.
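The single-atom least-squares update in step (c) can be sketched in numpy as follows; for clarity this sketch keeps only the residual-fit term ||Ẽ − d z||_F^2 and omits the label-aware and group penalties, so it is a simplified illustration rather than the paper's full update:

```python
import numpy as np

def update_atom(E_tilde, z):
    """Least-squares update of one atom given its profile row z:
         argmin_d ||E_tilde - d z||_F^2  =>  d = E_tilde z / (z . z),
    followed by normalisation to unit length (the atom-energy constraint).
    E_tilde : (m, N) residual after removing all other atoms' contributions
    z       : (N,)  profile (row of the coefficient matrix) of this atom
    """
    d = E_tilde @ z / (z @ z + 1e-12)   # small eps guards an all-zero profile
    return d / (np.linalg.norm(d) + 1e-12)
```

Precomputing the residual once per class, as the text describes, means each atom update is just one matrix-vector product plus a normalization.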
Algorithm 1 Optimization procedure of the proposed LPLC-HDL
1: Input: Training samples X, class labels Y, the parameters λ, γ, η
2: Initialization: Using the k-means algorithm, initialize D_0 with all the training samples and compute the matrix L_0 with Equation (6); initialize D_i (i = 1, 2, · · · , C) with the training samples X_i and concatenate these sub-dictionaries as the initial D_p; initialize the coefficients Z_0 and Z_p with the MVRR algorithm; construct the series of matrices P_i and L_i with Equations (9) and (12), respectively.
3: while the convergence condition is not satisfied do
4:   (a) Update D_0 by solving Equation (14); update L_0 using Equations (5) and (6);
5:   (b) Update Z_0 by Equation (16);
6:   (c) Update D_p as follows:
7:   for i = 1 to C do
8:     Compute E = X − ∑_{t∉S_i} d_p^t z_p^t;
9:     for k ∈ S_i do
10:       Update the atom d_p^k by solving Equation (19);
11:     end for
12:   end for
13:   (d) Update Z_p class by class;
14: end while
15: Output: The hybrid dictionary D = [D_0, D_p] and the coefficients Z

Classification Procedure
After completing Algorithm 1, we can use the learned hybrid dictionary D = [D_0, D_p] to represent a test sample and predict its label. First, we find the commonality-particularity representation ẑ_t = [(z_t^0)^T, (z_t^p)^T]^T of a test sample x_t by coding it over the hybrid dictionary D in the same l2-regularized manner as the training phase. Second, we compute x̄_t = x_t − D_0 z_t^0 based on ẑ_t to exclude the contribution of the shared dictionary. The class label of x_t is then determined by the minimal class-wise reconstruction error, identity(x_t) = arg min_i ||x̄_t − D_i z_t^i||_2^2.
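A classification sketch in numpy under stated assumptions: we code the test sample with plain ridge regression over the hybrid dictionary (the paper solves its own coding problem; the regularizer τ is an assumed value), subtract the shared part, and pick the class with the smallest residual:

```python
import numpy as np

def classify(x, D0, D_subs, tau=1e-2):
    """Label a test sample with the learned hybrid dictionary.

    x      : (m,) test sample
    D0     : (m, k0) shared dictionary
    D_subs : list of per-class sub-dictionaries D_i
    Returns the 0-based index of the predicted class.
    """
    D = np.hstack([D0] + list(D_subs))
    # ridge coding over the whole hybrid dictionary (assumed coding scheme)
    z = np.linalg.solve(D.T @ D + tau * np.eye(D.shape[1]), D.T @ x)
    k0 = D0.shape[1]
    x_bar = x - D0 @ z[:k0]               # remove the commonality part
    errs, start = [], k0
    for Di in D_subs:
        zi = z[start:start + Di.shape[1]]
        errs.append(np.sum((x_bar - Di @ zi) ** 2))
        start += Di.shape[1]
    return int(np.argmin(errs))
```

Subtracting the shared contribution before computing the residuals is what lets the class-specific sub-dictionaries compete only on the distinctive part of the sample.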
In the proposed LPLC-HDL method, the three parameters λ, γ and η are selected by five-fold cross-validation on the training set, and their optimal values for each dataset are shown in Table 1. In addition, the atom numbers of the shared dictionary and the class-specific dictionary for each dataset are detailed in the following experiments.

Experiments on the Yale Face Dataset
In the experiments, we first consider the Yale face dataset, which contains 165 gray-scale images of 15 individuals with 11 images per category. Each individual has one image per facial expression or configuration: left-light, center-light, right-light, w/glasses, w/no glasses, happy, sad, normal, surprised, sleepy and wink, as shown in Figure 2a. Each image has a 24 × 24 pixel resolution and is reshaped into a 576-dimensional vector with normalization for representation. Following the setting in [19], six images of each individual are randomly selected for training and the rest are used for testing. In addition, the number of dictionary atoms is set to 15 × 4 + 60 = 120, which means four class-specific atoms for each individual and 60 shared atoms for all the individuals. The parameters of our LPLC-HDL on the Yale face dataset can be seen in Table 1. To acquire a stable recognition accuracy, we run LPLC-HDL 30 times (rather than 10) with independent training/testing splits. The comparison results on the Yale face dataset are listed in Table 2. Since the variations in facial expression are complex, the variations in the testing samples cannot be well represented by directly using the training data; hence SRC has worse accuracy than the other dictionary learning methods. We can also see that the hybrid dictionary learning methods, including DL-COPAR [16], LRSDL [17] and CLSDDL [19], outperform the remaining competing approaches, and the proposed LPLC-HDL method achieves the best recognition accuracy of 97.01%, which illustrates that our method can separate the shared and class-specific information of face images more appropriately.
Table 2. The recognition rates (%) on the Yale face dataset.

SRC [33]: 81.3 ± 2.5          LC-KSVD (120) [6]: 83.6 ± 2.7
CRC [34]: 89.8 ± 1.9          LCLE-DL (120) [7]: 91.2 ± 3.1
ProCRC [35]: 91.7 ± 2.1       DL-COPAR (120) [16]: 92.3 ± 3.7
LLC [37]: 82.1 ± 2.6          LRSDL (120) [17]: 92.7 ± 3.3
SVM [38]: 94.4 ± 2.8          CLSDDL-LC (120) [19]: 95.9 ± 2.2
D-KSVD (120) [5]: 82.3 ± 4.5  CLSDDL-GC (120) [19]: 95.3 ± 2.9
FDDL (120) [13]: 89.6 ± 2.6   LPLC-HDL (120): 97.01 ± 2.6

We further illustrate the performance of LPLC-HDL with different sizes of the shared and class-specific dictionaries, as shown in Figure 3. From it we can see that a higher recognition accuracy on the Yale face dataset can be obtained by increasing the atom numbers of both the shared and class-specific dictionaries. Figure 3 also shows that when the number of class-specific atoms is small, the recognition accuracy is sensitive to the number of shared atoms, as a large shared dictionary harms the discriminative power of the class-specific dictionary in this case. However, increasing the number of class-specific atoms beyond the number of training samples per class brings no notable increase in recognition accuracy. Considering that a large dictionary slows down both the training and the testing, the atom numbers of the shared and class-specific dictionaries are not set to large values in the experiments.

Experiments on the Extended YaleB Face Dataset
The Extended YaleB face dataset contains 2414 frontal face images of 38 people. For each person, there are about 64 face images and the original images have 192 × 168 pixels. This dataset is challenging due to varying poses and illumination conditions, as displayed in Figure 2b. For comparison, we use the normalized 32 × 32 images instead of the original pixel information. In addition, we randomly select 20 images per category for training and take the rest as the testing images. The parameters of LPLC-HDL on the Extended YaleB face dataset are also shown in Table 1. The hybrid dictionary size is set to 38 × 13 + 76 = 570, which denotes 13 class-specific atoms for each person and 76 shared atoms for all the persons, with the same structure adopted by the other hybrid dictionary learning methods.
We run LPLC-HDL and all the comparison methods 10 times to obtain reliable accuracies, and the average recognition rates are listed in Table 3. As shown in Table 3, compared with the K-SVD, D-KSVD, FDDL and LC-KSVD methods, LCLE-DL achieves a better recognition result with the same dictionary size. The reason is that LCLE-DL effectively utilizes the locality and label information of the atoms during dictionary learning. It is also shown that the hybrid dictionary learning methods, including DL-COPAR, LRSDL, CLSDDL and LPLC-HDL, generally outperform the other DL methods, which demonstrates the discriminative ability of the hybrid dictionary. By integrating the locality and label information into the hybrid dictionary, our LPLC-HDL method obtains the best recognition rate of 97.25%, which outperforms the second-best approach CLSDDL-GC by 0.8% and is at least 1.2% higher than the other competing methods.

Table 3. The recognition rates (%) on the Extended YaleB face dataset.

Experiments on the LFW Face Dataset
The LFW face dataset has more than 13,000 images labeled with the name of the person pictured, all collected from the web for unconstrained face recognition and verification. Following the prior work [7], we use a subset of the LFW face dataset which consists of 1215 images from 86 persons. In this subset, there are around 11-20 images for each person, and all the images are resized to 32 × 32 pixels. Some samples from this face dataset are shown in Figure 2c. For each person, eight samples are randomly selected for training and the remaining samples are taken for testing. The parameters of LPLC-HDL on the LFW face dataset are also shown in Table 1. In addition, the hybrid dictionary size is set to 86 × 4 + 86 = 430, i.e., four class-specific atoms for each individual and 86 shared atoms for all the individuals.

Object Classification
In this subsection, we evaluate LPLC-HDL on the Caltech-101 dataset [23] for object classification. This dataset contains a total of 9146 images from 101 object categories and a background category. The number of images per category varies from a minimum of 31 to a maximum of 800. The resolution of each image is about 300 × 200 pixels, as shown in Figure 4. Following the settings in [6,40], we extract Spatial Pyramid Features (SPFs) on this dataset. First, we partition each image into 2^L × 2^L sub-regions at spatial scales L = 0, 1, 2, then extract SIFT descriptors within each sub-region with a spacing of 8 pixels and concatenate them as the SPFs. Next, we encode the SPFs with a codebook of size 1024. Finally, the feature dimension is reduced to 3000 using the Principal Component Analysis (PCA) algorithm.
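The spatial-pyramid step above can be sketched as follows, assuming the dense SIFT descriptors have already been encoded into an (H, W, K) grid of code vectors; the max-pooling operator and array shapes are our assumptions for illustration, not necessarily the exact pipeline of [6,40]:

```python
import numpy as np

def spatial_pyramid_pool(codes: np.ndarray, levels=(0, 1, 2)) -> np.ndarray:
    """Pool an (H, W, K) grid of encoded patch descriptors over
    2^L x 2^L sub-regions per pyramid level and concatenate the
    per-region pooled vectors into one feature vector."""
    H, W, K = codes.shape
    feats = []
    for L in levels:
        n = 2 ** L                       # n x n sub-regions at this level
        for i in range(n):
            for j in range(n):
                block = codes[i * H // n:(i + 1) * H // n,
                              j * W // n:(j + 1) * W // n]
                feats.append(block.reshape(-1, K).max(axis=0))
    return np.concatenate(feats)         # K * (1 + 4 + 16) dims for L = 0, 1, 2
```

With a 1024-word codebook this yields 21 × 1024 = 21,504 dimensions per image, which the paper then reduces to 3000 with PCA.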
For this dataset, 10 samples of each category are selected as the training data and the remaining are for testing. The parameters of LPLC-HDL on this dataset can be seen in Table 1. The hybrid dictionary size is set to 102 × 9 + 100 = 1018, i.e., nine class-specific atoms for each category and 100 shared atoms for all the categories. The proposed LPLC-HDL and the comparison methods are carried out 10 times, and the average classification rates are reported in Table 5. As can be seen in Table 5, our LPLC-HDL method again achieves the best classification result, with an improvement margin of at least 1.3% over the comparison methods.

Flower Classification
We finally evaluate the proposed LPLC-HDL on the Oxford 102 Flowers dataset [32] for fine-grained image classification. This dataset consists of 8189 images from 102 categories, and each category contains at least 40 images. It is very challenging because there are large variations within the same category but small differences across some categories. The flowers appear at different scales, poses and lighting conditions; some flower images are shown in Figure 5. For each category, 10 images are used for training, 10 for validation, and the rest for testing, as in [32]. For ease of comparison, we take the convolutional neural network (CNN) features provided by Cai et al. [35] as the image-level features.

Parameter Sensitivity
In the proposed LPLC-HDL method, there are three key parameters, i.e., λ, γ and η, which balance the importance of the label-aware constraint, the group regularization and the locality constraint. To analyze the sensitivity of these parameters, we define a candidate set {10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 10^0, 10^1, 10^2, 10^3, 10^4, 10^5} for them and then perform LPLC-HDL with different combinations of the parameters on the Yale face dataset and the Caltech-101 dataset. Fixing the parameter η, the classification accuracy for different values of λ and γ is shown in Figures 6a and 7a. As can be seen in the figures, the best classification result is obtained when λ and γ lie in a feasible range. When γ is very small, the effect of the group regularization is limited, leading to weak discrimination of the class-specific dictionary. On the other hand, when γ becomes very large, the classification accuracy drops because the remaining terms in (4) become less important, which decreases the representation ability of the hybrid dictionary.
Fixing the parameters λ and γ, the classification accuracy for different values of η is shown in Figures 6b and 7b. From them we can see that the classification accuracy is insensitive to η when its value lies within a certain range, e.g., 10^−5 ≤ η ≤ 10^−1 on the Yale face dataset. It should be noted that the incoherence between the shared and class-specific dictionaries grows as η increases; an overly large η thus impairs the reconstruction of the test data and decreases the classification accuracy.
Due to the diversity of the datasets, it remains an open problem to adaptively select the optimal parameters for different datasets. In the experiments, we use a simple and effective way to find the optimal values of λ, γ and η. Based on the previous analysis, we first fix η to a small value such as 0.01, then search for the best combination of λ and γ over the coarse set {10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 10^0, 10^1, 10^2, 10^3, 10^4, 10^5}. Around the best coarse combination, we define a finer candidate set where the optimal values may lie, and perform LPLC-HDL again with combinations of λ and γ selected from this fine set. In this way, we obtain the optimal parameter values for all the experimental datasets, so the best classification results are guaranteed.

Besides the key parameters λ, γ and η, there are also the parameters δ and k in the proposed locality constraint. In the experiments, we find that these two parameters have stable values across the experimental datasets. For example, Figure 8a,b show the classification accuracies of LPLC-HDL with respect to δ, with the remaining parameters fixed, on the Yale face dataset and the Caltech-101 dataset. From the subfigures, we can see that the classification accuracy is insensitive to δ, and approximately the best result is obtained when δ = 4. A similar phenomenon is observed for the parameter k. Thus we set δ = 4 and k = 1 in our LPLC-HDL method for convenience of calculation.
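The coarse-to-fine search described above can be sketched as follows; `evaluate` stands in for one full train/validate run of LPLC-HDL at a given (λ, γ) pair and is purely hypothetical, as is the choice of refinement factors:

```python
def coarse_to_fine_search(evaluate, coarse=None,
                          refine_factors=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Grid-search (lambda, gamma): first over a coarse power-of-ten
    grid, then over a finer grid around the best coarse pair."""
    if coarse is None:
        coarse = [10.0 ** p for p in range(-5, 6)]   # 1e-5 ... 1e5
    # Coarse stage: exhaustive search over the power-of-ten grid.
    best = max(((lam, gam) for lam in coarse for gam in coarse),
               key=lambda p: evaluate(*p))
    # Fine stage: search a multiplicative neighborhood of the best pair.
    fine_l = [best[0] * f for f in refine_factors]
    fine_g = [best[1] * f for f in refine_factors]
    return max(((lam, gam) for lam in fine_l for gam in fine_g),
               key=lambda p: evaluate(*p))
```

In the paper's protocol, η is held at a small fixed value (e.g., 0.01) while this two-stage search runs over λ and γ.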

Evaluation of Computational Time
We also conducted experiments to evaluate the running time of the proposed LPLC-HDL and other representative DL methods on the two face datasets and the Caltech-101 dataset; the comparison results are listed in Table 7. "Train" denotes the running time of each training iteration, and "Test" is the average processing time for classifying one test sample. All the experiments are conducted on a 64-bit computer with an Intel i7-7700 3.6 GHz CPU and 12 GB RAM under the MATLAB R2019b programming environment. From Table 7, we can see that, although slower than CLSDDL-LC [19], LPLC-HDL trains clearly faster than D-KSVD [5], LC-KSVD [6], FDDL [13] and DL-COPAR [16]. In the testing stage, LPLC-HDL has similar efficiency to D-KSVD and LC-KSVD, and its testing process is much faster than that of FDDL and DL-COPAR. In particular, the average time for classifying a test image with LPLC-HDL is always less than that of CLSDDL-LC in the experiments.
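The two timing quantities in Table 7 can be measured with a simple wall-clock scheme; `train_one_iteration` and `classify` below are hypothetical placeholders for the corresponding routines, not functions from the paper:

```python
import time

def time_per_iteration(train_one_iteration, n_iters=10):
    """Average wall-clock time of one training iteration ("Train")."""
    start = time.perf_counter()
    for _ in range(n_iters):
        train_one_iteration()
    return (time.perf_counter() - start) / n_iters

def time_per_test_sample(classify, test_samples):
    """Average time to classify one test sample ("Test")."""
    start = time.perf_counter()
    for y in test_samples:
        classify(y)
    return (time.perf_counter() - start) / len(test_samples)
```

Averaging over many iterations and test samples, as done here, smooths out timer resolution and OS scheduling noise.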

Conclusions
In this paper, we propose a novel hybrid dictionary learning method, LPLC-HDL, that takes advantage of the locality and label information of the data to solve the image classification task effectively. LPLC-HDL couples a locality constraint with a label-aware constraint to better separate the shared and class-specific information. More specifically, the locality constraint on the shared dictionary models the similar features of images from different classes, while the label-aware constraint and the group regularization together make the class-specific dictionary more discriminative. An effective alternative optimization strategy is developed to solve the objective function of LPLC-HDL. After training the hybrid dictionary, the class label of a test image can be predicted more accurately by excluding the contribution of the shared dictionary. The experimental results on three face datasets, an object dataset and a flower dataset demonstrate the effectiveness and superiority of our method for image classification.
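The classification rule summarized above (code the test image over the full hybrid dictionary, strip the shared contribution, then compare class-specific residuals) could look like the following numpy sketch; the ℓ2-regularized coding step and all names here are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def classify(y, D_shared, D_classes, lam=1e-3):
    """Predict the class of test sample y with a hybrid dictionary.

    D_shared: (d, m0) shared sub-dictionary.
    D_classes: list of (d, m_i) class-specific sub-dictionaries.
    """
    D = np.hstack([D_shared] + D_classes)
    # Ridge-regularized coding of y over the whole hybrid dictionary.
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
    # Exclude the shared dictionary's contribution from y.
    m0 = D_shared.shape[1]
    y_res = y - D_shared @ alpha[:m0]
    # Assign the class whose sub-dictionary best reconstructs the residual.
    residuals, start = [], m0
    for Di in D_classes:
        ai = alpha[start:start + Di.shape[1]]
        residuals.append(np.linalg.norm(y_res - Di @ ai))
        start += Di.shape[1]
    return int(np.argmin(residuals))
```

The design point is that subtracting the shared reconstruction before comparing residuals prevents features common to all classes from diluting the class-specific evidence.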