Discriminative Nonnegative Tucker Decomposition for Tensor Data Representation

Abstract: Nonnegative Tucker decomposition (NTD) is an unsupervised method that has been applied in many fields. However, NTD does not make use of the label information of sample data, even when such label information is available. To remedy this defect, in this paper we propose a label-constrained NTD method, namely discriminative NTD (DNTD), which treats a fraction of the label information of the sample data as a discriminative constraint. Differing from other label-based methods, the proposed method enforces that sample data with the same label are aligned on the same axis or line. Combining NTD with the label-discriminative constraint term, DNTD can not only extract the part-based representation of the data tensor but also boost the discriminative ability of NTD. An iterative updating algorithm is provided to solve the objective function of DNTD. Finally, the proposed DNTD method is applied to image clustering. Experimental results on the ORL, COIL20, and Yale datasets show that the clustering accuracy of DNTD is improved by 8.47–32.17% and the normalized mutual information by 10.43–29.64% compared with state-of-the-art approaches.


Introduction
In recent years, data analysis has attracted increasing attention in many application fields, such as machine learning, artificial intelligence, and computer vision. For instance, in Ref. [1], cluster learning of data is analysed by tensor low-rank representation. In Ref. [2], routing recommendations for heterogeneous data are implemented by tensor-based frameworks. Long et al. [3] recover the missing entries of visual data by tensor completion. Bernardi et al. [4] provide a hitchhiker's guide to secant varieties and tensor decomposition. Real-world sample data are often high dimensional, while the important information and structures lie in a low-dimensional representation space of the sample data. Thus, many dimensionality reduction methods have been proposed to seek an appropriate low-dimensional representation of the original sample data.
Nonnegative matrix factorization (NMF) [5], as a traditional dimensionality reduction method, has been widely used for different purposes, such as image processing [6], community detection [7], clustering [8], etc. The NMF of a nonnegative data matrix X ∈ R_+^{m×n} finds two low-rank nonnegative matrices, U ∈ R_+^{m×s} and V ∈ R_+^{n×s}, such that X ≈ UV^T. For example, if X is an image sample data matrix whose columns contain the pixel data of image samples, then m is the number of pixels per image sample, n is the number of image samples, and s is the dimension of the low-dimensional representation of the image sample data matrix; U is called the basis matrix, and V is called the encoding matrix and is also regarded as the low-dimensional representation of the image sample data matrix. Generally, s is chosen much smaller than m and n. However, a weakness of NMF is that it fails to preserve the geometrical information of the sample data. Thus, Cai et al. proposed graph-regularized NMF (GNMF), which encodes the geometrical information of the sample data in the low-dimensional representation space by constructing a nearest-neighbor graph [9]. Using hypergraph regularization and the correntropy instead of the Euclidean norm in the loss term of NMF, Yu et al. proposed correntropy-based hypergraph-regularized NMF (CHNMF) [10]. CHNMF considers the high-order geometric relationships inherent in the sample data and reduces the influence of noise and outliers.
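As an illustration, the factorization X ≈ UV^T can be computed with the classical multiplicative updates of Lee and Seung [5]; the following is a minimal NumPy sketch (the function name, iteration count, and the small epsilon guarding division are our own choices, not part of the original formulation):

```python
import numpy as np

def nmf(X, s, n_iter=200, eps=1e-9):
    """Lee-Seung multiplicative updates for X ~ U V^T under Frobenius loss.

    X : (m, n) nonnegative matrix, s : target rank.
    Returns nonnegative factors U (m, s) and V (n, s).
    """
    m, n = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, s))
    V = rng.random((n, s))
    for _ in range(n_iter):
        # Elementwise multiplicative updates keep U, V nonnegative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V
```

Each update multiplies the current factor by a nonnegative ratio, so nonnegativity is preserved automatically and the Frobenius objective is nonincreasing.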
When the label information of sample data is available, it can naturally also be used to construct a graph. Generally speaking, sample data can be fully labeled, partially labeled, or unlabeled, and the corresponding algorithms are categorized as supervised, semi-supervised, and unsupervised [11]. On the one hand, in practice, a mass of labeled sample data is hard to obtain, whereas a small amount of labeled sample data is readily available, which makes semi-supervised methods natural. On the other hand, a semi-supervised method can utilize all sample data while simultaneously using the labels of part of them, so that it retains the performance of an unsupervised method while propagating labels under the guidance of the supervisory information. Restricting the data to satisfy prior label information is a common way of using label information to develop a semi-supervised method [12]. For example, Liu et al. introduced constrained NMF (CNMF) by incorporating the label information of some sample data into the objective function of NMF [13]. In CNMF, sample data with the same labels are merged into a single point. Babaee et al. studied a discriminative NMF (DNMF) that utilizes a fraction of the label information of sample data in a regularization term [14]. Differing from CNMF, DNMF enforces that sample data with the same label are aligned on the same axis: it constructs a label matrix and then produces a label-discriminative regularizer bridging the label matrix and the labeled sample data. However, when these NMF-based methods deal with image sample data, the data are first vectorized, forming a matrix like the X of the example above, and then represented by a low-rank approximation, which often destroys the internal structure of the sample data. The study in Ref. [15] indicates that tensor representations can partly alleviate this problem.
In fact, in real-world applications, many sample data are represented by a tensor (i.e., a multiway array), e.g., color images, video clips, multichannel electroencephalography (EEG), etc. For these reasons, tensor factorization techniques were proposed to deal with data tensors. The Tucker decomposition [16], one of the tensor factorization methods, has been widely applied to face recognition [17], image processing [18], signal processing [15], etc. In Ref. [19], Kim et al. introduced nonnegative Tucker decomposition (NTD) by combining Tucker decomposition with nonnegativity constraints on the core tensor and factor matrices. Since then, NTD has been extended to many applications, such as clustering [20], pattern extraction [21], and signal analysis [22], and several variants of NTD from diverse perspectives have been proposed. For example, Qiu et al. studied a graph-regularized nonnegative Tucker decomposition (GNTD), which incorporates graph regularization into NTD to preserve the geometrical information of a data tensor [23]. Pan et al. introduced an orthogonal nonnegative Tucker decomposition (ONTD) by imposing orthogonality on each factor matrix [24]. However, NTD and the variants mentioned above are unsupervised algorithms, meaning they do not use the available label information of sample data.
Recently, many researchers have considered the case when the label information of sample data is available. To make better use of this label information, many works incorporate label-information constraints into the framework of NMF or NTD, such as CNMF [13], DNMF [14], graph-based discriminative NMF (GDNMF) [25], graph-regularized and sparse NMF with hard constraints (GSNMFC) [26], semi-supervised robust distribution-based NMF (SRDNMF) [27], and semi-supervised NTD (SNTD) [28]. Among these methods, CNMF and DNMF both incorporate the label information into NMF; however, they do not consider the geometric structure of the sample data. Thus, GDNMF was proposed, which incorporates both graph regularization and label information into NMF. Following this idea, GSNMFC jointly incorporates a graph regularizer, label information, and a sparseness constraint. However, while these methods utilize the label information, they do not address robustness. To improve robustness, SRDNMF introduced a Kullback-Leibler divergence and a label-discriminative constraint into NMF. Nevertheless, these semi-supervised methods are NMF-based learning algorithms, which may destroy the inherent structure of the sample data. To alleviate this drawback, SNTD was introduced based on NTD, jointly propagating the limited label information and learning the nonnegative tensor representation [28]. Although SNTD uses the label constraint, it fails to enforce that sample data with the same label are aligned on the same axis or line.
Motivated by recent progress [14,19], in this paper we propose a label-constrained NTD, called discriminative NTD, or DNTD for short. We incorporate a fraction of the label information of the sample data into NTD; the key idea is that sample data belonging to the same class should be aligned on the same axis or line in the low-dimensional representation. We also discuss how to efficiently solve the corresponding optimization problem, and we provide the optimization scheme and its convergence proof. The main contributions of the proposed approach are:
• By constructing the label matrix and coupling the label-discriminative regularizer to the objective function of NTD, the DNTD method can not only extract the part-based representation from the data tensor but also boost the discriminative ability of NTD. The key idea of the label-discriminative term is that sample data with the same label are very close to, or aligned on, the same axis or line in the low-dimensional representation;
• An efficient updating algorithm is developed to solve the optimization problem, and a convergence proof is provided;
• Numerical examples from real-world applications are provided to demonstrate the effectiveness of the proposed method.
The rest of the paper is organized as follows: In Section 2, we briefly review the NTD. In Section 3, the DNTD method is proposed, and the detailed algorithm and proof of convergence of the algorithm are provided. In Section 4, experiments for clustering tasks are presented. Finally, in Section 5, conclusions are drawn.

Nonnegative Tucker Decomposition
Given a nonnegative data tensor X ∈ R_+^{I_1×I_2×···×I_N}, whose N-th mode indexes the data samples, nonnegative Tucker decomposition (NTD) aims at decomposing X into a nonnegative core tensor G ∈ R_+^{J_1×J_2×···×J_N} multiplied by N nonnegative factor matrices A^(r) ∈ R_+^{I_r×J_r} (r = 1, 2, . . . , N) along each mode [19]. To achieve this goal, NTD minimizes the sum of squared residues between the data tensor X and the multilinear product of the core tensor G and the factor matrices A^(r), which can be formulated as

min_{G, A^(1), ..., A^(N)} ‖X − G ×_1 A^(1) ×_2 A^(2) ··· ×_N A^(N)‖_F^2, s.t. G ≥ 0, A^(r) ≥ 0, r = 1, 2, . . . , N, (1)

where, and in the following, the operator ×_r is referred to as the r-mode product [16]. For example, the r-mode product of a tensor Y ∈ R^{J_1×J_2×···×J_N} and a matrix U ∈ R^{I_r×J_r}, denoted by Y ×_r U, is of size J_1 × ··· × J_{r−1} × I_r × J_{r+1} × ··· × J_N, and

(Y ×_r U)_{j_1...j_{r−1} i_r j_{r+1}...j_N} = Σ_{j_r=1}^{J_r} y_{j_1...j_{r−1} j_r j_{r+1}...j_N} u_{i_r j_r}.
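The r-mode product defined above can be sketched in NumPy as follows (the helper name `mode_product` is our own): `np.tensordot` contracts the chosen mode of Y with the columns of U, and `np.moveaxis` restores the mode order.

```python
import numpy as np

def mode_product(Y, U, r):
    """r-mode product Y x_r U.

    Contracts mode r of tensor Y (size J_r) with the columns of U (I_r x J_r),
    so mode r of the result has size I_r.
    """
    # tensordot sums over Y's axis r and U's axis 1, appending the new axis
    # last; moveaxis puts the new axis back into position r.
    return np.moveaxis(np.tensordot(Y, U, axes=(r, 1)), -1, r)
```

For a 3-way tensor, `mode_product(Y, U, 1)` agrees elementwise with the sum in the definition above.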
Equation (1) can be represented in matrix form:

min ‖X_(N) − A^(N) G_(N) (⊗_{p≠N} A^(p))^T‖_F^2, s.t. G ≥ 0, A^(r) ≥ 0, r = 1, 2, . . . , N,

where, and in the following, X_(N) ∈ R_+^{I_N×(I_1 I_2···I_{N−1})} and G_(N) are the mode-N unfolding matrices of the data tensor X and the core tensor G [16], respectively, and ⊗ denotes the Kronecker product. Therefore, NTD can be transformed into an NMF with encoding matrix A^(N) ∈ R_+^{I_N×J_N}, where I_N and J_N can be regarded as the number of samples and the dimension of the low-dimensional representation of the data tensor X with respect to the basis matrix G_(N) (⊗_{p≠N} A^(p))^T, respectively.
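The mode unfolding used above can be sketched as below (the helper name `unfold` is ours). Note that the ordering of the factors inside the Kronecker product depends on the unfolding convention; with the C-order unfolding of this sketch, the complementary factors appear in increasing mode order.

```python
import numpy as np

def unfold(T, mode):
    """Mode unfolding: rows index the chosen mode, remaining modes are
    flattened in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
```

For a 3-way Tucker model X = G ×_1 A1 ×_2 A2 ×_3 A3, this convention gives the identity X_(3) = A3 · G_(3) · (A1 ⊗ A2)^T, i.e., the matrix form of Equation (1) for N = 3.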

Discriminative Nonnegative Tucker Decomposition
As discussed above, NTD is an unsupervised method that fails to take the label information of the sample data into account even when it is available. In fact, label information has been widely used to increase supervisory performance together with unsupervised priors [14,25,26,28].
Given partial labels of the sample data, it is natural to assume that sample data belonging to the same class should be very close to, or aligned on, the same axis or line. In NTD, therefore, we would expect each class of sample data to be placed in a clearly separated cluster in the low-dimensional representation matrix A^(N). To achieve these properties, based on the available label information, we introduce the label matrix Q ∈ R^{k×I_N} (where k is the number of data classes) as follows [14]:

Q_{ij} = 1 if sample X_j is labeled and belongs to the i-th category, and Q_{ij} = 0 otherwise.
For example, consider the case of I_N = 10 sample data, of which q = 7 are labeled with the categories l_1 = 1, l_2 = 3, l_3 = 1, l_4 = 2, l_5 = 2, l_6 = 1, and l_7 = 4; then, the label matrix Q is

Q =
[1 0 1 0 0 1 0 0 0 0]
[0 0 0 1 1 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0].

Based on the introduced label matrix Q, we assume that there are l labeled sample data. Without loss of generality, if the first l sample data are labeled, then the label-discriminative constraint term can be introduced as

‖Q − B Â^T‖_F^2,

where Â = [a_1, . . . , a_l, 0, . . . , 0]^T ∈ R^{I_N×J_N}, a_j is the j-th row of A^(N) (so the rows of Â corresponding to unlabeled samples are zero), and the matrix B ∈ R^{k×J_N} transforms and scales the vectors in the part-based low-dimensional representation to obtain the best fit for the matrix Q. The matrix B is allowed to take negative values. Combining the label-discriminative constraint term with the objective function (1) of NTD, DNTD is obtained by minimizing the following objective function:

min_{G, A^(1), ..., A^(N), B} ‖X − G ×_1 A^(1) ×_2 A^(2) ··· ×_N A^(N)‖_F^2 + λ‖Q − B Â^T‖_F^2, s.t. G ≥ 0, A^(r) ≥ 0, r = 1, 2, . . . , N, (2)

where λ is a nonnegative parameter balancing the regularization term. Equivalently, Equation (2) can be rewritten in matrix form:

min ‖X_(n) − A^(n) G_(n) (⊗_{p≠n} A^(p))^T‖_F^2 + λ‖Q − B Â^T‖_F^2, (3)

where, and in the following, n is any number in the set {1, 2, . . . , N}.
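Constructing the label matrix Q from the partially labeled samples is straightforward; a minimal sketch (the function name and the dictionary input format are our own choices) is:

```python
import numpy as np

def label_matrix(labels, k, n):
    """Build Q (k x n): Q[i, j] = 1 iff sample j is labeled with class i+1.

    labels : dict mapping sample index (0-based) -> class label (1..k)
             for the labeled subset; unlabeled samples are simply absent.
    """
    Q = np.zeros((k, n))
    for j, c in labels.items():
        Q[c - 1, j] = 1.0
    return Q
```

With the paper's example (10 samples, the first 7 labeled 1, 3, 1, 2, 2, 1, 4), `label_matrix({0: 1, 1: 3, 2: 1, 3: 2, 4: 2, 5: 1, 6: 4}, k=4, n=10)` reproduces the 4 × 10 matrix Q above: each labeled column contains a single 1, and the unlabeled columns are zero.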
Using Lagrange multipliers and considering (3), we turn the objective function in (2) into the following Lagrange function:

L = ‖X_(n) − A^(n) G_(n) (⊗_{p≠n} A^(p))^T‖_F^2 + λ‖Q − B Â^T‖_F^2 + Tr(Φ_n G_(n)^T) + Σ_{r=1}^N Tr(Ψ_r A^(r)T), (4)

where Φ_n and Ψ_r are the Lagrange multiplier matrices of G_(n) and A^(r), respectively, and Tr(·) denotes the trace of a matrix. The function (4) can be rewritten as

L = Tr(X_(n) X_(n)^T) − 2 Tr(X_(n) (⊗_{p≠n} A^(p)) G_(n)^T A^(n)T) + Tr(A^(n) G_(n) (⊗_{p≠n} A^(p))^T (⊗_{p≠n} A^(p)) G_(n)^T A^(n)T) + λ[Tr(Q Q^T) − 2 Tr(Q Â B^T) + Tr(B Â^T Â B^T)] + Tr(Φ_n G_(n)^T) + Σ_{r=1}^N Tr(Ψ_r A^(r)T). (5)

Obviously, the objective function in (2) is not convex in all the variables together, so it is very difficult to find the global optimal solution. In the following, we develop an iterative updating algorithm, which updates one of the core tensor, the factor matrices, and B at a time while fixing the others, to obtain a local minimum.

Solutions of the Factor Matrices A^(r)
The partial derivative of L in (5) with respect to A^(n) (n ≠ N), together with the KKT conditions ∂L/∂A^(n) = 0 and A^(n) ⊙ Ψ_n = 0, yields the updating rule

A^(n) ← A^(n) ⊙ [X_(n) (⊗_{p≠n} A^(p)) G_(n)^T] ⊘ [A^(n) G_(n) (⊗_{p≠n} A^(p))^T (⊗_{p≠n} A^(p)) G_(n)^T]. (6)

Similarly, we consider the KKT conditions ∂L/∂A^(N) = 0 and A^(N) ⊙ Ψ_N = 0. As a result, we obtain the following updating rule for A^(N):

A^(N) ← A^(N) ⊙ [X_(N) (⊗_{p≠N} A^(p)) G_(N)^T + λ((|Q^T B| + Q^T B)/2 + (|Â B^T B| − Â B^T B)/2)] ⊘ [A^(N) G_(N) (⊗_{p≠N} A^(p))^T (⊗_{p≠N} A^(p)) G_(N)^T + λ((|Q^T B| − Q^T B)/2 + (|Â B^T B| + Â B^T B)/2)], (7)

where, and in the following, for a matrix W = (W_{ij}), we let |W| = (|W_{ij}|), ⊙ denotes the elementwise product, and ⊘ denotes the elementwise quotient.
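Because B may take negative values, the multiplicative update splits every signed matrix W into nonnegative parts using the definition |W| = (|W_{ij}|), i.e., W = (|W| + W)/2 − (|W| − W)/2, placing each part in the numerator or denominator so that both stay nonnegative. A minimal sketch of this split (the helper name is ours):

```python
import numpy as np

def split_signed(W):
    """Split W into nonnegative parts W_pos, W_neg with W = W_pos - W_neg.

    W_pos keeps the positive entries of W, W_neg the magnitudes of the
    negative entries; both are elementwise nonnegative.
    """
    W_pos = (np.abs(W) + W) / 2.0
    W_neg = (np.abs(W) - W) / 2.0
    return W_pos, W_neg
```

This is the standard device that keeps a multiplicative update well defined when a gradient term (here, those involving B) is not sign-constrained.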

Solutions of Core Tensor G
The objective function in (4) can be rewritten in vectorized form as

L = ‖vec(X) − A_⊗ vec(G)‖_2^2 + λ‖Q − B Â^T‖_F^2 + vec(Φ)^T vec(G) + Σ_{r=1}^N Tr(Ψ_r A^(r)T), (8)

where, and in the following, A_⊗ = ⊗_{p=N}^{1} A^(p) and vec(·) stacks the entries of a tensor into a vector. Similarly, by applying the KKT conditions ∂L/∂vec(G) = 0 and (vec(G))_i (vec(Φ))_i = 0, we obtain the following updating rule:

vec(G) ← vec(G) ⊙ [A_⊗^T vec(X)] ⊘ [A_⊗^T A_⊗ vec(G)]. (9)

Solutions of Matrix B
The partial derivative of L in (5) with respect to B is

∂L/∂B = −2λ(Q − B Â^T) Â.

For B, since there is no Lagrange multiplier, we directly set ∂L/∂B = 0 and solve for B to obtain the following updating rule:

B = Q Â (Â^T Â)^{−1}. (10)

Theorem 1. The objective function in (2) is nonincreasing under the updating rules in (6), (7), (9), and (10). The objective function is invariant under these updates if and only if A^(r), r = 1, 2, . . . , N, G, and B are at a stationary point.
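The closed-form update for B can be sketched as follows. We use a pseudo-inverse in place of the plain inverse, since Â^T Â may be singular when few samples are labeled; this is an implementation choice of ours, not part of the original derivation.

```python
import numpy as np

def update_B(Q, A_hat):
    """Closed-form minimizer of ||Q - B A_hat^T||_F^2 over unconstrained B.

    Q     : (k, I_N) label matrix.
    A_hat : (I_N, J_N) masked representation (zero rows for unlabeled samples).
    Returns B = Q A_hat (A_hat^T A_hat)^+ of shape (k, J_N).
    """
    return Q @ A_hat @ np.linalg.pinv(A_hat.T @ A_hat)
```

When Â has full column rank and Q = B₀ Â^T exactly, this update recovers B₀, consistent with (10).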

Proof of Convergence
To prove Theorem 1, we first give a definition and several lemmas.

Definition 1 ([5]). G(u, u′) is an auxiliary function for F(u) if the conditions G(u, u′) ≥ F(u) and G(u, u) = F(u) are satisfied.

Lemma 1 ([5]). If G(u, u′) is an auxiliary function for F(u), then F(u) is nonincreasing under the updating rule

u^{t+1} = argmin_u G(u, u^t). (11)

The equality F(u^{t+1}) = F(u^t) holds only if u^t is a local minimum of G(u, u^t). By iterating the update rule (11), u^t converges to a local minimum of G(u, u^t).
Lemma 2. For any element A^(n)_{ij}, the function G(A^(n)_{ij}, (A^(n)_{ij})^t) defined in (12) is an auxiliary function for F_{ij}(A^(n)_{ij}), where F_{ij}(A^(n)_{ij}) denotes the part of the objective function in (2) relevant to A^(n)_{ij}. According to Definition 1, we only need to show that G(A^(n)_{ij}, (A^(n)_{ij})^t) ≥ F_{ij}(A^(n)_{ij}) and G(u, u) = F(u). Comparing (12) with the second-order expansion (13) of F_{ij}(A^(n)_{ij}), it suffices to prove a coefficient inequality, which follows from the nonnegativity of the involved factors. Hence, G(A^(n)_{ij}, (A^(n)_{ij})^t) is an auxiliary function for F_{ij}(A^(n)_{ij}).
Lemma 3. The function G(A^(N)_{ij}, (A^(N)_{ij})^t) defined in (14) is an auxiliary function for F_{ij}(A^(N)_{ij}). Lemma 4. The function G(g_i, g_i^t) defined in (15) is an auxiliary function for F_i(g_i), where g_i denotes the i-th element of vec(G).
Since the proofs of Lemmas 3 and 4 are essentially similar to the proof of Lemma 2, they are omitted here.
Proof of Theorem 1. Replacing G(u, u^t) in (11) by the auxiliary function in (12), the minimum is obtained by setting the derivative to zero, which yields an update identical to (6). According to Lemma 2, F_{ij}(A^(n)_{ij}) is therefore nonincreasing under the updating rule (6) for A^(n). Similarly, putting G(A^(N)_{ij}, (A^(N)_{ij})^t) of (14) into (11), we obtain an update identical to (7); since (14) is an auxiliary function by Lemma 3, F_{ij}(A^(N)_{ij}) is nonincreasing under the updating rule (7) for A^(N). Furthermore, putting G(g_i, g_i^t) of (15) into (11) and arguing as for (7), we obtain that F_i(g_i) is nonincreasing under the updating rule (9).
Finally, we prove the convergence of the updating rule in (10) for B. Since the updating rule is derived by setting the derivative of the Lagrangian with respect to B to zero, B is unconstrained, and the objective function in (2) is convex in B alone, the update (10) exactly minimizes the objective function with respect to B in each iteration. Therefore, Theorem 1 holds.

Experiments
In this section, we evaluate the proposed DNTD scheme for clustering on three datasets using two metrics, namely accuracy (AC) and normalized mutual information (NMI) [29].
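For reference, NMI between the ground-truth classes y and the predicted clusters c can be computed from their contingency table as I(y; c)/√(H(y)H(c)); the sketch below is our own NumPy implementation of this standard definition, not the exact code of Ref. [29]. AC additionally requires a best one-to-one matching between clusters and classes (e.g., via the Hungarian algorithm), which we omit here.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two label assignments."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    # Contingency table: counts of samples in class a and cluster b.
    C = np.array([[np.sum((labels_true == a) & (labels_pred == b))
                   for b in clusters] for a in classes], dtype=float)
    pxy = C / n                                  # joint distribution
    px = pxy.sum(axis=1, keepdims=True)          # class marginal
    py = pxy.sum(axis=0, keepdims=True)          # cluster marginal
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```

NMI is invariant to a permutation of cluster labels, which is why it is a convenient clustering metric: a perfect clustering scores 1 regardless of how the clusters are numbered.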

Datasets
To evaluate the effectiveness of the proposed DNTD method, three datasets are adopted to perform experiments in the following subsection. The descriptions of these datasets are given as follows:

ORL Dataset
The ORL (https://github.com/saeid436/Face-Recognition-MLP/tree/main/ORL, accessed 11 December 2022) dataset collects 400 grayscale 112 × 92 face images of 40 different subjects, with 10 distinct images per subject. For some subjects, the images were taken at different times, varying the lighting, facial expressions, and facial details. All images were taken against a dark homogeneous background, with subjects in an upright, frontal position. In our experiment, each image was resized to 32 × 32 pixels. These images constitute a tensor X ∈ R^{32×32×400}.

COIL20 Dataset
The COIL20 (http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html, 5 April 2022) dataset contains a total of 1440 grayscale images of 20 subjects, each of which has 72 images taken from diverse poses. Specifically, the subjects were placed on a motorized turntable rotating through 360 degrees to change subject pose with respect to a fixed camera, and the images of each subject were taken 5 degrees apart. In this experiment, each image was resized to 32 × 32 pixels and all images were stacked into a tensor X ∈ R 32×32×1440 .

Yale Dataset
The Yale (http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html, accessed 5 April 2022) dataset consists of 165 images of 15 individuals. Each individual has 11 images, taken in different facial expressions or configurations, such as sad, sleepy, surprised, left-light, right-light, wearing glasses, no glasses, and so on, and each image was resized to 32 × 32 pixels. All images form a tensor X ∈ R^{32×32×165}.

Compared Algorithms and Experimental Setting
To verify that the proposed algorithm is efficient and can enhance clustering performance on these datasets, we compared it with the following algorithms:
• K-means [30]: The K-means algorithm divides n sample data of dimension m into K clusters so that the within-cluster sum of squares is minimized. It is a traditional clustering method. In our experiments, we used the command "kmeans" in Matlab 2016b to execute the K-means algorithm;
• Agglomerative hierarchical clustering (AHC) [31]: A bottom-up clustering method. It treats each sample as its own class and then merges classes so that their number gradually reduces, until the required number of classes is reached;
• NMF [5]: The NMF algorithm is one of the typical clustering algorithms. In our experiments, we adopted the Frobenius-norm formulation;
• NTD [19]: The NTD algorithm is considered a generalization of NMF;
• DNMF [14]: The DNMF algorithm incorporates the label information of sample data into the objective function of NMF and enforces that samples with the same label are aligned on the same axis.
In each experiment, we randomly selected k categories as the evaluated data. Each experiment was repeated 10 times, and each time we applied K-means 10 times in the low-dimensional representation. We randomly selected 30% of the sample data of each category as the labeled data for the semi-supervised methods, i.e., DNMF and DNTD. For all compared methods, we set the dimension of the encoding matrix in the low-dimensional representation space to the number of selected categories. For NTD and the proposed method, we tested the clustering performance with core tensor sizes (1/4)I_1 × (1/4)I_2 × k, (1/2)I_1 × (1/2)I_2 × k, and (3/4)I_1 × (3/4)I_2 × k on subdatasets of the three datasets. The clustering performance was better in most cases with a core tensor of size (1/2)I_1 × (1/2)I_2 × k on the COIL20 dataset and (3/4)I_1 × (3/4)I_2 × k on the ORL and Yale datasets; accordingly, the core tensor sizes of the NTD method and the proposed DNTD method were set to (1/2)I_1 × (1/2)I_2 × k on the COIL20 dataset and (3/4)I_1 × (3/4)I_2 × k on the others. Regarding the regularization parameter of DNMF and the proposed method, we empirically set λ = 10^{−1} on the COIL20 dataset and λ = 10^6 on the others.

Clustering Results
Tables 1–6 report the average clustering performance and standard deviation on the ORL, COIL20, and Yale datasets, respectively. They show that the proposed algorithm achieves the best clustering performance on the three datasets. On the ORL dataset, the proposed algorithm achieves 14.41%, 32.17%, 8.86%, 6.88%, and 4.81% improvements in AC and 12.09%, 27.50%, 8.33%, 5.60%, and 4.22% in NMI compared with K-means, AHC, NMF, NTD, and DNMF, respectively. On the COIL20 dataset, the proposed algorithm attains 7.32%, 8.47%, 4.02%, and 6.11% improvements in AC and 5.47%, 10.43%, 4.35%, and 8.10% improvements in NMI in comparison with K-means, NMF, NTD, and DNMF, respectively; compared with the AHC algorithm, the AC of the proposed algorithm is improved by 1.59%, whereas the NMI is reduced by 2.54%. On the Yale dataset, the proposed algorithm improves AC by 6.53%, 26.86%, 4.55%, 4.66%, and 1.29% and NMI by 6.40%, 29.64%, 5.30%, 5.56%, and 1.27% over K-means, AHC, NMF, NTD, and DNMF, respectively. Furthermore, we also applied the AHC algorithm to the new representation generated by DNTD while keeping the other parameter settings unaltered; the average clustering performance in terms of AC and NMI is 62.58% and 69.93% on the ORL dataset, 74.32% and 73.33% on the COIL20 dataset, and 26.41% and 21.00% on the Yale dataset, respectively. Overall, the clustering performance of DNTD with the AHC algorithm is lower than that with the K-means algorithm in most cases; therefore, we used the K-means algorithm together with the proposed DNTD method to test the clustering performance.
As we can see, the semi-supervised DNTD and DNMF methods are superior to the unsupervised methods, such as NTD, NMF, K-means, and AHC, on the ORL and Yale datasets, which means that taking the label information of sample data into account is useful for improving the performance of NTD and NMF. In particular, the improvement in the clustering performance of the DNTD method is obvious and attractive on the ORL dataset. Moreover, the gains from NTD to DNTD in AC and NMI exceed those from NMF to DNMF on all datasets, which indicates that incorporating the label-discriminative constraint term into NTD is more effective than combining the label-discriminative term with NMF. Furthermore, DNTD outperforms all the compared methods: it is more powerful than DNMF at boosting the discriminative ability using the available label information of the sample data, and it is capable of effectively extracting the low-dimensional representation of the data tensor. All in all, the proposed DNTD method achieves the best average clustering performance among the compared methods thanks to the label-discriminative information of the sample data in conjunction with the NTD of the data tensor representation. Since our proposed method is a semi-supervised algorithm, there is a close relationship between the clustering performance and the number of labeled sample data. Figure 1 shows the clustering performance with a varying number of labeled sample data on the ORL and Yale datasets. From Figure 1, we can see that the clustering performance of the DNTD method is enhanced as the number of labeled sample data increases; furthermore, the DNTD method surpasses the other compared methods as well.
In addition, we also observed that the clustering performance of the proposed method is better on the ORL and COIL20 datasets than on the Yale dataset. In our view, two main factors lower the performance of the proposed method on the Yale dataset. Firstly, some of the images in the Yale dataset were taken in poor light, which leads to black and blurred backgrounds. Secondly, some individuals are wearing sunglasses, so their faces are partly occluded. These two factors result in poor image quality, which causes the obtained data to lose many true values while simultaneously introducing many incorrect ones. Consequently, the proposed algorithm suffers from considerable interference, and its performance is lowered.

Parameter Selection
In our experiments, there is one parameter, λ, to be decided. The parameter λ measures the degree of discrimination of the label information of the sample data. In this subsection, we study the importance of the parameter λ in the DNTD method through the clustering performance on the above three datasets. The parameter is selected from the set {10^{−2}, 10^{−1}, 10^0, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7}. On each dataset, we fix the number of categories at 10 for simplicity. Firstly, we randomly select 10 categories to perform the experiment and apply K-means 10 times in the low-dimensional representation; then, we repeat the above operations 10 times and calculate the average. The effect of the parameter is displayed in Figure 2. From Figure 2, we see that the DNTD scheme displays good, stable performance when λ ranges from 10^{−2} to 10^5 on all datasets; that is, it is not sensitive to the parameter λ, which helps to promote the stability of DNTD. Furthermore, we observe that the DNTD scheme attains the best performance at λ = 10^6 on the ORL and Yale datasets; therefore, λ is empirically set to 10^6 on the ORL and Yale datasets and to 10^{−1} on the COIL20 dataset.

Conclusions and Future Work
In this paper, we have introduced a label-constrained nonnegative Tucker decomposition method for tensor data representation, called discriminative nonnegative Tucker decomposition, or DNTD for short, which takes the label information of the sample data into account by constructing a label matrix and, from it, a label-discriminative constraint term. Moreover, the updating rules and a proof of their convergence have been provided. Numerical experiments on three datasets have demonstrated the effectiveness of the proposed method in terms of clustering performance.
Although the proposed DNTD method has shown good clustering performance, it still has limitations from a geometric perspective. In future work, we will investigate preserving the geometric structure of sample data on manifolds. Furthermore, as an important clustering algorithm, NTD does not always show excellent and robust performance when dealing with outliers; developing a robust and efficient learning method will be a worthwhile direction.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: