Deep Subspace Clustering with Block Diagonal Constraint

Abstract: The deep subspace clustering method, which adopts deep neural networks to learn a representation matrix for subspace clustering, has shown good performance. However, this representation matrix ignores structural constraints when applied to subspace clustering. It is known that samples from different classes can be regarded as embedded in independent subspaces; thus, the representation matrix should have a block diagonal structure. This paper presents Deep Subspace Clustering with Block Diagonal Constraint (DSC-BDC), a model which constrains the representation matrix to a block diagonal structure and provides a block diagonal regularizer for learning a suitable representation. Furthermore, to enhance the representation capacity, DSC-BDC relaxes the block diagonal constraint by performing a separation strategy on the representation matrix. Specifically, the separation strategy ensures that the most compact samples are selected to represent the data. An alternating optimization algorithm is designed for our model. Extensive experiments on four public, real-world databases demonstrate the effectiveness and superiority of our proposed model.


Introduction
Subspace clustering is an important unsupervised learning method with many applications, such as image representation [1][2][3], face clustering [4][5][6], motion segmentation [7,8], bioinformatics [9] and medical image analysis [10]. Many high-dimensional datasets from real-world applications can be described by a union of low-dimensional subspaces. Thus, the main target of subspace clustering is to segment a set of samples into disjoint clusters such that the samples within each cluster belong to the same subspace [11,12].
In the past decade, abundant subspace clustering methods have been proposed, and most of them [2,[13][14][15] are based on spectral clustering: they construct a representation matrix measuring the similarity between data points and segment the input data based on this matrix. In particular, subspace clustering based on self-expression [2,15,16] is one of the most popular families of clustering methods; its main idea is to use the self-expression of each data sample to obtain representation coefficients for establishing a representation matrix. Subspace clustering methods based on self-expression include Sparse Subspace Clustering (SSC) [17], Low-Rank Representation (LRR) [15], Low-Rank Subspace Clustering (LRSC) [18], etc. Among them, SSC and LRR impose sparse and low-rank regularization terms on the representation matrix, respectively. LRSC decomposes the data matrix into a noise matrix and a clean matrix, and then obtains the low-rank representation matrix for clustering. However, these models apply to any representation matrix and are not tailored to the clustering task.
As is known, the ideal representation matrix for subspace clustering should have a block diagonal structure. Thus, in recent years, many studies [19][20][21] have utilized the block diagonal prior on the representation matrix. For example, Feng et al. [19] directly pursued the block diagonal structure by proposing a graph Laplacian constraint-based formulation; Lu et al. [20] adopted a block diagonal regularizer to directly construct the representation matrix; and Guo et al. [21] introduced block diagonal constraints on multi-view representation matrices to obtain accurate heterogeneous information. These methods demonstrate that subspace clustering performance is improved by adding the block diagonal prior on the representation matrix. However, it is worth noting that the above algorithms assume that data can be linearly represented by each other in the input space.
In addition, most traditional works on subspace clustering [2,14,16,[22][23][24] are based on the assumption that the data are linearly separable. However, real data usually do not conform to the linear subspace model. For example, in face clustering, images are affected by the light source, facial geometry and reflectivity, so face data usually lie in nonlinear subspaces. Kernel methods [25][26][27][28] implicitly map data to higher-dimensional spaces in which the data better conform to linear models. However, the selection of a kernel type is empirical, without theoretical guarantee.
With the rapid development of deep learning, deep neural networks have received widespread attention because of their excellent ability to capture complex underlying data structures and learn informative features for clustering. Inspired by this, subspace clustering based on deep neural networks naturally came into view, and many deep subspace clustering models based on different architectures have emerged. These models can better exploit the nonlinear relationships between sample points [29][30][31][32][33][34][35][36][37][38][39][40]. The commonly used deep architectures for these models are Auto-encoders (AEs), Variational Autoencoders (VAEs) [39,41] and Generative Adversarial Networks (GANs) [37,38,40,42], which produce feature representations suitable for clustering [43]. Although both are good generative models, VAEs and GANs differ intrinsically: VAEs seek a probability density explicitly and obtain the optimal solution by maximizing a lower bound on the log-likelihood, whereas GANs find an equilibrium adversarially and do not require an explicit probability density function. Mishra et al. [37] proposed an approach for clustering in the GAN-defined latent space to unveil the patterns in multimodal distributions, via a latent space inversion network.
For the subspace clustering problem, the main tasks are feature extraction and affinity matrix construction. Therefore, as a simple generative and feature extraction model, AEs are often used as the deep architecture combined with subspace clustering. Peng et al. [29] presented the first deep learning-based subspace clustering method, which progressively transforms input data into a nonlinear latent space. The Deep Subspace Clustering Network (DSC-Net) [30] first designed a self-expression layer and a corresponding loss function for the self-expressive matrix of data within a deep auto-encoder. Peng et al. [31] learned a set of explicit transformations to progressively map input data points into nonlinear latent spaces while preserving the local and global subspace structure. Zhang et al. [32] adopted a dual self-supervision structure, which uses the partition produced by spectral clustering to simultaneously supervise the representation process and the self-expression process. Zhang et al. [33] reformulated subspace clustering as a classification problem, which in turn removes the spectral clustering step from the computation. Zhou et al. [34] utilized GANs to guide the learning of DSC-Net, so that the network is steered toward better results in each training round. Peng et al. [36] presented the first work revealing the sample-assignment invariance prior, based on the idea of treating labels as ideal representations. These deep learning-based methods significantly outperform other state-of-the-art subspace clustering methods.
Because of the good performance demonstrated by deep neural networks and the block diagonal constraint, this paper employs a deep neural network to build a subspace clustering model with Block Diagonal Constraint (DSC-BDC), in which block diagonal regularization is used to constrain the self-expression matrix. At the same time, to improve the flexibility and capacity of self-expression, we apply a separation strategy to indirectly enforce the block diagonal structure of the representation matrix. Specifically, the main contributions of this paper can be summarized as follows.
• A deep subspace clustering method based on an auto-encoder is proposed, with block diagonal constraints on the representation matrix for better clustering performance.
• A separation strategy on the block diagonal constraint is proposed for more flexibility.
• Due to the new block diagonal regularizer, an alternating optimization method is developed to solve the proposed model.
• The proposed DSC-BDC is evaluated on four databases, and the results demonstrate the effectiveness of our model.
The rest of this paper is organized as follows. Section 2 gives the proposed DSC-BDC model and optimization algorithm. Section 3 provides experimental results and analysis based on four public databases. Finally, the conclusions and future works are summarized in Section 4.

Deep Subspace Clustering Model with Block Diagonal Constraint
In this section, the details of the proposed Deep Subspace Clustering Model with Block Diagonal Constraint (DSC-BDC) are described. Then, the entire optimization algorithm is given. In addition, the relationship between the proposed model and related work is analyzed.

Model Formulation
Given an input sample set X = {x_1, x_2, ..., x_N} of N samples, each sample may be raw data or manually extracted features, such as images or extracted image features. As illustrated in Figure 1, our deep network selects the convolutional auto-encoder as the basic framework and uses a fully connected layer without bias or nonlinear activation as the self-expression layer. Let Θ denote the auto-encoder parameters, which decompose into encoder parameters θ_e and decoder parameters θ_d. Z denotes the parameter of the self-expression module, which is the representation matrix. F = {f_1, f_2, ..., f_N} is the latent representation corresponding to X, learned by the encoder; that is, f_i = φ(x_i, θ_e), where φ(·) maps the original data to latent features. The output of the decoder is X̂ = {x̂_1, x̂_2, ..., x̂_N}, which is a function of both Θ and Z. Naive deep subspace clustering can be formulated as the minimization problem

min_{Θ,Z} (1/2)‖X − X̂‖²_F + (α₁/2)‖F − FZ‖²_F + ‖Z‖_p, s.t. diag(Z) = 0, (1)

where α₁ is a positive trade-off parameter that balances the terms of the objective function. Here, DSC-Net uses (1/2)‖F − FZ‖²_F to measure the self-expression error, (1/2)‖X − X̂‖²_F to calculate the reconstruction cost and ‖Z‖_p as a structural prior on the self-expressive representation Z, which avoids the trivial solution. Subspace clustering methods based on the self-expressive property first learn a representation matrix, and then spectral clustering is performed on this matrix to obtain the clustering result. Thus, the representation matrix plays an important role in spectral clustering: whether it truly reflects the similarity between data samples directly affects the clustering performance. The ideal representation matrix in spectral clustering should have an obvious block diagonal structure, since all inter-cluster affinities should be zero.
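The self-expression and regularization terms above can be computed directly. The following NumPy sketch illustrates them with hypothetical sizes and a Frobenius-norm choice for ‖Z‖_p; it is a numeric illustration only, not the paper's network:

```python
import numpy as np

# Minimal numeric sketch (not the paper's network): the shapes and values
# below are hypothetical, chosen only to illustrate the loss terms.
rng = np.random.default_rng(0)
N, d = 6, 4                       # 6 samples, 4-dimensional latent features
F = rng.standard_normal((d, N))   # latent features, one column per sample
Z = rng.uniform(size=(N, N))
np.fill_diagonal(Z, 0)            # diag(Z) = 0 rules out the trivial Z = I

self_expr = 0.5 * np.linalg.norm(F - F @ Z, "fro") ** 2  # self-expression error
prior = np.linalg.norm(Z, "fro") ** 2                    # one choice of ||Z||_p
loss = self_expr + prior          # reconstruction term omitted: no network here
```

In the full model these quantities are differentiable in Z, so they can be minimized jointly with the auto-encoder weights by gradient descent.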
To approximate such a structure, sparse or low-rank priors are imposed on the self-expressive representation matrix. There are two main types of constraints: the l1 norm to pursue sparsity [17] and the nuclear norm to pursue low-rankness [15].
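The two priors can be evaluated in a few lines. In this sketch the coefficient matrix is a hypothetical 2×2 example, and NumPy's SVD supplies the singular values for the nuclear norm:

```python
import numpy as np

# Tiny illustration of the two priors on a hypothetical coefficient matrix.
Z = np.array([[0.0, 2.0],
              [0.0, 1.0]])

l1_norm = np.abs(Z).sum()                                # sparsity prior (SSC)
nuclear_norm = np.linalg.svd(Z, compute_uv=False).sum()  # low-rank prior (LRR)
print(l1_norm)   # 3.0
```

Here the l1 norm sums absolute entries, while the nuclear norm sums singular values, penalizing the rank of Z.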
However, these two constraints are only indirect surrogates and cannot guarantee a good block diagonal structure. Unlike these indirect structure priors, Lu et al. [20] first proposed a block diagonal regularization term to directly constrain the representation matrix to a block diagonal structure.
This theorem provides theoretical support for the block diagonal constraint. By Theorem 1, Z is k-block-diagonal (i.e., the graph induced by Z has k connected components) if and only if

∑_{i=N−k+1}^{N} λ_i(L_Z) = 0,

where λ_i(L_Z) (i = 1, ..., N) are the eigenvalues of L_Z in decreasing order. Thus, we can employ the sum of the k smallest eigenvalues of L_Z as the k-block-diagonal regularizer, i.e.,

‖Z‖_κ = ∑_{i=N−k+1}^{N} λ_i(L_Z). (2)

Obviously, the representation matrix Z is k-block-diagonal when ‖Z‖_κ = 0. Thus, ‖Z‖_κ measures the degree to which Z is k-block-diagonal, and we can minimize ‖Z‖_κ to learn a representation Z with a k-block-diagonal structure. Introducing the above k-block-diagonal regularizer from [20], we can rewrite the DSC-Net loss function as

min_{Θ,Z} (1/2)‖X − X̂‖²_F + (α₁/2)‖F − FZ‖²_F + ‖Z‖_p + β‖Z‖_κ, s.t. diag(Z) = 0, (3)

where β > 0 is a trade-off parameter. However, driving ‖Z‖_κ to 0 enforces a strictly block diagonal structure, whose elements are zero except within the diagonal blocks. This is a strong constraint that can make the network over-fit and become unstable, increasing the error.
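A minimal numerical check of the regularizer: the sketch below builds the Laplacian L_Z from a toy nonnegative matrix with two connected components and sums the k smallest Laplacian eigenvalues (the matrix and sizes are illustrative only):

```python
import numpy as np

def bd_regularizer(Z, k):
    """Sum of the k smallest eigenvalues of L_Z = Diag(d) - (Z + Z^T)/2,
    with d_i = sum_j (Z_ij + Z_ji)/2; assumes a nonnegative Z."""
    S = (Z + Z.T) / 2
    L = np.diag(S.sum(axis=1)) - S
    return np.linalg.eigvalsh(L)[:k].sum()  # eigvalsh: ascending eigenvalues

# A matrix with two connected components is 2-block-diagonal: ||Z||_kappa = 0.
Z = np.zeros((4, 4))
Z[0, 1] = Z[1, 0] = 1.0
Z[2, 3] = Z[3, 2] = 1.0
print(bd_regularizer(Z, 2))   # ~0: Z is 2-block-diagonal
```

For k = 3 the same matrix gives a strictly positive value, since it has only 2 connected components.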
Therefore, this paper introduces a separation strategy that splits the self-expression matrix, providing more freedom to guarantee both exact self-expression and an approximately k-block-diagonal structure. Instead of directly constraining Z, this strategy decomposes Z into two distinct matrices C and D of the same size via Z = C ⊙ D, where ⊙ denotes the element-wise (Hadamard) product, i.e., Z_ij = C_ij D_ij, and only C is constrained to have a block diagonal structure, which relaxes the constraint. We bound the magnitudes of C via the constraint C ≤ C_max, to avoid the magnitude of C dominating the other variables. For D, it is easy to see that D_ij = 0 forces (C ⊙ D)_ij = 0 even if C_ij ≠ 0; therefore, we use D_min as a lower bound on the elements of D. Motivated by the above idea, we improve the DSC-Net model into our deep subspace clustering model with block diagonal constraint (DSC-BDC), formulated as

min_{Θ,Z,C,D} (1/2)‖X − X̂‖²_F + (α₁/2)‖F − FZ‖²_F + ‖Z‖_p + (α₂/2)‖Z − C ⊙ D‖²_F + (α₃/2)‖D‖²_F + α₄‖C‖_κ
s.t. 0 ≤ C ≤ C_max, diag(C) = 0, C = Cᵀ, D ≥ D_min, (4)

where Z ∈ R^{N×N} is the representation matrix corresponding to the subspace structure of X; C ∈ R^{N×N} and D ∈ R^{N×N} are obtained by the element-wise decomposition of the representation matrix Z; and α₁, α₂, α₃, α₄ are regularization parameters. The first term of the objective is the reconstruction error of the auto-encoder, the second is the self-expression error and the third is the regularization term that avoids trivial solutions. Of the last three terms, ‖Z − C ⊙ D‖²_F is the decomposition term, ‖D‖²_F is a regularization term, and the template C is constrained to a block diagonal structure through ‖C‖_κ.
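The constraint set of the separation strategy can be sketched as follows; the bounds C_max and D_min and the matrix size are hypothetical, chosen only to show how feasible C, D and Z = C ⊙ D relate:

```python
import numpy as np

# Hypothetical bounds and sizes, only to illustrate the constraint set.
rng = np.random.default_rng(1)
N, C_max, D_min = 5, 1.0, 0.1

C = rng.uniform(size=(N, N))
C = np.clip((C + C.T) / 2, 0.0, C_max)           # C symmetric, 0 <= C <= C_max
np.fill_diagonal(C, 0.0)                          # diag(C) = 0
D = np.maximum(rng.uniform(size=(N, N)), D_min)   # D bounded below by D_min

Z = C * D        # element-wise product Z = C ⊙ D; Z_ij = C_ij * D_ij
```

Because diag(C) = 0, the product Z inherits a zero diagonal automatically, while D keeps every entry of the product away from being forced to zero.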
Thus, the representation matrix Z will have an approximate block diagonal structure, which is expected to improve the clustering performance.

Model Optimization
According to Equation (2), the regularization term ‖C‖_κ is the sum of the k smallest eigenvalues of L_C, which is difficult to minimize directly. However, according to basic properties of eigenvalues [20,45], the regularization term can be rewritten as

‖C‖_κ = min_{W ∈ 𝒲} ⟨L_C, W⟩, (8)

where 𝒲 = {W ∈ R^{N×N} : 0 ⪯ W ⪯ I, tr(W) = k}.
Here, L_C = D_C − (C + Cᵀ)/2, where D_C ∈ R^{N×N} is a diagonal matrix whose i-th diagonal element is ∑_j (c_ij + c_ji)/2, Wᵀ = [w_1, w_2, ..., w_N], and I ∈ R^{N×N} is the identity matrix. ⟨L_C, W⟩ denotes the inner product between the two matrices. Thus, Equation (4) can be rewritten as

min_{Θ,Z,C,D,W} (1/2)‖X − X̂‖²_F + (α₁/2)‖F − FZ‖²_F + ‖Z‖_p + (α₂/2)‖Z − C ⊙ D‖²_F + (α₃/2)‖D‖²_F + α₄⟨L_C, W⟩
s.t. 0 ≤ C ≤ C_max, diag(C) = 0, C = Cᵀ, D ≥ D_min, W ∈ 𝒲. (9)

Obviously, model (9) is a non-convex optimization problem that is difficult to solve directly. Therefore, we employ the alternating direction method (ADM) [46] to alternately optimize the variables Θ, Z, D, C and W.
Update Θ, Z: While all other variables C, D and W are fixed, we update Θ and Z by solving the sub-problem

min_{Θ,Z} (1/2)‖X − X̂‖²_F + (α₁/2)‖F − FZ‖²_F + ‖Z‖_p + (α₂/2)‖Z − C ⊙ D‖²_F, s.t. diag(Z) = 0. (10)

Since (10) is differentiable in Θ and Z, we use stochastic gradient descent to update them.

Update D: When the other variables are fixed, the sub-problem for D is

min_{D ≥ D_min} (α₂/2)‖Z − C ⊙ D‖²_F + (α₃/2)‖D‖²_F. (11)

The elements D_ij are independent of each other, so each can be updated separately from the scalar sub-problem

min_{D_ij ≥ D_min} (α₂/2)(Z_ij − C_ij D_ij)² + (α₃/2)D_ij². (12)

Writing ⊘ for element-wise division, the solution of (12) can be expressed compactly as

D = max(D_min, (α₂ C ⊙ Z) ⊘ (α₂ C ⊙ C + α₃)), (14)

where the maximum is taken element-wise.

Update C: Fixing all other variables Z, Θ, D and W, the sub-problem for C is

min_{C ∈ 𝒞} (α₂/2)‖Z − C ⊙ D‖²_F + α₄⟨L_C, W⟩, (16)

where 𝒞 = {C ∈ R^{N×N} : 0 ≤ C ≤ C_max, diag(C) = 0, C = Cᵀ}. The solution of (16) is given by an element-wise closed form, clipped to [0, C_max] (Equation (17)); to obtain a stable result, a small positive number ε is added to its denominator. The proof of the updating rule is provided in Appendix A.
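Assuming the D sub-problem is the entry-wise quadratic described above, its closed-form update can be sketched as follows (an illustrative reading of the update rule, not the authors' exact code):

```python
import numpy as np

def update_D(Z, C, alpha2, alpha3, D_min):
    # Each entry minimizes the scalar quadratic
    #   alpha2/2 * (Z_ij - C_ij * D_ij)^2 + alpha3/2 * D_ij^2,
    # whose unconstrained minimizer is alpha2*C_ij*Z_ij / (alpha2*C_ij^2 + alpha3);
    # the lower bound D >= D_min is then enforced element-wise.
    D = (alpha2 * C * Z) / (alpha2 * C ** 2 + alpha3)
    return np.maximum(D, D_min)
```

For instance, with Z_ij = C_ij = 1, α₂ = α₃ = 1 and D_min = 0.1, the update returns 0.5, the minimizer of the scalar quadratic.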
Update W: Fixing all the other variables, W is obtained from the problem

min_{W ∈ 𝒲} ⟨L_C, W⟩. (19)

Following Lu et al. [20], the optimal solution is

W = VVᵀ, (20)

where V = [v_1, v_2, ..., v_k] and v_i is an eigenvector associated with the i-th smallest eigenvalue of L_C. Based on the above analysis, we summarize the optimization procedure for our model in Algorithm 1.
Algorithm 1: Optimization of DSC-BDC
1: Initialize D_ij ← D_min, W ← I, C ← 0;
2: while not converged do
3:   iter ← iter + 1; W_old ← W;
4:   Update Θ and Z with (10);
5:   Update D with (14);
6:   Update C with (17);
7:   Update W with (20);
8: end while

According to the previous sub-problems, the complexity analysis is as follows. Updating Z in the network costs O(n²) per step; updating C and D also requires O(n²) operations; and updating W needs O(n³) operations to perform the eigenvalue decomposition. Summing all steps, the total computational complexity of the proposed algorithm is O(n³) per iteration.
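The W-update can be checked numerically: with W = VVᵀ built from the eigenvectors of the k smallest eigenvalues of L_C, the inner product ⟨L_C, W⟩ equals the sum of those eigenvalues. A small sketch with a random symmetric C (illustrative only):

```python
import numpy as np

# Numerical check of the W-update: W = V V^T attains
# <L_C, W> = sum of the k smallest eigenvalues of L_C.
rng = np.random.default_rng(2)
N, k = 6, 2
C = rng.uniform(size=(N, N))
C = (C + C.T) / 2                 # symmetric coefficient matrix (illustrative)
np.fill_diagonal(C, 0.0)

S = (C + C.T) / 2
L = np.diag(S.sum(axis=1)) - S    # Laplacian L_C

vals, vecs = np.linalg.eigh(L)    # ascending eigenvalues
V = vecs[:, :k]                   # eigenvectors of the k smallest eigenvalues
W = V @ V.T                       # optimal W, cf. Equation (20)
inner = np.trace(L @ W)           # <L_C, W>
```

Since the columns of V are orthonormal, W also satisfies tr(W) = k and 0 ⪯ W ⪯ I, so it is feasible for the constraint set 𝒲.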

Experiments
In this section, we carry out a large number of experiments to evaluate DSC-BDC and compare it with several excellent methods published in recent years to verify the performance of the proposed algorithm. In addition, we give a particular explanation about the parameter settings, the pre-trained models and the fine-tuning procedures used in our experiments.
All algorithms were implemented in Python 3.6 with PyTorch and run on a Windows 10 platform with an Intel Core i7-8550U 1.80 GHz CPU and 8 GB RAM.
For the DSC-Net method, we run the same number of epochs as DSC-BDC and report the last-epoch performance. For ConvSCN-BD, we use the default settings suggested by the authors. In the experimental setup, all methods involve the k-means clustering method for the final clustering results. Since the initialization heavily influences the performance of k-means, we run it 20 times and report the average performance and standard deviation.

Databases
We evaluated the DSC-BDC method on four public databases, i.e., the Extended Yale B, ORL and PIE face image databases and the COIL20 object image database. A brief introduction to these databases is given in Table 1.

Training Strategy and Parameter Settings
During model training, we first use a convolutional auto-encoder as the pre-training model to initialize the network parameters Θ, and Θ is then used as the initial parameters for fine-tuning.
During pre-training, we remove the self-expression module, and the pre-trained network degenerates into the simplest convolutional auto-encoder. The output X̂ of the decoder is then a function of {θ_e, θ_d} only. We pre-train the network by minimizing the reconstruction error

min_{θ_e, θ_d} (1/2)‖X − X̂‖²_F.

Once the convolutional auto-encoder is trained, we add the self-expression module, whose parameter is the coefficient matrix Z ∈ R^{N×N}; the output X̂ of the decoder is then a function of {θ_e, Z, θ_d}. This is the deep subspace clustering network, which is trained by minimizing the DSC-BDC loss in Equation (9). For the convolutional layers, the kernel stride is 2 in both horizontal and vertical directions, and the nonlinear activation is the Rectified Linear Unit (ReLU). The learning rate is set to 0.001 in all experiments, and the kernel size in the convolutional layers is 3 × 3 for all databases. For the Extended Yale B database, we set the number of channels to 20; in the fine-tuning stage, we set α₁ = 2, α₂ = 40, α₃ = 1.25, α₄ = 2 and ε = 1 × 10⁻⁶ and run 450 epochs for DSC-BDC. For the ORL database, we set the channels to 3-3-5-5-3-3; in the fine-tuning stage, we set epoch = 700, α₁ = 2, α₂ = 40, α₃ = 1.25, α₄ = 2 and ε = 1 × 10⁻⁶. For the PIE database, we set the number of channels to 20 and set epoch = 450, α₁ = 1, α₂ = 40, α₃ = 1, α₄ = 2 and ε = 1 × 10⁻⁶. On the COIL20 database, the number of channels is 15, and the parameters are consistent with the PIE database except that epoch = 40.

Experimental Results and Analysis
We employ three popular metrics to evaluate the clustering performance of all compared algorithms: Accuracy (ACC), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). With the block-diagonal representation Z, we calculate the affinity matrix A = (|Z| + |Zᵀ|)/2. Then, the clustering result is obtained by performing the spectral clustering method on the affinity matrix A.
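A toy end-to-end illustration of this step, with a hypothetical two-block representation matrix and a sign-of-Fiedler-vector bipartition standing in for full spectral clustering:

```python
import numpy as np

# Toy illustration: a representation matrix with two dense diagonal blocks
# plus weak noise; the affinity A = (|Z| + |Z^T|)/2 is then bipartitioned by
# the sign of the Fiedler vector (a minimal stand-in for spectral clustering).
rng = np.random.default_rng(3)
Z = 0.01 * rng.uniform(size=(6, 6))         # weak cross-cluster affinities
Z[:3, :3] += rng.uniform(0.5, 1.0, (3, 3))  # dense block for cluster 1
Z[3:, 3:] += rng.uniform(0.5, 1.0, (3, 3))  # dense block for cluster 2
np.fill_diagonal(Z, 0.0)

A = (np.abs(Z) + np.abs(Z.T)) / 2           # symmetric affinity matrix
L = np.diag(A.sum(axis=1)) - A              # graph Laplacian
_, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)       # sign of the Fiedler vector
```

In practice one would run a full spectral clustering (k eigenvectors plus k-means) on A; the Fiedler-vector split here only illustrates why a clear block structure in Z makes the partition easy to recover.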
The experimental results of the different methods on the Extended Yale B, ORL, PIE and COIL20 databases are shown in Tables 2-5. The bold numbers in the tables indicate the best results, and a dash means that the corresponding evaluation metric is not reported for that method.
From the results in Tables 2-5, the following observations can be made: (1) Overall, DSC-BDC achieves competitive and stable performance compared to most baselines on all databases.
Taking the Extended Yale B database as an example, DSC-BDC outperforms the next best method, DSC-Net, by 0.44%, 0.74% and 0.73% in terms of ACC, NMI and ARI, respectively. (2) Comparing SSC with AE+SSC, the latter achieves higher accuracy in most cases, which indicates that deep networks can better extract nonlinear data features to improve the clustering effect. (3) Comparing BDR and DSC-Net shows that deep networks have powerful representation learning ability: although DSC-Net does not strictly constrain the structure of the representation matrix, it obtains good clustering performance compared with the network-free BDR method with block diagonal structure constraints. (4) Our DSC-BDC framework achieves better performance than DSC-Net. The reason might be that a direct block diagonal regularizer is more effective than regularization using indirect structure priors on the self-expression.

For visual intuition, we show the derived representation matrices, since a clearer block diagonal structure of the representation matrix results in higher clustering performance. Figure 2 shows the matrices for three databases (Extended Yale B, ORL and COIL20). Clearly, the representation matrix Z has an apparent block diagonal structure.

We next explore the advantages of the method using the separation strategy (DSC-BDC) over the method without it (ConvSCN-BD). We plot the accuracy on the different databases in Figure 3; the results imply the effectiveness of the separation strategy.

Finally, we verify the convergence of DSC-BDC by plotting convergence curves for the four public databases in Figure 4. In each subfigure, the horizontal axis is the number of iteration steps and the vertical axis is the value of the termination criterion. It can be observed that the DSC-BDC algorithm converges within a few iteration steps.

Conclusions
We propose a Deep Subspace Clustering with Block Diagonal Constraint (DSC-BDC) model, in which the coefficient representation matrix is loosely block-diagonally constrained by means of a separation strategy. To solve the joint optimization problem, the algorithm alternately updates the auto-encoder module, the representation matrix and the matrix decomposition parameters. By using the block diagonal constraint, we can learn a better representation matrix from the self-expression coefficients and obtain more stable and excellent clustering performance. The experiments on four benchmark image datasets show that our model is effective and promising. In future work, we will extend our model to deep multi-view subspace clustering. We assume that there is consistent information among the different views of multi-view data and that each view carries view-specific information [21,24,45]. We will impose a separation strategy on multi-view data, so that each view shares consistent structural information while maintaining its view-specific information. At the same time, the block diagonal constraint will be imposed on the shared information (matrix) to improve clustering performance, and we will use the network to learn the variables of the separation strategy module instead of solving them through the alternating iterative algorithm.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Proof of Updating the Variable C. Proof. The sub-problem of updating C is

min_{C ∈ 𝒞} (α₂/2)‖Z − C ⊙ D‖²_F + α₄⟨L_C, W⟩, (A1)

where 𝒞 = {C ∈ R^{N×N} : 0 ≤ C ≤ C_max, diag(C) = 0, C = Cᵀ}.
We first rewrite (A1) in an equivalent element-wise form. Since L_C = D_C − (C + Cᵀ)/2 and W is symmetric, the regularization term satisfies

⟨L_C, W⟩ = ∑_{i≠j} Ŵ_ij C_ij, where Ŵ_ij = (w_ii + w_jj)/2 − w_ij, if i ≠ j; Ŵ_ij = 0, otherwise. (A9)

Apart from the symmetry constraint, the elements C_ij are independent of each other and can be updated separately. To ensure that C is a symmetric matrix, each pair of entries with C_ij = C_ji is optimized jointly, which gives the scalar sub-problem

min_{0 ≤ C_ij ≤ C_max} (α₂/2)[(Z_ij − C_ij D_ij)² + (Z_ji − C_ij D_ji)²] + 2α₄ Ŵ_ij C_ij, i ≠ j.

Setting the derivative with respect to C_ij to zero and projecting onto [0, C_max] yields the closed-form update

C_ij = min(C_max, max(0, (α₂(D_ij Z_ij + D_ji Z_ji) − 2α₄ Ŵ_ij) / (α₂(D_ij² + D_ji²) + ε))),

where ε is a small positive number added to the denominator for numerical stability, and C_ii = 0 by the constraint diag(C) = 0. This is the updating rule (17) used in Algorithm 1. The proof is completed.