A Penalized Matrix Normal Mixture Model for Clustering Matrix Data

Along with advances in technology, matrix data, such as medical/industrial images, have emerged in many practical fields. These data usually have high dimensions and are not easy to cluster due to their intrinsic correlated structure among rows and columns. Most approaches convert matrix data to multi dimensional vectors and apply conventional clustering methods to them, and thus, suffer from an extreme high-dimensionality problem as well as a lack of interpretability of the correlated structure among row/column variables. Recently, a regularized model was proposed for clustering matrix-valued data by imposing a sparsity structure for the mean signal of each cluster. We extend their approach by regularizing further on the covariance to cope better with the curse of dimensionality for large size images. A penalized matrix normal mixture model with lasso-type penalty terms in both mean and covariance matrices is proposed, and then an expectation maximization algorithm is developed to estimate the parameters. The proposed method has the competence of both parsimonious modeling and reflecting the proper conditional correlation structure. The estimators are consistent, and their limiting distributions are derived. We applied the proposed method to simulated data as well as real datasets and measured its clustering performance with the clustering accuracy (ACC) and the adjusted rand index (ARI). The experiment results show that the proposed method performed better with higher ACC and ARI than those of conventional methods.


Introduction
Among the clustering methods, the Gaussian mixture model, which is model-based clustering, is widely used thanks to its easy interpretability. Since most clustering approaches suffer from the curse of dimensionality in "high dimension/low sample size data", so does model-based clustering. The problem is further exacerbated in a Gaussian mixture model that has to estimate multiple covariance matrices. For example, given a p-dimensional random sample X i = (X i1 , . . . , X ip ) T , i = 1, . . . , n for a multivariate Gaussian mixture model with G components, we need to estimate a total of Gp(p + 1)/2 + Gp + G − 1 parameters. So, the number of parameters to be estimated grows quickly with dimension p. To address this problem, a penalized Gaussian mixture clustering model has been proposed that imposes constraints on the mean or precision matrix. In particular, Ref. [1] suggests a penalized multivariate normal mixture clustering model for variable selection by regularizing the mean structure. In each cluster, there may be some variables that are not relevant to distinguish it from other clusters. So, finding the non-informative (noise) variables and removing them may largely improve interpretability. Additionally, Ref. [2] suggests a penalized multivariate normal mixture clustering model for achieving better clustering performance by regularizing the precision matrices. In another way, Ref. [3] adopts the idea of a penalized covariance estimation from [4] and applies a lasso-type penalty to the off-diagonal component of the precision matrices.
However, as the dimension p increases, the computational speed extremely slows down, making it difficult to estimate the parameters practically for these methods. In recent years, along with advances in technology, matrix data have emerged in many practical fields. The conventional Gaussian mixture method converts these matrix data into vector data. So, the dimension of the data can be extremely large, i.e., a p × q matrix will be converted to a vector of pq dimension. In addition, clustering a matrix dataset is difficult because correlations between variables or occasions can change across occasions. To solve the aforementioned issues, Ref. [5] suggests finite mixtures of matrix normal distributions for matrix data. Direct analysis of three-way data structures by using a matrix normal distribution allows considering and estimating correlations between variables and occasions. Recently, the matrix mixture model was extended for various non-normal matrix data: Refs. [6,7] proposed finite mixture of matrix skewed distributions, and [8] introduced two matrix-variate distributions-both elliptical heavy-tailed generalizations of the matrix-variate normal distribution that are used in a finite mixture model. Ref. [9] suggested a penalized matrix normal mixture model by imposing a penalty for the mean matrix. This method can capture the row-wise and column-wise correlation simultaneously and has the ability to find the sparsity nature that is inherent in the signals and images. However, the approach of [9] shows weakness for high dimension/low sample size data because it cannot reduce the number of precision parameters to estimate.
In a multivariate normal distribution, the conditional correlation between two variables, given the other variables, is explicitly determined by the precision matrix. That is, if the (i, j)th element of the precision matrix is zero, then the conditional correlation between the ith and the jth variables given the other variables is zero, and they are conditionally independent. Ref. [10] introduced the idea of covariance selection, which consists of inferring a sparse estimation of the precision matrix and interpreting its sparsity structure as a conditional dependency between variables. A Gaussian graphical model is a network whose values on the vertices follow a centered multivariate normal distribution, and has been applied to analyze high-dimensional data with small sample size through the regularized estimation of the precision matrix [11]. A notable approach for the precision matrix estimation is the graphical lasso algorithm [12].
In this paper, the work of [9] is extended by further regularizing the precision matrix as in the Gaussian graphical model, to cope better with the curse of dimensionality for large size images. Specifically, we propose a penalized matrix normal mixture model with lasso-type penalty terms in both mean and covariance matrices, and then an expectation maximization algorithm is developed to estimate the parameters. The estimators are consistent and their limiting distributions are derived. The experiment results show that the proposed approach attains much improved clustering performance in comparison with conventional methods.
The capability of the proposed model is compared with two popular distance-based methods and two conventional model-based methods in Appendix A Table A1. The major contributions of this research to overcome the limitations of the conventional methods are as follows: (1) Since the data type for the Gaussian mixture model is a vector and K-means and spectral clustering methods use dissimilarity and similarity between the observations as input, these methods do not have the modeling ability for the row-wise/columnwise covariance structure of matrix data (images). On the other hand, both [9] and the proposed model can capture the row-wise and column-wise correlation of image pixels simultaneously by estimating the row covariance U and column covariance V, where the covariance matrix Σ is the Kronecker product of U and V. This capability of modeling separable row-wise/column-wise covariance structure helps for better clustering of image data. (2) Because the proposed model is a model-based clustering method, it has the sound mathematical basis and the interpretability of results with the estimated parameters, whereas centroid-based and similarity-based methods do not.
(3) If variables are measured on different scales and one variable has much larger values than the others, then the first variable will dominate the clustering. Therefore, it is preferred to use a clustering method that is invariant against affine transformations. K-means and similarity-based methods using the Euclidean distance are not invariant against affine transformations, whereas Gaussian mixture models are. (4) Model-based clustering methods have free parameters to estimate and often have difficulty in the estimation of parameters when the data are of a high dimension but the sample is small. Given p × q, random matrix data X, and the number of clusters G, the conventional Gaussian mixture model needs to estimate a total of Gpq(pq + 1)/2 + Gpq + G − 1 parameters because it converts the data into a vector of pq dimension. The method [9] is regularized on the mean matrix, and thus it needs to estimate Gp(p + 1)/2 + Gq(q + 1)/2 + Gpq + G − 1 − r parameters, where r is the number of the mean parameters values that are shrunken to zero. Since the proposed method is regularized further on the precision matrix, that is, on the row and column covariances, U and V, with shrunken elements, it needs to estimate Gp(p + 1)/2 + Gq(q + 1)/2 + Gpq + G − 1 − r * parameters, where r * is the number of the mean and precision parameters values that are shrunken to zero. The range of r and r * is 0 ≤ r ≤ Gpq and r ≤ r * ≤ Gp(p + 1)/2 + Gq(q + 1)/2 + Gpq, respectively. Therefore, the proposed approach is a more parsimonious model that has the fewest parameters and is able to cope better with the curse of dimensionality for large size images. (5) There often exist correlations between the rows and the columns of matrix data. A separable covariance structure of the proposed model can effectively reduce parameters, speed up calculations, and provide practical interpretability. Therefore, the proposed method has the competence of both parsimonious modeling to overcome the curse of dimensionality and reflecting the proper conditional correlation structure for high dimension/low sample size data in comparison with the conventional methods.
The paper is organized as follows. In Section 2, the matrix normal mixture model is explained and then the penalized matrix normal mixture model with the mean and precision matrices is introduced. Next, a new EM algorithm for the precision matrices of the proposed model, a model selection method, k-fold cross-validation, and the asymptotic theory are developed. In Section 3, the proposed method is applied to both simulation data and real data. In the simulation data experiment, the clustering performance is compared with that of the K-means clustering method [13], the spectral clustering method [14], and [9]'s method. The consistency of the estimates for the proposed model is shown, using the averaged spectral norm of the difference (SL) and the averaged Frobenius norm of the difference (FL) [3]. In the real data experiment, the clustering performance is also compared with that of the K-means clustering method, the spectral clustering method and the method of [9]. In Section 4, we summarize the results and discuss future work. The Appendices A-C provides a comparative overview table and the proofs of the penalized maximum likelihood estimator of the mean matrix for the proposed model and the asymptotic theory.

Method
A mixture model using matrix normal distribution was previously introduced by [5].
The matrix normal mixture model [15][16][17] is reviewed is shown in Section 2.1. Then, our penalized matrix normal mixture model, estimation, asymptotic theory, and model selection are presented in Sections 2.2-2.4. The p × q random matrix X has a matrix normal distribution, with mean matrix M and row covariance U and column covariance V, denoted by X ∼ N p,q (M, U, V), if the p.d.f. of X is given by the following: where M ∈ R p×q , U ∈ R p×p , V ∈ R q×q and |A|, tr(A) are the determinant value and the trace of a matrix A, respectively. The p.d.f. (1) can be converted to the p.d.f. of the vectorization of X as follows: where vec is the vectorization operator and ⊗ is the Kronecker product. The matrix normal distribution is a generalization of the matrix variate, i.e., it can effectively handle three-way data. In particular, the separable covariance structure has an important advantage, which is reducing the number of parameters. For example, given the p × q random matrix X, the dimension of inseparable covariance matrix Σ is pq × pq. However, the dimension of separable covariance matrix Σ = V ⊗ U is p × p + q × q.

Maximum Likelihood Estimation
The log-likelihood function for given i.i.d observations X 1 , X 2 , . . . , X n of N p,q (M, U, V) is as follows: The maximum likelihood estimators (m.l.e.) for M, U, V are as follows: Note that m.l.e.,Û andV are of the no closed-form solution, and aV and bÛ, which is a constant multiple ofÛ andV, are also solutions. However, aV ⊗ bÛ must be the same asV ⊗Û [15].Û andV can be obtained by the iterative algorithm of (2) until the convergence criterion is satisfied. The criterion is set as . . , X n from the mixture of matrix normal distributions with G groups, the p.d.f. of the matrix normal mixture is as follows: where Θ j is the set of parameters corresponding to the jth group, i.e., Θ j = (M j , U j , V j ), π 1 , . . . , π G are the weights of belonging to the group g, and f is the matrix normal distribution with mean M j , row covariance U j , column covariance V j . Then, the log-likelihood function of MNMM is as follows: It is difficult to obtain the maximum log-likelihood value by solving the first derivative Equation (3) because the p.d.f is defined as the sum in logarithm. So, the EM algorithm is widely used to estimate parameters by maximizing the conditional expectation for the latent variable z, given the data instead of the log-likelihood function [18].
In the E-step, the posterior probability of the latent variables z ij is defined by Bayes' theorem as follows:τ In the M-step, the parameters,Θ (t+1) , that maximize the expected conditional loglikelihood given data are found as follows: We obtain the following estimates by using the partial derivative and some algebra.
Similar to the m.l.e. of the matrix normal distribution,Û j andV j are obtained by updating the previously estimated parameters in the EM algorithm until a convergence criterion is satisfied.

Penalized Matrix Normal Mixture Model (PMNMM)
Let M g(k,l) be the (k, l)th element of the mean matrix of group g, M g . If M g(k,l) is the same for all groups, i.e., M 1(k,l) = . . . = M G(k,l) , then the (k, l)th element of the mean matrix cannot differentiate the clusters. In this case, when the data are standardized, the (k, l)th element of the standardized mean matrix of each cluster will be 0, and is known to be the noise variable (non-informative) for clustering [2]. Therefore, penalizing on the elements of mean matrix of the matrix normal mixture was proposed by [9] for identifying the underlying spatial structure and improving the signal-to-noise ratio (SNR) of the local field potentials (LFP) signals. However, in order to improve the clustering performance in high-dimension/low sample size data, the regularization of the precision matrix is inevitable because the number of parameters required to estimate the multiple covariance matrices becomes very large. In this paper, the method of [9] is extended by further regularizing the precision matrix to cope better with the curse of dimensionality and reflect the proper conditional correlation structure for matrix data.

Penalized Log-Likelihood Function of PMNMM
The penalized log-likelihood function is considered as follows: where λ 1 , λ 2 , λ 3 > 0 are the tuning parameters, and |A| 1 is the L1 norm, which is the sum of all absolute values of elements of A. Setting λ 2 , λ 3 = 0 is equivalent to the penalized matrix normal mixture model with lasso regularization of the mean matrix proposed in [9]. Additionally, setting λ 1 , λ 2 , λ 3 = 0 is equivalent to the non-regularized matrix mixture model proposed in [5].

EM Algorithm of PMNMM
The EM algorithm is employed as in Section 2.1.3. The E-step is the same as Equation (4). In the M-step, the parameter estimates can be obtained by solving the constraint optimization problem as follows: where U j &V j > 0 means that U j and V j are symmetric and positive definite matrices. The estimate of M can be obtained by the subgradient approach as shown in Appendix B as follows: matrix with all components of 1, and sign(A) (k,l) is the function that extracts the sign of a real number, and abs(A) is the absolute value of A. Then, we focus on the constraint optimization problem with respect to each U and V. To maximize the log-likelihood with the constraint, the problem is redefined as the constraint optimization with respect to U −1 j and V −1 j as follows: and However, for Equations (6)-(9), the subgradient approach is not possible because the solution of the constraint optimization problem must satisfy the conditions that U j and V j are symmetric and positive definite matrices. So, the graphical lasso algorithm [12] is applied to maximize the following constraint optimization function with respect to positive definite matrix Σ.
where S is the sample covariance matrix. Therefore, we can apply the algorithm by substi- ij to estimate U −1 j and V −1 j , respectively. The graphical lasso algorithm is implemented with R package glasso [12]. Then the m.l.e. of U −1 and V −1 are obtained by iterating E-step and M-step until the penalized log-likelihood converges. To prevent the convergence of the parameter estimation from prematurely stopping, Aitken's acceleration [19] is used. Aitken's acceleration computes the acceleration coefficient at iteration t as follows: where l (t) is the penalized log-likelihood value at iteration t. The accelerated maximum of the penalized log-likelihood at iteration t is as follows: The iteration stops when 0 ≤ (l (t) − l (t) )/|l (t) | ≤ for very small positive number . We set = 0.001.
The iterative parameter estimation procedure needs the initial starting values. So the K-means clustering method is used to obtain the initial parameter values. For the fixed number of clusters G, the vectorized data are divided into G clusters, using K-means. Then, based on a group label that is predicted, the m.l.e. are obtained for each cluster; we use those values as the initial value set for the EM algorithm. Additionally, in the process of estimating the parameters that maximize the objective function, the EM algorithm has to be run multiple times because local maximum values may exist.

Asymptotic Theory
The consistency of the penalized likelihood estimator for the mean of the meanregularized matrix normal mixture model [9] under the following mild conditions was shown by the direct application of Theorem 1 in [20]. That is, the parameter space Ψ d 1 ,d 2 must be defined as follows: . . , min{p, q} and h = 1, . . . , G} to guarantee the identifiability of the matrix normal mixture model. Since Theorem 1 in [20] is applicable to all lasso type penalized likelihood estimators, the penalized likelihood estimators of the proposed model are also consistent under the condition (10). We further establish the asymptotic distributions of the estimators, and prove their consistency in another way. We derive asymptotic distributions of the proposed estimators, using the techniques in [4]. The proof is given in Appendix C. Theorem 1. Let X 1 , . . . , X n be a random sample from a mixture matrix normal distribution. Let M j ,Ũ * j , andṼ * j be the penalized maximum likelihood estimator of M j , U −1 j , and V −1 j , respectively, j = 1, . . . , G. Assume that the true parameter value M j , U j , in which H is a p × q matrix, H(l, m) is the (l, m)th element of H, W 1 is a random p × q matrix such that vec(W 1 ) ∼ N(0, Λ 1 ), and Λ 1 is such that the following holds: in which L is a p × p matrix, L(l, m) is the (l, m)th element of L, W 2 is a random p × p matrix such that vec(W 2 ) ∼ N(0, Λ 2 ), and Λ 2 is such that the following holds: where x * i and m * i is the ith row of the matrix X and M j , respectively, in which R is a q × q matrix, R(l, m) is the (l, m)th element of R, W 3 is a random q × q matrix such that vec(W 3 ) ∼ N(0, Λ 3 ), and Λ 3 is such that the following holds: where x i and m i is the ith column of the matrix X and M j , respectively.
Therefore, the proposed estimatorM j is √ n consistent. The proof forŨ * j andṼ * j is same as forM j .

Model Selection
In the penalized mixture model, the selection of the optimal hyperparmeters is very important. The proposed penalized matrix normal mixture model has four hyperparameters (G, λ 1 , λ 2 , λ 3 ). A grid search method is used to select the optimal set of (G, λ 1 , λ 2 , λ 3 ) with the largest penalized log-likelihood. However, even if the grid search method is applied, a model fitted with a fixed dataset is often not reliable due to the overfitting problem. Generally, the k-fold cross-validation (CV) method is used to solve these problems. In k-fold CV, the data X = {X 1 , X 2 , . . . , X n } are split into K mutually exclusive subsets denoted by Y 1 , Y 2 , . . . , Y K . Then, one of the K subsets is selected as the test data, the model with the remaining data is applied, and the penalized log-likelihood value for the test data is calculated with the estimated parameters. After using each of the subsets as test data in turn, the final score is obtained by averaging the K penalized log-likelihood calculated values as follows: where Y −k is the dataset excluding Y k . For example, in 4-fold cross-validation, the mean of the penalized log-likelihood is obtained as shown in Figure 1. Therefore, using the k-fold cross-validation method, we can select the optimal parameter set with the largest mean of the penalized log-likelihood among different combinations of grid parameter values.

Simulation Studies
In Scenarios 1 and 2, we generate 20 × 20 data from two matrix normal distribu- , each with a different type of mean structure. One has a rectangle shape and the other has a cross shape, as in Figure 2. We fix G = 2 in the simulation study, and the ranges of the tuning parameter λ 1 , λ 2 , λ 3 are set to be λ 1 = 20, 10, 5, 1, 0.5, 0.1, λ 2 = 1, 0.1, 0.01, 0.001, λ 3 = 1, 0.1, 0.01, 0.001, and the optimal (λ 1 , λ 2 , λ 3 ) set is determined with the 4-fold CV method. The sample size n is set to be 100, 200, 300, and 1000. To compare the performance for different degrees of sparsity, we consider different precision structures for Scenario 1 and Scenario 2, banding and AR(1) model, respectively: Scenario 1: Banding precision structure The proposed model is compared with the conventional methods: K-means, spectral clustering, and the model of [9]. The criteria for performance are ACC and adjusted rand index (ARI) [21]. ACC shows the closeness of a predicted value to a standard or known value, and ARI is a clustering evaluation method that indicates perfect clustering if it is 0 and random clustering if it is 0. Additionally, the two norms are considered to show the consistency of the parameter estimates of the model: the averaged spectral norm and Frobenius norm of the difference between the estimated precision matrix and the truth.
where A s is the largest singular value of matrix A, and A F is the Frobenius norm.
Since the clustering results depend on the generated data, each scenario is run 50 times. ACC, ARI, SL, and FL are calculated each time, and their average results are reported in Tables 1-4, respectively. Figures 3-6 show the violin plots of the ACC and ARI for each method with a box plot and underlying distribution by sample size in Scenario 1 and 2.      In Scenario 1, Figures 3 and 4 and Table 1 show that the proposed model has the highest ACC (0.995) and ARI (0.98) for sample size 100; the method of [9] follows next. In addition, as n increases, the performance of PMNMM and the method of [9] become similar. To see if there is any significant difference among the means of the ACC and ARI, the Kruskal-Wallis test [22] is applied. The p-value of the test for significant difference among the means of ACC and ARI is 0, so we can confirm that the ACC and ARI means of the models differ under the significance level α = 0.05. Then, to compare pairwise difference, the Dunn test [23] is applied. The null hypothesis of the Dunn test is that there is no difference between the two groups. As a result, PMNMM and the method of [9] is better than both K-means and the spectral method (p-value = 0), and the spectral method is better than K-means (p-value = 0) in all sample sizes. For a small sample case (n = 100), the proposed model shows better performance (p-value = 0) because there are many zero-valued elements of the precision matrix in the outside of the band, which are estimated to be relatively better than other methods. For the other sample size, p-values of the Dunn test between the ACC of PMNMM and the method of [9] is 0.9436, 1, and 1, respectively, and 0.8309, 1, 1 for ARI, respectively. Because the null hypothesis cannot be rejected, the performance of the two models is considered to be the same. As shown in Figures 3 and 4, the ACC and ARI values of the K-means method, spectral method and the method of [9] are more spread than those of PMNMM. This implies that the proposed method produces more stable clustering results. Table 2 shows the mean measured errors between estimated parameters and true parameters for 50 runs in Scenario 1. As the sample size increases, the FL value of the estimated mean matrix decreases, and similarly, the SL and FL values of the estimated precision matrix decrease. To see if there is any significant difference among values for each sample size, the Kruskal-Wallis test and the Dunn test are applied. With the Kruskal-Wallis test, the p-values for the means of FL (mean), SL (precision), and FL (precision) are 0, so the means of measured errors for three norms differ among different sample sizes under the significance level α = 0.05. Next, with the Dunn test, we can confirm that the means of each error decrease as the sample size increases, as p-values are smaller than the significance level α = 0.05 for two different increasing sample sizes. This implies that the proposed method has consistency.
In Scenario 2, similar to the previous process, we can check if there is any significant difference among the means of the ACC and ARI. However, under the AR(1) precision structure, the results are slightly different. For the sample size n = 100, the p-values of the Kruskal-Wallis test are 0.19 and 0.17 for ACC and ARI, respectively, so we cannot reject the null hypothesis that the means of the ACC and ARI for all models are the same. For the other sample sizes, the p-value of the Kruskal-Wallis test is 0, so the Dunn test is applied. For n = 200, the p-value of the Dunn test between PMNMM and spectral is 0.3911, so the null hypothesis cannot be rejected. On the contrary, the p-value of the Dunn test between PMNMM and the method of [9] is 0, so PMNMM is better than the method of [9]. The sparse precision matrix is not shown clearly for the data with n = 100 sample size in Scenario 2, and this hinders PMNMM from exploring sparsity in the precision matrix, showing similar performance as the conventional method. However, for n = 300, the p-values of the Dunn test between any pair of methods are 0. We can confirm that PMNMM is better than the K-means and spectral methods, as well as the method of [9]. For n = 1000, PMNMM and the method of [9] have the same performance (p-values = 1, 1 for ACC and ARI, respectively), and are better than the other two methods, but the performance of the other two methods is the same (p-values = 0.3433, 0.3294). Table 4 shows the mean measured errors between estimated parameters and true parameters in scenario 2. Table 4 shows similar results as Table 2, and the Kruskal-Wallis test and the Dunn test are applied in the same way as before. As a result, for the Kruskal-Wallis test, the p-values for the test of significant difference among the means of each measured error are 0, and for the Dunn test, the p-values for all pairwise comparisons are 0, so all null hypotheses are rejected under the significance level α = 0.05. This implies that the proposed method has consistency.
In the simulation studies, the two precision structures are considered: banding and AR(1). The experimental results of Scenario 1 showed that the performance of the proposed method is better than the other methods in all sample size cases; we can confirm the consistency of the proposed method through the measured errors with the increasing sample size. The penalized matrix normal mixture model in the banding precision matrix showed good performance, even with a very small sample size. However, in the AR(1) precision structure of Scenario 2, a larger number of samples was needed to achieve good performance, and the value of the measured error was higher when compared with Scenario 1. This implies that the proposed model works better for data with a more sparse precision structure.

• Datasets
We evaluated the proposed model on the CIFAR 10 and Fashion-MNIST datasets: The CIFAR 10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 21 September 2021) contains well-known image data. There are 60,000 images of 32 × 32 × 3 size with 10 types-airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck-and each type has 6000 images, respectively. The dataset is divided into a training set with 50,000 images and a test set with 10,000 images. -Fashion-MNIST The Fashion-MNIST dataset (https://github.com/zalandoresearch/fashion-mnist, accessed on 21 September 2021) is a grayscale image dataset, designed to replace the original MNIST dataset. There are 70,000 images of 28 × 28 size with 10 types-T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, ankle bootand each type has 7000 images, respectively. The dataset is divided into a training set with 60,000 images and a test set with 10,000 images.

CIFAR 10 Dataset
In the experiment, two types of images are selected: frog and ship (Figure 7). For the performance comparison of the proposed method, the training images are transformed to grayscale, and 500 images are randomly selected from the training set for each type. Inaddition, the K-means and spectral methods, as well as the method of [9] are the conventional methods to be compared with PMNMM; the number of clusters G is set to 2; and the ranges of the tuning parameter λ 1 , λ 2 , λ 3 are set to λ 1 = 1, 0.5, 0.1, λ 2 = 1, 0.5, 0.1, 0.05, 0.01, λ 3 = 1, 0.5, 0.1, 0.05, 0.01. Using the training set, the optimal set of (λ 1 , λ 2 , λ 3 ) selected through the 4-fold CV method is (1, 0.01, 0.01), and the performance of the proposed model is evaluated on the testing set. A total of 500 images are randomly selected from the test set, and the models are fitted to obtain ACC and ARI, respectively. This process is repeated 50 times to calculate the average ACC and ARI of each model, and the results are shown in Table 5. Then, the Kruskal-Wallis test and the Dunn test are applied to see if there is any significant difference between the methods. As a result, the p-value of the Kruskal-Wallis test is 0, the ACC and ARI of the PMNMM are higher than those of K-means and spectral method with the Dunn test (p-value = 0), but PMNMM and the method of [9] have the same performance. The null hypothesis of the Dunn test between the K-means and spectral methods is not rejected (p-value = 0.075, 0.085 for ACC and ARI, respectively) under the significance level α = 0.05.  To compare PMNMM with [9]'s method further, a more extreme low sample situation is set up by randomly selecting 10 images from the training set for each type. The results are shown in Table 6.  6 show that the performance does not improve significantly even if the sample size increases because the two groups are difficult to be distinguished. There is also little performance difference between the two methods because the weight of the penalty imposed on the precision matrix is small.

Fashion Dataset
We consider applying the methods to the fashion data, which can be distinguished more clearly. Similar to the CIFAR 10 data application, two types of images are selectedshirt and sandal-and 500 images are randomly selected from the training set for each type (Figure 8). The ranges of the tuning parameter λ 1 , λ 2 , λ 3 are set to λ 1 = 1, 0.5, 0.1, λ 2 = 5, 3, 1, 0.5, 0.1, λ 3 = 5, 3, 1, 0.5, 0.1. Using the training set, the optimal set of (λ 1 , λ 2 , λ 3 ) selected through the 4-fold CV method is (0.1, 0.1, 0.1). The results for ACC and ARI are shown in Table 7. It is confirmed that PMNMM and the method of [9] have almost the same performance for the sample size n = 1000. Then, the Kruskal-Wallis and Dunn tests are applied; the p-value of the Kruskal-Wallis test is 0; the ACC and ARI of the PMNMM are higher than those of K-means and spectral methods with the Dunn test (p-value = 0); and the null hypothesis of the Dunn test between the K-means and spectral methods is not rejected (p-value = 1) under the significance level α = 0.05.  To compare PMNMM with the method of [9] further, a more extreme low sample situation is set up by randomly selecting 10 images from the training set for each type. The results are shown in Table 8. Table 8. Performance comparison of PMNMM and the method of [9] for the fashion dataset (The values in parentheses are the standard deviations of the 50 runs' results).  Table 8 shows that the proposed method works better than the method of [9] that imposes a penalty only on the mean matrix for extremely small sample size data.

Conclusions
We have proposed a new penalized mixture model with the penalties imposed both on the mean matrix and precision matrix of the matrix normal mixture model. The conventional Gaussian mixture model for matrix data was challenging because it has to estimate one big covariance structure; therefore, the calculation speed becomes very slow as the dimension increases. On the contrary, the proposed method is a parsimonious model since it reduces a large number of parameters by regularizing the row and column variance matrices of the matrix normal distribution, especially imposing a lasso penalty on the precision matrix. Therefore, the proposed method can reflect the sparse conditional correlation structure and work even with high-dimensional data with a small sample size. The asymptotic distributions of the parameter estimators were derived, and their consistency was proved. We have shown the superior performance of the proposed model by comparing it with three conventional methods (method of [9], K-means method, and spectral method) through both synthetic and real data experiments. A performance comparison was done with ACC and ARI, and PMNMM was significantly better than the K-means and spectral methods by a nonparametric test. In the simulation studies, we evaluated the clustering performance for two different degree of sparsity of precision matrix. For the banding precision structure, the ACC performance of the proposed method reached 0.995 and 0.999, with the small sample size, n = 100 and 200, respectively, which are higher than the other methods. For the AR(1) precision structure, it is similar to that of [9] with n = 100 and higher than those of the other methods with n = 200, 300, and 1000. In conclusion, when the true precision matrix is sparse, that is, the variables have strong conditionally independent structure, the proposed model is well fitted to a small sample size data. For the real datasets, the ACC performance of the proposed method reaches 0.74 and 0.95 for clustering two types of CIFAR 10 and Fashion-MNIST images, respectively, which are higher than or equal to those of the other methods.
In the proposed model, calculation speed is an important factor, as mentioned above. Thus, the graphical lasso algorithm [12] was applied. A faster algorithm to estimate a large precision matrix would be beneficial for the proposed model to improve the calculation speed. In addition, to cluster a 3D image or higher dimensional data, new penalized mixture models for the tensor structure need to be developed. A penalized matrix non-Gaussian (such as skew-normal, skew-t) mixture model is worth developing for clustering more general types of matrix data.

Data Availability Statement:
The data presented in this study are available on https://www.cs. toronto.edu/~kriz/cifar.html, accessed on 21 September 2021 and https://github.com/zalandoresearch/ fashion-mnist, accessed on 21 September 2021. The R code for the proposed method is available on request from the first author.

Acknowledgments:
The authors thank two reviewers for their valuable comments for the improvement of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest. Table   Table A1. A comparative overview table between the proposed method and the conventional methods. r: the number of mean parameters values that are shrunken to zero (0 ≤ r ≤ Gpq); r * : the number of mean and precision parameters values that are shrunken to zero; (r ≤ r * ≤ Gp(p + 1)/2 + Gq(q + 1)/2) + Gpq); NA: not applicable Appendix B

Proof of Equation (5).
using the formula ∂ ∂X tr(AX T BX) = BXA T + B T XA (KB Petersen, 2012 [24]), we obtain the following: Some algebraic manipulations are applied as follows: is the estimate of M j for the non-penalty model, the following Define V 2n (L) as follows: Note the following: where σ j (·) denotes the ith-largest eigenvalue of a matrix. Since we conclude the following: On the other hand, Together with the fact that is the m.l.e. of U j for non-penalized matrix normal mixture model. Since both V 2 (L) and nV 2n (L) are convex and V 2 (L) has a unique minimum, the following holds: Since we conclude the following: On the other hand,