Robust Classification via Finite Mixtures of Matrix Variate Skew-t Distributions

Abbas Mahdavi; Narayanaswamy Balakrishnan; Ahad Jamalizadeh

doi:10.3390/math12203260

¹ Department of Statistics, Vali-e-Asr University of Rafsanjan, Rafsanjan 7718897111, Iran

² Department of Mathematics and Statistics, McMaster University, Hamilton, ON L8S 4K1, Canada

³ Department of Statistics, Faculty of Mathematics & Computer, Shahid Bahonar University of Kerman, Kerman 7616914111, Iran

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(20), 3260; https://doi.org/10.3390/math12203260

This article belongs to the Special Issue New Advances and Applications in Image Processing and Computer Vision

Abstract

Analysis of matrix variate data is becoming increasingly common in the literature, particularly in the field of clustering and classification. It is well known that real data, including real matrix variate data, often exhibit high levels of asymmetry. To address this issue, one common approach is to introduce a tail or skewness parameter to a symmetric distribution. In this regard, we introduce here a new distribution called the matrix variate skew-t distribution (MVST), which provides flexibility in terms of both heavy tails and skewness. We then conduct a thorough investigation of various characterizations and probabilistic properties of the MVST distribution. We also explore extensions of this distribution to a finite mixture model. To estimate the parameters of the MVST distribution, we develop an EM-type algorithm that computes maximum likelihood (ML) estimates of the model parameters. To validate the effectiveness and usefulness of the developed models and associated methods, we perform empirical experiments, using simulated data as well as three real data examples, including an application in skin cancer detection. Our results demonstrate the efficacy of the developed approach in handling asymmetric matrix variate data.

Keywords:

ECME algorithm; image segmentation; mixture models; matrix variate distributions; skewed distributions; truncated normal distribution; truncated t distribution

MSC:

60E05; 62E15; 62F10; 60B20

1. Introduction

The advent of modern data-collection technologies, such as electronic sensors, cell phones and web browsers, has resulted in an abundance of multivariate data sources. Many of these data can be represented as matrix variate (three-way) data, with two ways associated with the row and column dimensions of each matrix variate observation and the third one representing subjects (see [1]). Matrix data can occur in different application domains, such as spatial multivariate data, longitudinal data on multiple response variables or spatio-temporal data. For this reason, statistical methods that can effectively utilize three-way data have become increasingly popular. The matrix variate normal (MVN) distribution is one of the most commonly used matrix variate elliptical distributions. However, for many real phenomena, the tails of the MVN distribution are lighter than required, with a direct impact on the corresponding model. In particular, in robust statistical analysis, heavy-tailed distributions are essential, and these include slash and t distributions. The matrix variate t (MVT) distribution and some of its distributional properties have been discussed in [2].

Flexibility and robustness are often lacking in symmetric models when dealing with highly asymmetric data. To address this issue, a recognized method is to add a tail or skewness parameter to a symmetric distribution. Several formulations have been discussed in the literature in the form of continuous mixtures of normal variables, where a mixing variable operates on the mean or on the variance, or on both the mean and the variance of a multivariate normal variable. A general formulation was presented in [3], which encompasses a large number of existing constructions involving continuous mixtures of normal variables. Given a real-valued function

r (u, w)

and a positive-valued function

s (u, w)

, a generalized mixture of a p variate normal distribution is given by

\begin{matrix} Y \overset{d}{=} ξ + r (U, W) γ + s (U, W) X, \end{matrix}

(1)

where

\overset{d}{=}

denotes equality in distribution,

ξ \in R^{p}

,

γ \in R^{p}

,

X \sim N_{p} (0, Σ)

and U and W are univariate random variables, with

(X, U, W)

being mutually independent.

The representation in (1) can be extended to the matrix variate case as

\begin{matrix} Y \overset{d}{=} M + r (U, W) Λ + s (U, W) Z, \end{matrix}

(2)

where

M

and

Λ

are

n \times p

matrices representing the location and skewness, respectively,

Z \sim N_{n \times p} (0, Σ, Ψ)

and

(Z, U, W)

are mutually independent. It is worth noting that the univariate nature of functions

r (u, w)

and

s (u, w)

simplifies the stochastic representation in (2), leading to more suitable properties for

Y

, and it also facilitates easier parameter estimation. Furthermore, the representation in (2) can be considered, after rearranging into a vector (denoted by

Vec (Y)

), as

\begin{matrix} Vec (Y) \overset{d}{=} Vec (M) + r (U, W) Vec (Λ) + s (U, W) Vec (Z) . \end{matrix}

(3)

In this work, we introduce and study in detail finite mixtures of a new simple matrix variate skew-t (FM-MVST) distribution, based on (2), for dealing with clustering and classification of asymmetric and heavy-tailed matrix variate data. The proposed model’s simplicity in both density function and stochastic representation leads to a convenient strategy for parameter estimation using the expectation–conditional maximization either (ECME; [4]) algorithm, which is a variant of the EM algorithm [5]. In addition, using simulated and real datasets, we show how the proposed EM algorithm can be implemented for determining the ML estimates of the model parameters for the finite mixture of the proposed model.

The rest of this paper is organized as follows. Section 2 discusses the relevant previous research. Section 3 presents the formulation of the MVST distribution and discusses how the ECME algorithm can be used for ML estimation of the model parameters. In Section 4, the finite mixture of MVST distributions is defined, and the implementation of the ECME algorithm for fitting the FM-MVST model is then presented. The proposed methods are illustrated by two simulation studies in Section 5 and by the analysis of three real datasets in Section 6. Finally, some concluding remarks and possible avenues for future work are outlined in Section 7.

2. Related Studies

To address highly asymmetric data and to have more flexibility, the functions

r (u, w)

and

s (u, w)

in (2) can also be considered as p-dimensional random variables. This approach may introduce complexity in both density-function and parameter-estimation issues. For instance, ref. [6] extended the scale and shape mixtures of multivariate skew normal distributions to a matrix variate setting and studied special cases and their properties. Here, we concentrate on simplifying the proposed model and the associated estimation procedure by focusing on the univariate case of

r (u, w)

and

s (u, w)

. However, it should be noted that the proposed model here is distinct from the model presented in [6], and it cannot be viewed as a special case of it.

Based on (2), various cases have been introduced. For example, ref. [7] introduced a matrix variate skew-t distribution using the following matrix variate normal variance–mean mixture representation:

Y \overset{d}{=} M + W Λ + W^{1 / 2} Z,

(4)

where

M

and

Λ

are

n \times p

matrices representing the location and skewness, respectively,

Z \sim N_{n \times p} (0, Σ, Ψ)

and

W \sim I G (ν / 2, ν / 2)

with

I G (\cdot)

denoting the inverse gamma distribution. Herein, we denote the random variable with representation (4) by

Y \sim M V S T I G (M, Σ, Ψ, Λ, ν)

and its resulting density function is given by

\begin{matrix} f_{M V S T I G} (Y; M, Σ, Ψ, Λ, ν) \\ = \frac{2 {(ν / 2)}^{ν / 2} exp {tr [Σ^{- 1} (Y - M) Ψ^{- 1} Λ^{⊤}]}{{(2 π)}^{n p / 2} {| Σ |}^{p / 2} {| Ψ |}^{n / 2} Γ (ν / 2)} {(\frac{δ (Y; M, Σ, Ψ) + ν}{ρ (Σ, Ψ, Λ)})}^{- \frac{ν + n p}{4}} \\ \times κ_{- (ν + n p) / 2} (\sqrt{ρ (Σ, Ψ, Λ) (δ (Y; M, Σ, Ψ) + ν)}), Y \in R^{n \times p}, \end{matrix}

(5)

where

\begin{matrix} δ (Y; M, Σ, Ψ) & = tr [Σ^{- 1} (Y - M) Ψ^{- 1} {(Y - M)}^{⊤}], \\ ρ (Σ, Ψ, Λ) & = tr [Σ^{- 1} Λ Ψ^{- 1} Λ^{⊤}], \end{matrix}

and

κ_{x}

is the modified Bessel function of the third kind with index x. From (4), some other matrix variate skew distributions can be introduced by assuming different distributions for W. For more details, one may refer to [8].
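As an illustrative sketch (not code from the cited works), the density in (5) can be evaluated on the log scale in Python. The function name `mvstig_logpdf` is hypothetical. The sketch assumes that Λ is not the zero matrix, so that ρ > 0, and it uses SciPy's exponentially scaled Bessel function `kve` to reduce the risk of underflow for large arguments.

```python
import numpy as np
from scipy.special import gammaln, kve


def mvstig_logpdf(Y, M, Sigma, Psi, Lam, nu):
    """Log of the MVSTIG density in (5) for a single n x p matrix Y.

    Hypothetical helper; requires rho > 0, i.e. Lam must not be all zeros.
    """
    n, p = Y.shape
    R = Y - M
    Si_R = np.linalg.solve(Sigma, R)          # Sigma^{-1} (Y - M)
    Pi_Rt = np.linalg.solve(Psi, R.T)         # Psi^{-1} (Y - M)^T
    Pi_Lt = np.linalg.solve(Psi, Lam.T)       # Psi^{-1} Lambda^T
    delta = np.trace(Si_R @ Pi_Rt)            # delta(Y; M, Sigma, Psi)
    eta = np.trace(Si_R @ Pi_Lt)              # tr[Sigma^{-1}(Y-M)Psi^{-1}Lambda^T]
    rho = np.trace(np.linalg.solve(Sigma, Lam) @ Pi_Lt)
    order = 0.5 * (nu + n * p)                # K_{-x} = K_{x}, so use the positive order
    z = np.sqrt(rho * (delta + nu))
    log_bessel = np.log(kve(order, z)) - z    # log K_order(z), computed stably
    _, logdet_S = np.linalg.slogdet(Sigma)
    _, logdet_P = np.linalg.slogdet(Psi)
    return (np.log(2.0) + 0.5 * nu * np.log(nu / 2.0) + eta
            - 0.5 * n * p * np.log(2.0 * np.pi)
            - 0.5 * p * logdet_S - 0.5 * n * logdet_P - gammaln(nu / 2.0)
            - 0.5 * order * np.log((delta + nu) / rho)
            + log_bessel)
```

Working with `kve(order, z)` and subtracting `z` afterwards is a design choice that keeps the Bessel term finite when the quadratic form δ is large.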

Ref. [9] introduced a new family of matrix variate distributions, based on the matrix variate mean mixture of normal (MVMMN) distributions, as

\begin{matrix} Y \overset{d}{=} M + W Λ + Z . \end{matrix}

(6)

Based on (6), three special cases, including the restricted matrix variate skew-normal (RMVSN), exponentiated MVMMN (MVMMNE) and mixed-Weibull MVMMN (MVMMNW) distributions, have been studied by using half-normal, exponential and Weibull distributions for W, respectively. Several other skew matrix variate distributions have also been discussed in the literature; see [10,11,12].

One common statistical challenge faced by researchers is identifying sub-populations or clusters within multivariate data. Recently, researchers have explored the use of finite-mixture models for matrix variate data in applications such as image analysis, genetics and neuroscience. These models offer a flexible framework for capturing complex patterns in the data and can provide insights into the underlying sub-populations or clusters; see [8,13,14,15,16,17,18].

3. Methodology

3.1. The Model

An

n \times p

-variate random matrix

Y

is said to have a matrix variate skew-t (MVST) distribution, with

n \times p

location matrix

M

,

n \times n

and

p \times p

scale matrices

Σ

and

Ψ

,

n \times p

shape matrix

Λ

and flatness parameter

ν

, if its probability density function (pdf) is

\begin{matrix} f_{M V S T} (Y; θ) \\ = \frac{2 {(ν / 2)}^{ν / 2} Γ (\frac{ν + n p}{2}) {| Σ |}^{- p / 2} {| Ψ |}^{- n / 2}}{{(2 π)}^{n p / 2} Γ (ν / 2) \sqrt{ρ (Σ, Ψ, Λ) + 1}} {(\frac{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}{2})}^{- \frac{ν + n p}{2}} \\ \times T_{(ν + n p)} (Δ (Y; θ) \sqrt{\frac{ν + n p}{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}}), Y \in R^{n \times p}, \end{matrix}

(7)

where

θ = (M, Σ, Ψ, Λ, ν)

denotes all the model parameters,

Δ (Y; θ) = tr [Σ^{- 1} (Y - M) Ψ^{- 1} Λ^{⊤}] / \sqrt{ρ (Σ, Ψ, Λ) + 1}

and

T_{ν} (\cdot)

denotes the cumulative distribution function (cdf) of the Student’s t distribution with

ν

degrees of freedom. The MVST distribution reduces to the RMVSN distribution [9] when

ν \to \infty

.
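For illustration, a minimal Python sketch of the log of the density in (7) is given here. The function name `mvst_logpdf` is an assumption made for this example and is not part of any package; the computation is carried out on the log scale for numerical stability.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import t as student_t


def mvst_logpdf(Y, M, Sigma, Psi, Lam, nu):
    """Log of the MVST density in (7) for a single n x p matrix Y (hypothetical helper)."""
    n, p = Y.shape
    R = Y - M
    Si_R = np.linalg.solve(Sigma, R)          # Sigma^{-1} (Y - M)
    Si_L = np.linalg.solve(Sigma, Lam)        # Sigma^{-1} Lambda
    Pi_Rt = np.linalg.solve(Psi, R.T)         # Psi^{-1} (Y - M)^T
    Pi_Lt = np.linalg.solve(Psi, Lam.T)       # Psi^{-1} Lambda^T
    delta = np.trace(Si_R @ Pi_Rt)            # delta(Y; M, Sigma, Psi)
    rho = np.trace(Si_L @ Pi_Lt)              # rho(Sigma, Psi, Lambda)
    eta = np.trace(Si_R @ Pi_Lt)              # numerator of Delta(Y; theta)
    Delta = eta / np.sqrt(rho + 1.0)
    A = delta + nu - Delta ** 2               # strictly positive, since Delta^2 < delta
    df = nu + n * p
    _, logdet_S = np.linalg.slogdet(Sigma)
    _, logdet_P = np.linalg.slogdet(Psi)
    return (np.log(2.0) + 0.5 * nu * np.log(nu / 2.0)
            + gammaln(df / 2.0) - gammaln(nu / 2.0)
            - 0.5 * p * logdet_S - 0.5 * n * logdet_P
            - 0.5 * n * p * np.log(2.0 * np.pi) - 0.5 * np.log(rho + 1.0)
            - 0.5 * df * np.log(A / 2.0)
            + student_t.logcdf(Delta * np.sqrt(df / A), df))
```

When `Lam` is the zero matrix, the t cdf term equals 1/2 and cancels the leading factor 2, so the sketch returns the log density of a symmetric matrix variate t distribution.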

Moreover, the MVST distribution possesses the stochastic representation

\begin{matrix} Y \overset{d}{=} M + W^{- 1 / 2} (U Λ + Z), \end{matrix}

(8)

where

Z \sim N_{n \times p} (0, Σ, Ψ)

,

W \sim Γ (ν / 2, ν / 2)

and

U \sim T N (0, 1) I_{(0, \infty)}

. Herein,

T N (μ, σ^{2}) I_{A}

represents a doubly truncated normal distribution defined in the interval

A = {a_{1} < x < a_{2}}

, and

I_{A}

denotes the indicator function of set

A

. It is important to highlight that the MVSTIG distribution, represented by (4), utilizes a single mixing random variable. In contrast, the proposed MVST distribution, represented by (8), incorporates two mixing random variables, W and U, which can significantly increase the model’s flexibility.
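The representation in (8) suggests a direct way of generating MVST random matrices. The following Python sketch is one possible implementation under stated assumptions: the function name `sample_mvst` is hypothetical, the matrix normal variate is built from Cholesky factors of Σ and Ψ, and the gamma distribution is parameterized by shape and rate, so NumPy's `scale` argument is set to 2/ν.

```python
import numpy as np


def sample_mvst(N, M, Sigma, Psi, Lam, nu, rng=None):
    """Draw N matrices from the MVST distribution via representation (8)."""
    rng = np.random.default_rng(rng)
    n, p = M.shape
    A = np.linalg.cholesky(Sigma)                 # Sigma = A A^T (row scale)
    B = np.linalg.cholesky(Psi)                   # Psi = B B^T (column scale)
    E = rng.standard_normal((N, n, p))            # i.i.d. N(0, 1) entries
    Z = A @ E @ B.T                               # Z_i ~ N_{n x p}(0, Sigma, Psi)
    W = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=N)  # rate nu/2 -> scale 2/nu
    U = np.abs(rng.standard_normal(N))            # N(0, 1) truncated to (0, inf)
    inv_sqrt_w = W ** -0.5
    return M + inv_sqrt_w[:, None, None] * (U[:, None, None] * Lam + Z)
```

For example, `sample_mvst(1000, M, Sigma, Psi, Lam, nu=5, rng=0)` returns an array of shape `(1000, n, p)`; passing an integer as `rng` makes the draws reproducible.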

From (8), it is easy to show that

\begin{matrix} E (Y) & = M + \sqrt{\frac{ν}{π}} \frac{Γ (\frac{ν - 1}{2})}{Γ (\frac{ν}{2})} Λ, ν > 1, \\ Vec (Y) & \sim r S T_{n p} (Vec (M), Ψ \otimes Σ, Vec (Λ)), \end{matrix}

where ⊗ is the Kronecker product and

r S T_{p}

denotes the p-variate restricted skew-t distribution (see [19,20]).

The stochastic representation given in (8) not only facilitates random number generation, but also enables the implementation of the EM algorithm for determining the maximum likelihood (ML) estimates of the parameters of the MVST distribution. This leads to the hierarchical representation

\begin{matrix} Y | (γ, w) & \sim N_{n \times p} (M + γ Λ, w^{- 1} Σ, Ψ), \\ γ | w & \sim T N (0, w^{- 1}) I_{(0, \infty)}, \\ W & \sim Γ (ν / 2, ν / 2), \end{matrix}

(9)

where

γ = W^{- 1 / 2} U

and W are treated as latent variables. Then,

(Y, W, γ)

has the joint pdf

\begin{matrix} f_{Y, W, γ} (Y, w, γ) = 2 w^{1 / 2} ϕ_{n \times p} (Y; M + γ Λ, w^{- 1} Σ, Ψ) ϕ (w^{1 / 2} γ) g (w; ν / 2, ν / 2), \end{matrix}

(10)

where

ϕ (\cdot)

and

ϕ_{n \times p} (\cdot; M, Σ, Ψ)

are the pdfs of

N (0, 1)

and

N_{n \times p} (M, Σ, Ψ)

, respectively, and

g (\cdot; α, β)

denotes the pdf of the gamma distribution with mean

α / β

.

Integrating out W and

γ

, respectively, from (10), we obtain the joint pdfs

\begin{matrix} f_{Y, γ} (Y, γ) & = \frac{2 {(ν / 2)}^{ν / 2} Γ (\frac{ν + n p + 1}{2}) {| Σ |}^{- p / 2} {| Ψ |}^{- n / 2}}{{(2 π)}^{(n p + 1) / 2} Γ (ν / 2)} \\ {(\frac{δ (Y; M, Σ, Ψ) + (ρ (Σ, Ψ, Λ) + 1) γ^{2} - 2 η (Y; θ) γ + ν}{2})}^{- \frac{ν + n p + 1}{2}}, \end{matrix}

(11)

where

η (Y; θ) = tr [Σ^{- 1} (Y - M) Ψ^{- 1} Λ^{⊤}]

, and

\begin{matrix} f_{Y, W} (Y, w) & = \frac{2 {(ν / 2)}^{ν / 2} {| Σ |}^{- p / 2} {| Ψ |}^{- n / 2}}{{(2 π)}^{n p / 2} Γ (ν / 2) \sqrt{ρ (Σ, Ψ, Λ) + 1}} w^{\frac{ν + n p}{2} - 1} \\ \times exp \{- \frac{w (δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ))}{2}\} Φ (w^{1 / 2} Δ (Y; θ)), \end{matrix}

(12)

where

Φ (\cdot)

denotes the cdf of the standard normal distribution.

Dividing (10) by (11), we obtain

\begin{matrix} W | (Y, γ) \\ \sim Γ (\frac{ν + n p + 1}{2}, \frac{δ (Y; M, Σ, Ψ) + (ρ (Σ, Ψ, Λ) + 1) γ^{2} - 2 η (Y; θ) γ + ν}{2}) . \end{matrix}

(13)

Additionally, dividing (10) by (12), we obtain

\begin{matrix} γ | (Y, w) \sim T N (\frac{η (Y; θ)}{ρ (Σ, Ψ, Λ) + 1}, \frac{w^{- 1}}{ρ (Σ, Ψ, Λ) + 1}) I_{(0, \infty)} . \end{matrix}

(14)

From (7) and (12), it is easy to see that

\begin{matrix} f (w | Y) = C w^{\frac{ν + n p}{2} - 1} exp \{- \frac{w (δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ))}{2}\} Φ (w^{1 / 2} Δ (Y; θ)), \end{matrix}

(15)

where

C = \frac{{(\frac{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}{2})}^{\frac{ν + n p}{2}}}{Γ (\frac{ν + n p}{2}) T_{(ν + n p)} (Δ (Y; θ) \sqrt{\frac{ν + n p}{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}})} .

(16)

Furthermore, by using (7) and (11), it can be shown that

\begin{matrix} γ | Y \sim T t (\frac{η (Y; θ)}{ρ (Σ, Ψ, Λ) + 1}, \frac{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}{(ρ (Σ, Ψ, Λ) + 1) (ν + n p)}, ν + n p) I_{(0, \infty)}, \end{matrix}

(17)

where

T t (μ, σ^{2}, ν) I_{A}

represents a doubly truncated t distribution with

ν

degrees of freedom defined in the interval

A = {a_{1} < x < a_{2}}

. From the conditional density in (15), we find

\begin{matrix} E (W | Y) = C_{0} \frac{T_{(ν + n p + 2)} (Δ (Y; θ) \sqrt{C_{2}})}{T_{(ν + n p)} (Δ (Y; θ) \sqrt{C_{0}})}, \end{matrix}

(18)

where

\begin{matrix} C_{0} = \frac{ν + n p}{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}, C_{2} = \frac{ν + n p + 2}{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)} . \end{matrix}

Additionally, using the law of iterated expectations, we can obtain

\begin{matrix} E (γ W | Y) = \frac{η (Y; θ)}{ρ (Σ, Ψ, Λ) + 1} E (W | Y) + \frac{1}{\sqrt{ρ (Σ, Ψ, Λ) + 1}} ζ (Y) \end{matrix}

(19)

and

\begin{matrix} E (γ^{2} W | Y) & = \frac{1}{ρ (Σ, Ψ, Λ) + 1} + \frac{η^{2} (Y; θ)}{{(ρ (Σ, Ψ, Λ) + 1)}^{2}} E (W | Y) \\ + \frac{η (Y; θ)}{{(ρ (Σ, Ψ, Λ) + 1)}^{3 / 2}} ζ (Y), \end{matrix}

(20)

where

\begin{matrix} ζ (Y) & = \frac{Γ (\frac{ν + n p + 1}{2})}{\sqrt{2 π} T_{(ν + n p)} (Δ (Y; θ) \sqrt{C_{0}}) Γ (\frac{ν + n p}{2})} {(\frac{δ (Y; M, Σ, Ψ) + ν}{2})}^{- \frac{ν + n p + 1}{2}} \\ \times {(\frac{δ (Y; M, Σ, Ψ) + ν - Δ^{2} (Y; θ)}{2})}^{\frac{ν + n p}{2}} . \end{matrix}
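The conditional expectations in (18)–(20) depend on the data only through δ, η and ρ, so they can be computed in a vectorized way. The following Python sketch is illustrative; the function name `e_step_moments` and its argument layout are assumptions made for this example, and ζ(Y) is evaluated on the log scale to limit overflow.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import t as student_t


def e_step_moments(delta, eta, rho, nu, n, p):
    """Return E(W|Y), E(gamma W|Y) and E(gamma^2 W|Y) from (18)-(20).

    delta and eta may be scalars or arrays (one entry per observation).
    """
    df = nu + n * p
    Delta = eta / np.sqrt(rho + 1.0)
    A = delta + nu - Delta ** 2
    c0 = df / A
    c2 = (df + 2.0) / A
    log_T0 = student_t.logcdf(Delta * np.sqrt(c0), df)
    # (18): ratio of two Student's t cdfs, scaled by C_0
    w_hat = c0 * np.exp(student_t.logcdf(Delta * np.sqrt(c2), df + 2.0) - log_T0)
    # zeta(Y), as defined after (20)
    log_zeta = (gammaln((df + 1.0) / 2.0) - gammaln(df / 2.0)
                - 0.5 * np.log(2.0 * np.pi) - log_T0
                - 0.5 * (df + 1.0) * np.log((delta + nu) / 2.0)
                + 0.5 * df * np.log(A / 2.0))
    zeta = np.exp(log_zeta)
    mu = eta / (rho + 1.0)
    kappa1 = mu * w_hat + zeta / np.sqrt(rho + 1.0)                                # (19)
    kappa2 = 1.0 / (rho + 1.0) + mu ** 2 * w_hat + mu * zeta / np.sqrt(rho + 1.0)  # (20)
    return w_hat, kappa1, kappa2
```

The last term of `kappa2` equals η ζ(Y) / (ρ + 1)^{3/2}, written here in terms of `mu` to reuse the intermediate quantity.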

3.2. Parameter Estimation via the ECME Algorithm

Suppose

Y = (Y_{1}, \dots, Y_{N})

constitutes a set of

n \times p

-dimensional observed samples of size N arising from the MVST model. In the EM framework, the latent variables are

w = (w_{1}, \dots, w_{N})

and

γ = (γ_{1}, \dots, γ_{N})

. With these, the complete data are given by

Y_{c} = (Y, w, γ)

.

According to (10), the log likelihood function of

θ

corresponding to the complete data

Y_{c}

, excluding additive constants and terms that do not involve parameters of the model, is given by

\begin{matrix} ℓ_{c} (θ ∣ Y_{c}) & = \frac{1}{2} \sum_{i = 1}^{N} {ν log (\frac{ν}{2}) - 2 log Γ (\frac{ν}{2}) - p log | Σ | - n log | Ψ | \\ + 2 η (Y_{i}; θ) γ_{i} w_{i} - (ρ (Σ, Ψ, Λ) + 1) γ_{i}^{2} w_{i} \\ - (δ (Y_{i}; M, Σ, Ψ) + ν) w_{i} + (ν + n p - 1) log w_{i}} . \end{matrix}

(21)

In the kth iteration, the E step requires the calculation of the so-called Q function, which is the conditional expectation of (21), given the observed data

Y

and the current estimate

{\hat{θ}}^{(k)}

, where the superscript ^(k) denotes the updated estimates at the kth step of the iterative process. To evaluate the Q function, we then need the following conditional expectations:

\begin{matrix} {\hat{w}}_{i}^{(k)} = E (W_{i} ∣ Y_{i}, {\hat{θ}}^{(k)}), {\hat{κ}}_{1 i}^{(k)} = E (γ_{i} W_{i} ∣ Y_{i}, {\hat{θ}}^{(k)}), {\hat{κ}}_{2 i}^{(k)} = E (γ_{i}^{2} W_{i} ∣ Y_{i}, {\hat{θ}}^{(k)}), \end{matrix}

(22)

which have the explicit expressions given in (18)–(20), but also the expectation

{\hat{κ}}_{3 i}^{(k)} = E (log W_{i} ∣ Y_{i}, {\hat{θ}}^{(k)}),

(23)

which is difficult to evaluate explicitly. So, we perform the ECME algorithm, which replaces the conditional maximization Q function (CMQ) step with the conditional maximization log likelihood (CML) step, to avoid computing the expectation in (23).

Substituting (22) and (23) into (21), we obtain the following expression for the Q function:

\begin{matrix} Q (θ ∣ {\hat{θ}}^{(k)}) & = \frac{1}{2} \sum_{i = 1}^{N} {ν log (\frac{ν}{2}) - 2 log Γ (\frac{ν}{2}) - p log | Σ | - n log | Ψ | \\ + 2 η (Y_{i}; θ) {\hat{κ}}_{1 i}^{(k)} - (ρ (Σ, Ψ, Λ) + 1) {\hat{κ}}_{2 i}^{(k)} \\ - (δ (Y_{i}; M, Σ, Ψ) + ν) {\hat{w}}_{i}^{(k)} + (ν + n p - 1) {\hat{κ}}_{3 i}^{(k)}} . \end{matrix}

(24)

The CM steps are then implemented to update the estimates of

θ

in the order of

M

,

Σ

,

Ψ

,

Λ

and

ν

by maximizing, one by one, the Q function obtained in the E step. After some algebraic manipulations, they are summarized in the following CMQ and CML steps:

CMQ step 1: Fixing $Λ = {\hat{Λ}}^{(k)}$ , we update ${\hat{M}}^{(k)}$ by maximizing (24) with respect to $M$ , leading to

${\hat{M}}^{(k + 1)} = \frac{\sum_{i = 1}^{N} {\hat{w}}_{i}^{(k)} Y_{i} - {\hat{Λ}}^{(k)} \sum_{i = 1}^{N} {\hat{κ}}_{1 i}^{(k)}}{\sum_{i = 1}^{N} {\hat{w}}_{i}^{(k)}};$
CMQ step 2: Fixing $M = {\hat{M}}^{(k + 1)}$ , $Ψ = {\hat{Ψ}}^{(k)}$ and $Λ = {\hat{Λ}}^{(k)}$ , we then update ${\hat{Σ}}^{(k)}$ by maximizing (24) over $Σ$ , yielding

$\begin{matrix} {\hat{Σ}}^{(k + 1)} & = \frac{1}{N p} {\sum_{i = 1}^{N} {\hat{w}}_{i}^{(k)} (Y_{i} - {\hat{M}}^{(k + 1)}) {\hat{Ψ}}^{- 1 (k)} {(Y_{i} - {\hat{M}}^{(k + 1)})}^{⊤} \\ + {\hat{Λ}}^{(k)} {\hat{Ψ}}^{- 1 (k)} {\hat{Λ}}^{⊤ (k)} \sum_{i = 1}^{N} {\hat{κ}}_{2 i}^{(k)} - \sum_{i = 1}^{N} {\hat{κ}}_{1 i}^{(k)} (Y_{i} - {\hat{M}}^{(k + 1)}) {\hat{Ψ}}^{- 1 (k)} {\hat{Λ}}^{⊤ (k)} \\ - {\hat{Λ}}^{(k)} {\hat{Ψ}}^{- 1 (k)} \sum_{i = 1}^{N} {\hat{κ}}_{1 i}^{(k)} {(Y_{i} - {\hat{M}}^{(k + 1)})}^{⊤}}; \end{matrix}$
CMQ step 3: Fixing $M = {\hat{M}}^{(k + 1)}$ , $Σ = {\hat{Σ}}^{(k + 1)}$ and $Λ = {\hat{Λ}}^{(k)}$ , we update ${\hat{Ψ}}^{(k)}$ by maximizing (24) over $Ψ$ , yielding

$\begin{matrix} {\hat{Ψ}}^{(k + 1)} & = \frac{1}{N n} {\sum_{i = 1}^{N} {\hat{w}}_{i}^{(k)} {(Y_{i} - {\hat{M}}^{(k + 1)})}^{⊤} {\hat{Σ}}^{- 1 (k + 1)} (Y_{i} - {\hat{M}}^{(k + 1)}) \\ + {\hat{Λ}}^{⊤ (k)} {\hat{Σ}}^{- 1 (k + 1)} {\hat{Λ}}^{(k)} \sum_{i = 1}^{N} {\hat{κ}}_{2 i}^{(k)} - \sum_{i = 1}^{N} {\hat{κ}}_{1 i}^{(k)} {(Y_{i} - {\hat{M}}^{(k + 1)})}^{⊤} {\hat{Σ}}^{- 1 (k + 1)} {\hat{Λ}}^{(k)} \\ - {\hat{Λ}}^{⊤ (k)} {\hat{Σ}}^{- 1 (k + 1)} \sum_{i = 1}^{N} {\hat{κ}}_{1 i}^{(k)} (Y_{i} - {\hat{M}}^{(k + 1)})}; \end{matrix}$
CMQ step 4: Fixing $M = {\hat{M}}^{(k + 1)}$ , we obtain ${\hat{Λ}}^{(k + 1)}$ by maximizing (24) over $Λ$ , yielding

$\begin{matrix} {\hat{Λ}}^{(k + 1)} = \frac{\sum_{i = 1}^{N} {\hat{κ}}_{1 i}^{(k)} (Y_{i} - {\hat{M}}^{(k + 1)})}{\sum_{i = 1}^{N} {\hat{κ}}_{2 i}^{(k)}} . \end{matrix}$
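The four CMQ updates above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions: the function name `cmq_updates` is hypothetical, the observations are stored in an array of shape (N, n, p), and the E-step quantities are passed as vectors of length N.

```python
import numpy as np


def cmq_updates(Y, w, k1, k2, Sigma, Psi, Lam):
    """One pass of CMQ steps 1-4. Y: (N, n, p); w, k1, k2: (N,) E-step quantities."""
    N, n, p = Y.shape
    # CMQ step 1: location matrix
    M_new = (np.einsum('i,ijk->jk', w, Y) - Lam * k1.sum()) / w.sum()
    R = Y - M_new
    S_R = np.einsum('i,ijk->jk', k1, R)                 # sum_i k1_i (Y_i - M)
    # CMQ step 2: row scale matrix Sigma (uses the current Psi and Lambda)
    Psi_inv = np.linalg.inv(Psi)
    S_RR = np.einsum('i,ijk,kl,iml->jm', w, R, Psi_inv, R)   # sum_i w_i R Psi^{-1} R^T
    cross = S_R @ Psi_inv @ Lam.T
    Sigma_new = (S_RR + (Lam @ Psi_inv @ Lam.T) * k2.sum() - cross - cross.T) / (N * p)
    # CMQ step 3: column scale matrix Psi (uses the updated Sigma)
    Sig_inv = np.linalg.inv(Sigma_new)
    P_RR = np.einsum('i,ikj,kl,ilm->jm', w, R, Sig_inv, R)   # sum_i w_i R^T Sigma^{-1} R
    cross_p = S_R.T @ Sig_inv @ Lam
    Psi_new = (P_RR + (Lam.T @ Sig_inv @ Lam) * k2.sum() - cross_p - cross_p.T) / (N * n)
    # CMQ step 4: skewness matrix
    Lam_new = S_R / k2.sum()
    return M_new, Sigma_new, Psi_new, Lam_new
```

The two cross terms in each scale update are transposes of one another, which is why the sketch computes one of them and subtracts it together with its transpose.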

An update of

{\hat{ν}}^{(k)}

can be achieved by directly maximizing the constrained actual log likelihood function. This gives rise to the following CML step:

CML step: Update ${\hat{ν}}^{(k)}$ by optimizing the following constrained log likelihood function:

${\hat{ν}}^{(k + 1)} = arg max_{ν} \sum_{i = 1}^{N} log f_{M V S T} (Y_{i}; {\hat{M}}^{(k + 1)}, {\hat{Σ}}^{(k + 1)}, {\hat{Ψ}}^{(k + 1)}, {\hat{Λ}}^{(k + 1)}, ν) .$
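Since the CML step is a one-dimensional optimization, it can be sketched with a bounded scalar optimizer. In the following Python sketch, the function names and the search interval (2, 200) are assumptions made for illustration; only the terms of log f_MVST in (7) that depend on ν are retained, and δ and η are assumed to have been computed at the updated values of M, Σ, Ψ and Λ.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln
from scipy.stats import t as student_t


def nu_profile_loglik(nu, delta, eta, rho, n, p):
    """Observed log likelihood as a function of nu, dropping terms free of nu."""
    df = nu + n * p
    Delta = eta / np.sqrt(rho + 1.0)
    A = delta + nu - Delta ** 2
    ll = (0.5 * nu * np.log(nu / 2.0) + gammaln(df / 2.0) - gammaln(nu / 2.0)
          - 0.5 * df * np.log(A / 2.0)
          + student_t.logcdf(Delta * np.sqrt(df / A), df))
    return float(np.sum(ll))


def update_nu(delta, eta, rho, n, p, bounds=(2.0, 200.0)):
    """CML step: maximize the observed log likelihood over nu within the given bounds."""
    res = minimize_scalar(lambda v: -nu_profile_loglik(v, delta, eta, rho, n, p),
                          bounds=bounds, method='bounded')
    return res.x
```

The upper bound acts as a practical stand-in for the skew-normal limit; a different interval may be more appropriate for a given dataset.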

4. Fitting Finite Mixtures of MVST Distributions

4.1. The Model

We consider N independent random variables

Y_{1}, \dots, Y_{N}

observed from a G-component mixture of MVST distributions, whose pdf is given by

\begin{matrix} f (Y_{i}; Θ) = \sum_{g = 1}^{G} π_{g} f_{M V S T} (Y_{i}; θ_{g}), \end{matrix}

where

0 \leq π_{g} \leq 1

,

\sum_{g = 1}^{G} π_{g} = 1

and

Θ

is the set containing all the parameters of the considered mixture model. To pose this mixture model as an incomplete data problem, we introduce allocation variables

Z_{i} = (Z_{i 1}, \dots, Z_{i G})

, where a particular element

Z_{i g}

is equal to 1 if

Y_{i}

belongs to group g and is equal to zero, otherwise. Observe that

Z_{i}

is a multinomial random vector with one trial and cell probabilities

π_{1}, \dots, π_{G}

, denoted by

Z_{i} \sim M (1; π_{1}, \dots, π_{G})

. The hierarchical representation in (9), originally designed for the single distribution, can be extended to the mixture modeling framework, as follows:

\begin{matrix} Y_{i} | (γ_{i}, w_{i}, Z_{i g} = 1) & \sim N_{n \times p} (M_{g} + γ_{i} Λ_{g}, w_{i}^{- 1} Σ_{g}, Ψ_{g}), \\ γ_{i} | (w_{i}, Z_{i g} = 1) & \sim T N (0, w_{i}^{- 1}) I_{(0, \infty)}, \\ W_{i} | Z_{i g} = 1 & \sim Γ (ν_{g} / 2, ν_{g} / 2), \\ Z_{i} & \sim M (1; π_{1}, \dots, π_{G}) . \end{matrix}

(25)

It then follows from the hierarchical structure in (25) that, given the observed data

Y = (Y_{1}, \dots, Y_{N})

and the latent data

w = (w_{1}, \dots, w_{N})

,

γ = (γ_{1}, \dots, γ_{N})

and

Z = (Z_{1}, \dots, Z_{N})

, excluding additive constants, the complete data log likelihood function of

Θ

based on the complete data

Y_{c} = (Y, w, γ, Z)

is

\begin{matrix} ℓ_{c} (Θ ∣ Y_{c}) & = \frac{1}{2} \sum_{i = 1}^{N} \sum_{g = 1}^{G} Z_{i g} {2 log π_{g} + ν_{g} log (\frac{ν_{g}}{2}) - 2 log Γ (\frac{ν_{g}}{2}) - p log | Σ_{g} | \\ - n log | Ψ_{g} | + 2 η (Y_{i}; θ_{g}) γ_{i} w_{i} - (ρ (Σ_{g}, Ψ_{g}, Λ_{g}) + 1) γ_{i}^{2} w_{i} \\ - (δ (Y_{i}; M_{g}, Σ_{g}, Ψ_{g}) + ν_{g}) w_{i} + (ν_{g} + n p - 1) log w_{i}} . \end{matrix}

(26)

In the E step, evaluating the conditional expectation of (26), given the current parameter estimate

{\hat{Θ}}^{(k)}

, requires some conditional expectations, including

\begin{matrix} {\hat{z}}_{i g}^{(k)} & = E (Z_{i g} | Y_{i}, {\hat{Θ}}^{(k)}) = \frac{{\hat{π}}_{g}^{(k)} f_{M V S T} (Y_{i}; {\hat{θ}}_{g}^{(k)})}{f (Y_{i}; {\hat{Θ}}^{(k)})}, \\ {\hat{w}}_{i g}^{(k)} & = E (W_{i} ∣ Y_{i}, Z_{i g} = 1, {\hat{Θ}}^{(k)}) = E (W_{i} ∣ Y_{i}, {\hat{θ}}_{g}^{(k)}), \\ {\hat{κ}}_{1 i g}^{(k)} & = E (γ_{i} W_{i} ∣ Y_{i}, Z_{i g} = 1, {\hat{Θ}}^{(k)}) = E (γ_{i} W_{i} ∣ Y_{i}, {\hat{θ}}_{g}^{(k)}), \\ {\hat{κ}}_{2 i g}^{(k)} & = E (γ_{i}^{2} W_{i} ∣ Y_{i}, Z_{i g} = 1, {\hat{Θ}}^{(k)}) = E (γ_{i}^{2} W_{i} ∣ Y_{i}, {\hat{θ}}_{g}^{(k)}) \end{matrix}

(27)

and

{\hat{κ}}_{3 i g}^{(k)} = E (log W_{i} ∣ Y_{i}, Z_{i g} = 1, {\hat{Θ}}^{(k)}) = E (log W_{i} ∣ Y_{i}, {\hat{θ}}_{g}^{(k)})

for which we utilize the CML step, as mentioned in the preceding section. Consequently, the conditional expectation of the complete data log likelihood is obtained as

\begin{matrix} Q (Θ ∣ {\hat{Θ}}^{(k)}) & = \frac{1}{2} \sum_{i = 1}^{N} \sum_{g = 1}^{G} {\hat{z}}_{i g}^{(k)} {2 log π_{g} + ν_{g} log (\frac{ν_{g}}{2}) - 2 log Γ (\frac{ν_{g}}{2}) - p log | Σ_{g} | \\ - n log | Ψ_{g} | + 2 η (Y_{i}; θ_{g}) {\hat{κ}}_{1 i g}^{(k)} - (ρ (Σ_{g}, Ψ_{g}, Λ_{g}) + 1) {\hat{κ}}_{2 i g}^{(k)} \\ - (δ (Y_{i}; M_{g}, Σ_{g}, Ψ_{g}) + ν_{g}) {\hat{w}}_{i g}^{(k)} + (ν_{g} + n p - 1) {\hat{κ}}_{3 i g}^{(k)}} . \end{matrix}

(28)

Thus, the implementation of the ECME algorithm proceeds as follows:

E step: Given $Θ$ $= {\hat{Θ}}^{(k)}$ , compute ${\hat{z}}_{i g}^{(k)}$ , ${\hat{w}}_{i g}^{(k)}$ , ${\hat{κ}}_{1 i g}^{(k)}$ and ${\hat{κ}}_{2 i g}^{(k)}$ given in (27), for $i = 1, \dots, N$ and $g = 1, \dots, G$ ;
CM step 1: Calculate

${\hat{π}}_{g}^{(k + 1)} = \frac{1}{N} \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)};$
CM step 2: Update ${\hat{M}}_{g}^{(k)}$ as

$\begin{matrix} {\hat{M}}_{g}^{(k + 1)} = \frac{\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{w}}_{i g}^{(k)} Y_{i} - {\hat{Λ}}_{g}^{(k)} \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{1 i g}^{(k)}}{\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{w}}_{i g}^{(k)}}; \end{matrix}$
CM step 3: Update ${\hat{Σ}}_{g}^{(k)}$ as

$\begin{matrix} {\hat{Σ}}_{g}^{(k + 1)} & = \frac{1}{p \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)}} {\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{w}}_{i g}^{(k)} (Y_{i} - {\hat{M}}_{g}^{(k + 1)}) {\hat{Ψ}}_{g}^{- 1 (k)} {(Y_{i} - {\hat{M}}_{g}^{(k + 1)})}^{⊤} \\ + {\hat{Λ}}_{g}^{(k)} {\hat{Ψ}}_{g}^{- 1 (k)} {\hat{Λ}}_{g}^{⊤ (k)} \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{2 i g}^{(k)} - \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{1 i g}^{(k)} (Y_{i} - {\hat{M}}_{g}^{(k + 1)}) {\hat{Ψ}}_{g}^{- 1 (k)} {\hat{Λ}}_{g}^{⊤ (k)} \\ - {\hat{Λ}}_{g}^{(k)} {\hat{Ψ}}_{g}^{- 1 (k)} \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{1 i g}^{(k)} {(Y_{i} - {\hat{M}}_{g}^{(k + 1)})}^{⊤}}; \end{matrix}$
CM step 4: Update ${\hat{Ψ}}_{g}^{(k)}$ as

$\begin{matrix} {\hat{Ψ}}_{g}^{(k + 1)} & = \frac{1}{n \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)}} {\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{w}}_{i g}^{(k)} {(Y_{i} - {\hat{M}}_{g}^{(k + 1)})}^{⊤} {\hat{Σ}}_{g}^{- 1 (k + 1)} (Y_{i} - {\hat{M}}_{g}^{(k + 1)}) \\ + {\hat{Λ}}_{g}^{⊤ (k)} {\hat{Σ}}_{g}^{- 1 (k + 1)} {\hat{Λ}}_{g}^{(k)} \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{2 i g}^{(k)} - \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{1 i g}^{(k)} {(Y_{i} - {\hat{M}}_{g}^{(k + 1)})}^{⊤} {\hat{Σ}}_{g}^{- 1 (k + 1)} {\hat{Λ}}_{g}^{(k)} \\ - {\hat{Λ}}_{g}^{⊤ (k)} {\hat{Σ}}_{g}^{- 1 (k + 1)} \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{1 i g}^{(k)} (Y_{i} - {\hat{M}}_{g}^{(k + 1)})}; \end{matrix}$
CM step 5: Update ${\hat{Λ}}_{g}^{(k)}$ as

$\begin{matrix} {\hat{Λ}}_{g}^{(k + 1)} = \frac{\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{1 i g}^{(k)} (Y_{i} - {\hat{M}}_{g}^{(k + 1)})}{\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(k)} {\hat{κ}}_{2 i g}^{(k)}}; \end{matrix}$
CML step: Update ${\hat{ν}}^{(k)} = ({\hat{ν}}_{1}, \dots, {\hat{ν}}_{G})$ by optimizing the constrained log likelihood function as

${\hat{ν}}^{(k + 1)} = arg max_{ν} \sum_{i = 1}^{N} log (\sum_{g = 1}^{G} {\hat{π}}_{g}^{(k + 1)} f_{M V S T} (Y_{i}; {\hat{M}}_{g}^{(k + 1)}, {\hat{Σ}}_{g}^{(k + 1)}, {\hat{Ψ}}_{g}^{(k + 1)}, {\hat{Λ}}_{g}^{(k + 1)}, ν_{g})) .$
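The membership probabilities in (27) and the mixing-proportion update in CM step 1 can be sketched as follows. This is an illustrative Python fragment: the function names are hypothetical, and the component log densities log f_MVST(Y_i; θ_g) are assumed to have been computed beforehand and stored in an N × G array. Working on the log scale with `logsumexp` is a numerical safeguard chosen for this sketch, not a step stated in the algorithm.

```python
import numpy as np
from scipy.special import logsumexp


def responsibilities(log_dens, pi):
    """Posterior membership probabilities z_hat[i, g] and the observed log likelihood.

    log_dens[i, g] holds log f_MVST(Y_i; theta_g); pi holds the mixing proportions.
    """
    log_num = log_dens + np.log(pi)                        # log(pi_g f_g(Y_i))
    log_mix = logsumexp(log_num, axis=1, keepdims=True)    # log f(Y_i; Theta)
    z_hat = np.exp(log_num - log_mix)
    return z_hat, float(log_mix.sum())


def update_pi(z_hat):
    """CM step 1: average the membership probabilities over observations."""
    return z_hat.mean(axis=0)
```

The second returned value is the observed log likelihood at the current parameters, which can be monitored across iterations to assess convergence.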

4.2. Initialization

In order to speed up the convergence process, it is important to establish a set of reasonable starting values. To start the ECME algorithm for fitting the FM-MVST model, an intuitive scheme to partition data into G components

{Y_{g}^{(0)}}_{g = 1}^{G}

is to create an initial partition of data

{Vec (Y_{i})}_{i = 1}^{N}

, using the K-means algorithm [21,22]. This yields a valid estimate of

{\hat{z}}_{i g}^{(0)}

, which, in turn, yields

{\hat{π}}_{g}^{(0)} = \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(0)} / N

. Then, we compute the sample mean, covariance matrix of rows and covariance matrix of columns of

Y_{g}^{(0)}

as good initial estimates for

{\hat{M}}_{g}^{(0)}

,

{\hat{Σ}}_{g}^{(0)}

and

{\hat{Ψ}}_{g}^{(0)}

, as follows:

\begin{matrix} {\hat{M}}_{g}^{(0)} & = \frac{\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(0)} Y_{i}}{\sum_{i = 1}^{N} {\hat{z}}_{i g}^{(0)}}, \\ {\hat{Σ}}_{g}^{(0)} & = \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{p} {\hat{z}}_{i g}^{(0)} (y_{i j} - m_{g j}^{(0)}) {(y_{i j} - m_{g j}^{(0)})}^{⊤}}{p \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(0)}}, \\ {\hat{Ψ}}_{g}^{(0)} & = \frac{\sum_{i = 1}^{N} \sum_{r = 1}^{n} {\hat{z}}_{i g}^{(0)} {(y_{i . r} - m_{g . r}^{(0)})}^{⊤} (y_{i . r} - m_{g . r}^{(0)})}{n \sum_{i = 1}^{N} {\hat{z}}_{i g}^{(0)}}, \end{matrix}

where

m_{g j}^{(0)}

and

m_{g . r}^{(0)}

denote the j-th column and r-th row of

{\hat{M}}_{g}^{(0)}

, respectively, and

y_{i j}

and

y_{i . r}

are the j-th column and r-th row of

Y_{i}

, respectively. The initial component skewness matrices,

Λ_{g}^{(0)}

, are filled with values selected at random from the interval

(- 1, 1)

. Finally, we initialize

{\hat{ν}}_{g}^{(0)}

by setting it to a small value, such as 5 or 10.
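The initialization described in this subsection can be sketched in Python as follows. The sketch assumes that hard cluster labels (coded 0, ..., G − 1) are already available, for example from K-means applied to the vectorized observations, and that every cluster is non-empty. The function name `init_from_labels` is hypothetical, and drawing the entries of the skewness matrices from a uniform distribution is an assumption, since only the interval (−1, 1) is specified above.

```python
import numpy as np


def init_from_labels(Y, labels, G, nu0=5.0, rng=None):
    """Initial FM-MVST parameter values from a hard partition. Y has shape (N, n, p)."""
    rng = np.random.default_rng(rng)
    N, n, p = Y.shape
    labels = np.asarray(labels)
    init = []
    for g in range(G):
        Yg = Y[labels == g]
        Ng = Yg.shape[0]
        M = Yg.mean(axis=0)                                   # sample mean matrix
        R = Yg - M
        Sigma = np.einsum('ijk,ilk->jl', R, R) / (p * Ng)     # sum_i R_i R_i^T / (p N_g)
        Psi = np.einsum('ikj,ikl->jl', R, R) / (n * Ng)       # sum_i R_i^T R_i / (n N_g)
        Lam = rng.uniform(-1.0, 1.0, size=(n, p))             # assumed uniform draw
        init.append({'pi': Ng / N, 'M': M, 'Sigma': Sigma,
                     'Psi': Psi, 'Lam': Lam, 'nu': nu0})
    return init
```

Summing the outer products of the columns (or rows) of each centered matrix is the same as forming `R @ R.T` (or `R.T @ R`), which is what the two `einsum` calls accumulate over the cluster.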

4.3. Identifiability

Model identifiability is key to securing unique and consistent estimates of model parameters. With regard to the mixtures of MVST distributions, the estimates of

Σ_{g}

and

Ψ_{g}

are only unique up to a strictly positive constant. To resolve this issue, a constraint needs to be placed, such as setting the trace of

Σ_{g}

equal to n [13] or fixing

| Σ_{g} | = 1

[23]. Herein, we set the first diagonal element of

Σ_{g}

as 1 [8]. This scaling procedure can be implemented at each iteration or at convergence, and either method has minimal impact on the final estimates and classifications achieved. To obtain the final parameter estimates, the resulting

Σ_{g}

is divided by the first diagonal element of

Σ_{g}

, and then

Ψ_{g}

is multiplied by the first diagonal element of

Σ_{g}

.
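The rescaling described above leaves the Kronecker product Ψ_g ⊗ Σ_g unchanged, which is why it does not alter the fitted density. A minimal Python sketch, with the hypothetical function name `rescale_scales`, is:

```python
import numpy as np


def rescale_scales(Sigma, Psi):
    """Impose the constraint that the first diagonal element of Sigma equals 1."""
    c = Sigma[0, 0]
    return Sigma / c, Psi * c


Sigma = np.array([[2.0, 0.4], [0.4, 1.0]])
Psi = np.array([[1.5, 0.2], [0.2, 0.5]])
Sigma_c, Psi_c = rescale_scales(Sigma, Psi)
# The product structure is preserved: Psi_c (x) Sigma_c equals Psi (x) Sigma
same_kron = np.allclose(np.kron(Psi_c, Sigma_c), np.kron(Psi, Sigma))
```

The matrices `Sigma` and `Psi` in this example are arbitrary illustrative values.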

5. Empirical Study

5.1. Finite-Sample Properties of ML Estimators

Here, we conducted a simulation study for examining the accuracy of the parameter estimates obtained by using the proposed ECME algorithm in Section 3.2. We generated 500 Monte Carlo samples of sizes

N = 250

, 500, 1000 and 2000 from the two-component FM-MVST model, in the two scenarios (low- and moderate-dimensional) described in Appendix A. Scenario I was characterized by matrices of size

3 \times 4

, that is, 3 rows and 4 columns (12 elements per matrix). Scenario II, on the other hand, generated matrices of size

10 \times 2

, that is, 10 rows and 2 columns (20 elements per matrix). The larger number of elements allowed us to examine how accurately the proposed algorithm recovers the true parameters in a moderate-dimensional setting.

The accuracy of the obtained parameter estimates was assessed by the average of the root mean squared error (RMSE) of the elements of each estimated parameter. The results shown in Table 1 indicate the good performance of the proposed estimation method. Regardless of the considered scenario, it can be seen that the RMSE values all tended to zero with increasing sample size, indicating the satisfactory asymptotic properties of the ML estimates obtained by the proposed ECME algorithm.
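One way to read the accuracy measure described above is sketched below in Python: an RMSE is computed for each element of a parameter across the Monte Carlo replications, and these elementwise values are then averaged. The function name `average_rmse` and the array layout are assumptions made for this illustration.

```python
import numpy as np


def average_rmse(estimates, truth):
    """Average over elements of the per-element RMSE across replications.

    estimates: array of shape (replications, ...); truth: array with the trailing shape.
    """
    est = np.asarray(estimates, dtype=float)
    err2 = (est - np.asarray(truth, dtype=float)) ** 2
    rmse_per_element = np.sqrt(err2.mean(axis=0))   # RMSE of each element
    return float(rmse_per_element.mean())           # average over the elements
```

For a 3 × 4 location matrix estimated in 500 replications, `estimates` would have shape `(500, 3, 4)` and `truth` shape `(3, 4)`.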

Table 1. Average RMSE based on 500 replications for the evaluation of ML estimates.

5.2. Comparison of Classification Accuracy

To examine the classification accuracy of the FM-MVST model, we generated 1000 samples from each of the scenarios given in Appendix A. In each scenario, we compared the FM-MVST model described in Section 4 with finite mixtures of matrix variate normal (FM-MVN) and matrix variate t (FM-MVT) distributions, which are readily available in the R package MixMatrix. We also implemented the EM algorithm described in [8], to fit finite mixtures of MVSTIG (FM-MVSTIG) distributions. Additionally, finite mixtures of RMVSN (FM-RMVSN) distributions were fitted as a sub-model of the FM-MVST model.

Model performance was assessed by comparing the classification accuracy and model selection criteria for all the fitted models. For classification accuracy, we report the adjusted Rand index (ARI; [24]), which takes the value 1 when the two partitions match perfectly, and the misclassification rate (MCR) of the maximum a posteriori (MAP) clustering for each model. The ARI measures the similarity between two partitions of the data, which can provide insights into the robustness of clustering results. By contrast, the MCR is the proportion of misclassified instances, thus offering a more direct assessment of predictive performance. Furthermore, the Bayesian information criterion (BIC; [25]) value was also reported as a model selection criterion. The BIC incorporates both the goodness-of-fit and the complexity of the model, penalizing overfitting, which can be particularly relevant when evaluating clustering algorithms.
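
The two accuracy measures can be sketched in a few lines of Python. The ARI below follows the pair-counting formula of [24]. For the MCR, we assume that cluster labels are matched to the true classes by the relabeling that gives the fewest errors; this is a common convention, but it is an assumption here, as are the function names.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index for two partitions of the same n >= 2 items."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # contingency table
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. two trivial partitions
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def misclassification_rate(true_labels, pred_labels):
    """Smallest error rate over all relabelings of the predicted clusters."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    if len(pred_ids) > len(true_ids):
        raise ValueError("more predicted clusters than true classes")
    best_correct = 0
    # Brute force over label matchings; fine for a small number of groups.
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(true_labels)


# Made-up labels: the clusters are recovered up to one error and a label swap.
true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
print(adjusted_rand_index(true, pred), misclassification_rate(true, pred))
```

The brute-force matching is adequate for the small values of G considered here; for many groups, an assignment solver would be the usual replacement.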

We ran 100 simulations for each scenario and computed the classification accuracy and model selection criteria for each simulation. Table 2 presents the average BIC, ARI and MCR values along with their standard errors (Std), and the results are illustrated via the box plots shown in Figure 1. As one would expect, the model selection criteria selected the true model from which the data were generated. This outcome highlights the effectiveness of the selected metrics in distinguishing between models based on their ability to capture the underlying data structure. The consistency observed across the simulations further strengthens the case for the reliability of these model selection criteria in practical applications.

Table 2. Simulation results, based on 100 replications, for performance comparison of five mixture models in two scenarios.

Figure 1. Box plots of BIC, ARI and MCR values for the competing models in two scenarios: (a) Scenario I and (b) Scenario II.

6. Real Data Analysis

In this section, we illustrate the results of applying the proposed methodology to three well-known real datasets.

6.1. Landsat Data

The first application concerned the Landsat data (LSD), originally obtained by NASA and available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml, accessed on 1 September 2024). Multi-spectral satellite imagery allows for multiple observations over a spatial grid, resulting in matrix-valued observations. Each line of the LSD contains four spectral values for each of the nine pixels in a neighborhood of a satellite image, so that each line corresponds to a $4 \times 9$ observation matrix. Additionally, every observation matrix in the LSD is classified into one of six distinct categories: red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble and very damp gray soil. For our analysis, we concentrated on three specific categories: red soil, gray soil and soil with vegetation stubble, which had sizes of 461, 397 and 237, respectively.
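
As an illustration of how one line of the data becomes a matrix observation, the Python sketch below arranges 36 spectral values into a $4 \times 9$ matrix (bands by pixels). It assumes that the values are stored pixel by pixel, with the four band values of the first pixel followed by those of the second pixel, and so on; this ordering, the function name and the placeholder input are assumptions and should be checked against the dataset documentation.

```python
def line_to_matrix(values, n_bands=4, n_pixels=9):
    """Arrange one line of spectral values into an n_bands x n_pixels matrix.

    Assumes a pixel-by-pixel ordering: the band values of pixel 1 come first,
    then the band values of pixel 2, and so on.
    """
    if len(values) != n_bands * n_pixels:
        raise ValueError("expected %d values" % (n_bands * n_pixels))
    # Row b collects band b across all pixels of the neighborhood.
    return [[values[p * n_bands + b] for p in range(n_pixels)]
            for b in range(n_bands)]


row = list(range(36))  # placeholder values, not real Landsat measurements
Y = line_to_matrix(row)
print(len(Y), len(Y[0]))
```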

Table 3 presents a summary of the ML fitting results, including the maximized log likelihood values, BIC, ARI and MCR of the five fitted models. The results reveal that the log likelihood value for the FM-MVN distribution was lower than that for the FM-MVT distribution, indicating a poorer fit. In contrast, the skewed distributions (FM-MVST, FM-MVSTIG and FM-RMVSN) outperformed their symmetric counterparts. Particularly noteworthy was the superior performance of the FM-MVST model. The estimated tailedness parameters were $\hat{ν}_{1} = 0.47$, $\hat{ν}_{2} = 0.44$ and $\hat{ν}_{3} = 0.58$, indicating that the matrix observations exhibit heavy-tailed behavior.
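
The relation between the log likelihood and the BIC values in Table 3 can be checked with a short Python sketch. It assumes the usual smaller-is-better form of [25], namely BIC = −2ℓ + m log N, where ℓ is the maximized log likelihood, m is the number of free parameters and N is the sample size. The section does not state this form explicitly, so it is an assumption, as are the function names.

```python
import math


def bic(loglik, n_params, n_obs):
    """BIC in the smaller-is-better form: -2 * loglik + n_params * log(n_obs)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def implied_n_params(loglik, bic_value, n_obs):
    """Number of free parameters implied by a reported (loglik, BIC) pair."""
    return (bic_value + 2.0 * loglik) / math.log(n_obs)


# Sizes of the three Landsat classes used in the analysis.
n_obs = 461 + 397 + 237

# (log likelihood, BIC) pairs as reported in Table 3.
table3 = {
    "FM-MVN": (-114954.90, 231799.40),
    "FM-MVST": (-110836.60, 224374.60),
}
for name, (ll, b) in table3.items():
    print(name, round(implied_n_params(ll, b, n_obs), 2))
```

Under this assumed form, the reported pairs correspond to about 270 free parameters for FM-MVN and about 386 for FM-MVST, so the better BIC of FM-MVST is obtained despite a larger complexity penalty.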

Table 3. Summary results from fitting various models to the LSD data.

6.2. Apes Data

The second application considered the apes dataset included in the shapes R package [26]. The description of the dataset, taken from [27], is as follows. In an investigation to assess the cranial differences between the sexes of apes, 29 male and 30 female adult gorillas (Gorilla), 28 male and 26 female adult chimpanzees (Pan) and 30 male and 24 female adult orangutans (Pongo) were studied. Eight landmarks were chosen in the midline plane of each skull. These landmarks were anatomical landmarks and were located by an expert biologist. The dataset was stored as a list with two components: an array of the coordinates of the eight landmarks in two dimensions for each skull (an $8 \times 2$ observation matrix, with $N = 167$) and a vector of group labels ($G = 6$).

All the competing models were fitted for $G = 6$, and their fitting results are reported in Table 4. It is clear that the FM-MVN model provided the worst fitting performance, whereas FM-MVST was the best model. As in the previous section, this may indicate that the components of the FM-MVN model were not sufficiently skewed or heavy-tailed to model the data adequately. On a related note, the estimated tailedness parameters were $\hat{ν}_{1} = 0.45$, $\hat{ν}_{2} = 2.20$, $\hat{ν}_{3} = 1.61$, $\hat{ν}_{4} = 2.80$, $\hat{ν}_{5} = 0.52$ and $\hat{ν}_{6} = 1.61$, highlighting the presence of clusters with heavy tails.

Table 4. Summary results from fitting various models to the apes data.

6.3. Melanoma Data

The performance of the FM-MVST model in skin cancer detection was demonstrated in the third and final application. The objective of the skin cancer detection project was to develop a framework for analyzing and assessing the risk of melanoma, using dermatological photographs taken with a standard consumer-grade camera. Segmentation of the lesion is a crucial step for developing a skin cancer detection framework. The objective, then, was to find the border of the skin lesion. It was important that this step was performed accurately, because many features used to assess the risk of melanoma are derived based on the lesion border. The set of images included images extracted from the public databases DermIS and DermQuest, along with manual segmentations (ground truth) of the lesions, available at https://uwaterloo.ca/vision-image-processing-lab/research-demos/skin-cancer-detection, accessed on 1 September 2024.

A skin image of $100 \times 70$ pixels is displayed in Figure 2a. The next objective was to segment the image into two labels. For every pixel, we considered the three numerical RGB components, denoting the red, green and blue intensities, together with a grayscale intensity, giving $y_{i} \in {[0, 255]}^{4}$, which could be transformed into ${[0, 1]}^{4}$. Upon considering each pixel as a $2 \times 2$ matrix, the pixels were then grouped into $G = 2$ clusters, where every cluster was assumed to have a different distribution.
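
The construction of the matrix observations from an image can be sketched in Python as follows. The text does not specify how the grayscale intensity is computed or how the four values are placed in the $2 \times 2$ matrix, so the luma weights and the layout [[R, G], [B, gray]] used below are assumptions for illustration only, as are the function names.

```python
def pixel_to_matrix(r, g, b):
    """Map an 8-bit RGB pixel to a 2 x 2 matrix with entries in [0, 1]."""
    # Grayscale from the common BT.601 luma weights (an assumption here).
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    # Rescale from [0, 255] to [0, 1]; clamp to guard against rounding error.
    scaled = [min(max(v / 255.0, 0.0), 1.0) for v in (r, g, b, gray)]
    # Hypothetical layout: [[R, G], [B, gray]].
    return [scaled[0:2], scaled[2:4]]


def image_to_observations(image):
    """Flatten an image (rows of (r, g, b) tuples) into a list of 2 x 2 matrices."""
    return [pixel_to_matrix(*px) for row in image for px in row]


# Tiny made-up 2 x 2 image; a 100 x 70 image would give 7000 observations.
tiny = [[(255, 0, 0), (0, 0, 0)],
        [(255, 255, 255), (10, 20, 30)]]
obs = image_to_observations(tiny)
print(len(obs), obs[0])
```

Each element of `obs` is one matrix observation, and the fitted two-component mixture then assigns every pixel to the lesion or the background cluster.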

Figure 2. Segmentation of lesion: (a) original, (b) ground truth, and (c–g) segmented images obtained using different models.

It follows from Table 5 that the FM-MVST model provided the best fit, in terms of BIC, as well as the lowest misclassification error for the binary classification of each pixel. The estimates of the tailedness parameters were $\hat{ν}_{1} = 1.96$ and $\hat{ν}_{2} = 0.92$, signifying the appropriateness of the use of heavy-tailed t distributions. Furthermore, the superiority of the FM-MVST model is reflected visually in Figure 2c–g, which depict the comparative segmentation performance of the fitted models in grayscale. The figures illustrate differences in identifying the lesion area, and they indicate that the proposed model exhibits a clearer boundary and a more consistent region of the lesion.

Table 5. Summary results from fitting various models to the melanoma data.

It is noteworthy that all the model selection criteria applied to the three datasets strongly favored the proposed FM-MVST model. The datasets varied significantly, in terms of dimensional and structural characteristics, which underscores the flexibility and effectiveness of the FM-MVST model. This model demonstrated a superior ability to accurately capture the skewness and leptokurtic features present in the data, outperforming the alternative models. The adaptability of the FM-MVST model across diverse datasets not only showcases its robustness but also reinforces its potential as a valuable tool for data analysis in various applications.

7. Concluding Remarks

We have introduced here a new family of matrix variate distributions that can capture both skewness and heavy-tailedness simultaneously. This MVST model was based on a stochastic representation that facilitated the development of an ECME algorithm for the maximum likelihood estimation of the model parameters. We evaluated the effectiveness and efficiency of the proposed algorithm through two simulation studies. Additionally, we used the proposed approach to analyze three real datasets, demonstrating its capability in modeling asymmetric matrix variate data. Future developments of this approach could include accommodating censored data, determining the optimal number of mixing distributions for the clustering problem and including different distributions for the variables W and U in the stochastic representation. Another interesting extension could involve incorporating the FM-MVST distribution into a mixture-of-regression framework. We are currently looking into these problems, and we hope to report our findings in a future paper.

Author Contributions

Conceptualization, A.J.; Methodology, A.M. and A.J.; Software, A.M.; Investigation, A.J.; Writing—original draft, A.M.; Writing—review & editing, N.B.; Supervision, N.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The parameters used to generate the data in Section 5.1 are given in Table A1. Here, $1_{p}$ denotes the vector of length p with all entries equal to 1, and $I_{p}$ denotes the p-dimensional identity matrix.

Table A1. The parameters used in the generation of data (scenarios I and II).

Scenario	Parameter	Component 1	Component 2
I	$π_{g}$	0.3	0.7
	$M_{g}$	$[\begin{matrix} - 1 & 1 & - 1 & 2 \\ 0 & 2 & - 1 & 0 \\ 0 & 0 & 0 & - 1 \end{matrix}]$	$[\begin{matrix} 0 & 2 & 0 & 1 \\ 0 & 2 & 0 & - 1 \\ 0 & 1 & 1 & - 1 \end{matrix}]$
	$Σ_{g}$	$[\begin{matrix} 1.0 & 0.0 & 0.0 \\ 0.0 & 0.7 & - 0.1 \\ 0.0 & - 0.1 & 1.0 \end{matrix}]$	$[\begin{matrix} 1.0 & 0.1 & 0.2 \\ 0.1 & 0.5 & - 0.5 \\ 0.2 & - 0.5 & 1.4 \end{matrix}]$
	$Ψ_{g}$	$[\begin{matrix} 0.7 & 0.0 & 0.0 & 0.0 \\ 0.0 & 1.0 & - 0.5 & 0.5 \\ 0.0 & - 0.5 & 1.5 & 0.1 \\ 0.0 & 0.5 & 0.1 & 1.0 \end{matrix}]$	$[\begin{matrix} 1.0 & 0.5 & 0.0 & 0.0 \\ 0.5 & 1.0 & 0.5 & 0.5 \\ 0.0 & 0.5 & 1.0 & 0.1 \\ 0.0 & 0.5 & 0.1 & 1.0 \end{matrix}]$
	$Λ_{g}$	$[\begin{matrix} 1 & - 2 & 0 & 1 \\ 1 & - 2 & 0 & 1 \\ 1 & - 2 & 0 & 1 \end{matrix}]$	$[\begin{matrix} 0 & 1 & - 1 & 0 \\ 0 & 1 & - 1 & - 1 \\ 1 & 1 & 0 & - 1 \end{matrix}]$
	$ν_{g}$	3	5
II	$π_{g}$	0.4	0.6
	$M_{g}$	$[\begin{matrix} - 1 & - 1 \\ 0 & 1 \end{matrix}]$ $\otimes 1_{5}$	$[\begin{matrix} 0 & 0 \\ 2 & 1 \end{matrix}]$ $\otimes 1_{5}$
	$Σ_{g}$	$[\begin{matrix} 5.0 & - 0.5 \\ - 0.5 & 1.0 \end{matrix}]$ $\otimes I_{5}$	$[\begin{matrix} 2.0 & 0.1 \\ 0.1 & 0.5 \end{matrix}]$ $\otimes I_{5}$
	$Ψ_{g}$	$[\begin{matrix} 0.5 & 0.0 \\ 0.0 & 0.5 \end{matrix}]$	$[\begin{matrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{matrix}]$
	$Λ_{g}$	$[\begin{matrix} - 2 & 1 \\ - 2 & 1 \end{matrix}]$ $\otimes 1_{5}$	$[\begin{matrix} 1 & 2 \\ 1 & 2 \end{matrix}]$ $\otimes 1_{5}$
	$ν_{g}$	4	4
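
The Kronecker-product notation used for Scenario II can be expanded with a short Python sketch. It assumes that $1_{5}$ is a column vector and that the products are taken in the order written in Table A1; the helper `kron` is written out here for illustration and is not part of any package mentioned in the paper.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


ones5 = [[1] for _ in range(5)]  # column vector 1_5
eye5 = [[1 if i == j else 0 for j in range(5)] for i in range(5)]  # I_5

# Scenario II, component 1 (values from Table A1).
M1 = kron([[-1, -1], [0, 1]], ones5)               # 10 x 2 location matrix
Sigma1 = kron([[5.0, -0.5], [-0.5, 1.0]], eye5)    # 10 x 10 row scale matrix
Lambda1 = kron([[-2, 1], [-2, 1]], ones5)          # 10 x 2 skewness matrix

print(len(M1), len(M1[0]))
print(len(Sigma1), len(Sigma1[0]))
```

The resulting dimensions agree with the $10 \times 2$ observations of Scenario II: $M_{g}$ and $Λ_{g}$ are $10 \times 2$, $Σ_{g}$ is $10 \times 10$ and $Ψ_{g}$ is $2 \times 2$.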

References

1. Kroonenberg, P.M. Applied Multiway Data Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2008.
2. Dickey, J.M. Matricvariate generalizations of the multivariate t distribution and the inverted multivariate t distribution. Ann. Math. Stat. 1967, 38, 511–518.
3. Arellano-Valle, R.B.; Azzalini, A. A formulation for continuous mixtures of multivariate normal distributions. J. Multivar. Anal. 2021, 185, 104780.
4. Liu, C.; Rubin, D.B. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 1994, 81, 633–648.
5. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
6. Rezaei, A.; Yousefzadeh, F.; Arellano-Valle, R.B. Scale and shape mixtures of matrix variate extended skew normal distributions. J. Multivar. Anal. 2020, 179, 104649.
7. Gallaugher, M.P.; McNicholas, P.D. A matrix variate skew-t distribution. Stat 2017, 6, 160–170.
8. Gallaugher, M.P.; McNicholas, P.D. Finite mixtures of skewed matrix variate distributions. Pattern Recognit. 2018, 80, 83–93.
9. Naderi, M.; Bekker, A.; Arashi, M.; Jamalizadeh, A. A theoretical framework for Landsat data modeling based on the matrix variate mean-mixture of normal model. PLoS ONE 2020, 15, e0230773.
10. Chen, J.T.; Gupta, A.K. Matrix variate skew normal distributions. Statistics 2005, 39, 247–253.
11. Domínguez-Molina, J.A.; González-Farías, G.; Ramos-Quiroga, R.; Gupta, A.K. A matrix variate closed skew-normal distribution with applications to stochastic frontier analysis. Commun. Stat. Theory Methods 2007, 36, 1691–1703.
12. Zhang, L.; Bandyopadhyay, D. A graphical model for skewed matrix-variate non-randomly missing data. Biostatistics 2020, 21, e80–e97.
13. Viroli, C. Finite mixtures of matrix normal distributions for classifying three-way data. Stat. Comput. 2011, 21, 511–522.
14. Thompson, G.Z.; Maitra, R.; Meeker, W.Q.; Bastawros, A.F. Classification with the matrix-variate-t distribution. J. Comput. Graph. Stat. 2020, 29, 668–674.
15. Tomarchio, S.D.; Punzo, A.; Bagnato, L. Two new matrix-variate distributions with application in model-based clustering. Comput. Stat. Data Anal. 2020, 152, 107050.
16. Tomarchio, S.D.; Gallaugher, M.P.; Punzo, A.; McNicholas, P.D. Mixtures of matrix-variate contaminated normal distributions. J. Comput. Graph. Stat. 2022, 31, 413–421.
17. Tomarchio, S.D. Matrix-variate normal mean-variance Birnbaum–Saunders distributions and related mixture models. Comput. Stat. 2024, 39, 405–432.
18. Naderi, M.; Tamandi, M.; Mirfarah, E.; Wang, W.L.; Lin, T.I. Three-way data clustering based on the mean-mixture of matrix-variate normal distributions. Comput. Stat. Data Anal. 2024, 199, 108016.
19. Lin, T.I.; Wu, P.H.; McLachlan, G.J.; Lee, S.X. A robust factor analysis model using the restricted skew-t distribution. Test 2015, 24, 510–531.
20. Lee, S.X.; McLachlan, G.J. Finite mixtures of canonical fundamental skew t-distributions: The unification of the restricted and unrestricted skew t-mixture models. Stat. Comput. 2016, 26, 573–589.
21. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Davis, CA, USA, 21 June–18 July 1965; University of California Press: Berkeley, CA, USA, 1967.
22. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
23. Sarkar, S.; Zhu, X.; Melnykov, V.; Ingrassia, S. On parsimonious models for modeling matrix data. Comput. Stat. Data Anal. 2020, 142, 106822.
24. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
25. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
26. Dryden, I.L. shapes Package; Version 1.2.6; Contributed package; R Foundation for Statistical Computing: Vienna, Austria, 2021.
27. Dryden, I.; Mardia, K. Statistical Shape Analysis: With Applications in R; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2016.

Table 1. Average RMSE based on 500 replications for the evaluation of ML estimates.

Scenario	N	$π_{1}$	$M_{1}$	$M_{2}$	$Σ_{1}$	$Σ_{2}$	$Ψ_{1}$	$Ψ_{2}$	$Λ_{1}$	$Λ_{2}$	$ν_{1}$	$ν_{2}$
I	250	0.031	0.169	0.119	0.058	0.045	0.128	0.078	0.174	0.118	1.638	3.629
	500	0.020	0.129	0.087	0.044	0.041	0.093	0.064	0.126	0.088	1.215	3.128
	1000	0.015	0.091	0.059	0.032	0.031	0.066	0.052	0.088	0.056	0.803	2.712
	2000	0.011	0.062	0.044	0.027	0.028	0.051	0.043	0.062	0.044	0.702	2.230
II	250	0.035	0.260	0.193	1.348	0.376	0.785	0.440	0.257	0.187	2.544	2.426
	500	0.022	0.176	0.130	1.257	0.338	0.771	0.425	0.179	0.129	2.297	2.219
	1000	0.014	0.120	0.097	1.206	0.312	0.764	0.423	0.124	0.097	1.962	1.897
	2000	0.011	0.088	0.068	1.184	0.302	0.753	0.401	0.090	0.070	1.460	1.388

Table 2. Simulation results, based on 100 replications, for performance comparison of five mixture models in two scenarios.

Scenario	Model	BIC	Std (BIC)	ARI	Std (ARI)	MCR	Std (MCR)
I	FM-MVN	55,419.78	831.20	0.82	0.20	0.05	0.06
	FM-MVT	45,004.95	787.39	0.90	0.17	0.03	0.05
	FM-RMVSN	40,103.90	962.33	0.95	0.18	0.02	0.05
	FM-MVSTIG	38,215.01	859.47	0.97	0.08	0.01	0.02
	FM-MVST	38,170.98	804.07	0.98	0.05	0.01	0.01
II	FM-MVN	85,450.81	705.80	0.91	0.15	0.07	0.06
	FM-MVT	76,011.44	720.48	0.93	0.14	0.08	0.05
	FM-RMVSN	72,917.04	703.48	0.94	0.12	0.05	0.02
	FM-MVSTIG	69,892.52	694.08	0.95	0.10	0.04	0.02
	FM-MVST	69,839.90	673.31	0.97	0.09	0.02	0.01

Table 3. Summary results from fitting various models to the LSD data.

Model	G	Log Likelihood	BIC	ARI	MCR
FM-MVN	3	−114,954.90	231,799.40	0.67	0.14
FM-MVT	3	−113,169.30	228,228.10	0.69	0.13
FM-RMVSN	3	−111,213.50	225,107.40	0.76	0.09
FM-MVSTIG	3	−110,920.90	224,543.20	0.79	0.07
FM-MVST	3	−110,836.60	224,374.60	0.82	0.06

Table 4. Summary results from fitting various models to the apes data.

Model	G	Log Likelihood	BIC	ARI	MCR
FM-MVN	6	−7773.14	17,204.51	0.51	0.41
FM-MVT	6	−7609.66	16,877.54	0.56	0.32
FM-RMVSN	6	−6158.09	14,522.03	0.60	0.28
FM-MVSTIG	6	−6097.66	14,431.87	0.63	0.27
FM-MVST	6	−5970.42	14,177.41	0.67	0.25

Table 5. Summary results from fitting various models to the melanoma data.

Model	G	Log Likelihood	BIC	ARI	MCR
FM-MVN	2	63,941.64	−127,723.90	0.76	0.14
FM-MVT	2	65,509.39	−130,859.40	0.82	0.13
FM-RMVSN	2	65,601.15	−130,945.50	0.91	0.12
FM-MVSTIG	2	65,788.76	−131,303.10	0.93	0.11
FM-MVST	2	67,241.98	−134,209.50	0.95	0.09


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
