Article

A Bayesian Predictive Discriminant Analysis with Screened Data

Department of Statistics, Dongguk University-Seoul, Pil-Dong 3Ga, Chung-Gu, Seoul 100-715, Korea
Entropy 2015, 17(9), 6481-6502; https://doi.org/10.3390/e17096481
Submission received: 3 April 2015 / Revised: 25 August 2015 / Accepted: 17 September 2015 / Published: 21 September 2015
(This article belongs to the Special Issue Inductive Statistical Methods)

Abstract

In the application of discriminant analysis, a situation sometimes arises where individual measurements are screened by a multidimensional screening scheme. For this situation, a discriminant analysis with screened populations is considered from a Bayesian viewpoint, and an optimal predictive rule for the analysis is proposed. In order to provide a flexible way to incorporate the prior information of the screening mechanism, we propose a hierarchical screened scale mixture of normal (HSSMN) model, which makes provision for flexible modeling of the screened observations. A Markov chain Monte Carlo (MCMC) method, using the Gibbs sampler with Metropolis–Hastings steps embedded within it, is used to perform Bayesian inference on the HSSMN model and to approximate the optimal predictive rule. A simulation study is given to demonstrate the performance of the proposed predictive discrimination procedure.


1. Introduction

The topic of analyzing multivariate screened data has received a great deal of attention over the last few decades. In the standard multivariate problem, the analysis of data generated from a p-dimensional screened random vector x =_d [v | v_0 ∈ C_q] is our issue of interest, where the p × 1 random vector v and the q × 1 random vector v_0 (called the screening vector) are jointly distributed with correlation matrix Corr(v, v_0) ≠ 0. Thus, we observe x only when the unobservable screening vector v_0 belongs to a known subset C_q of its space R^q, such that 0 < P(v_0 ∈ C_q) < 1. That is, x is subject to the screening scheme or hidden truncation (or simply truncation if v = v_0). Model parameters underlying the joint distribution of v and v_0 are then estimated from the screened data (i.e., observations of x) using the conditional density f(x | v_0 ∈ C_q).
The screening of a sample (or sample selection) arises often in practice as a result of controlling the observability of the outcome of interest in a study. For example, consider a dataset consisting of the Otis IQ test scores (the values of x) of the freshmen of a college. These students had been screened in the college admission process, which examines whether their prior school grade point average (GPA) and Scholastic Aptitude Test (SAT) scores (i.e., the screening values denoted by v_0) are satisfactory. The true value of the screening vector v_0 for each student may not be available due to college regulations. The observations available are the IQ values of x, the screened data. For applications to real screened data, one can refer to the analysis of the student aid grants data given by [1] and that of the U.S. labor market data given by [2]. A variety of methods have been suggested for analyzing such screened data. See, e.g., [3,4,5,6] for various distributions for modeling and analyzing screened data; see [7,8] for the estimative classification analysis with screened data; and see, e.g., [1,2,9,10] for the regression analysis with screened response data.
The majority of existing methods rely on the fact that v and v 0 are jointly multivariate normal, and the screened observation vector x is subject to a univariate screening scheme defined by an open interval C q with q = 1 . In many practical situations, however, the screened data are generated from a non-normal joint distribution of v and v 0 , having a multivariate screening scheme defined by a q-dimensional ( q > 1 ) rectangle region C q of v 0 . In this case, a difficulty in applications with the screened data is that the empirical distribution of the screened data is skewed; its parametric model involves a complex density; and hence, standard methods of analysis cannot be used. See [4,6] for the conditional densities, f ( x | v 0 C q ) , useful for fitting the rectangle screened data generated from a non-normal joint distribution of v and v 0 . In this article, we develop yet another multivariate technique applicable for analyzing the rectangle screened data: we are interested in constructing a Bayesian predictive discrimination procedure for the data. More precisely, we consider a Bayesian multivariate technique for sorting, grouping and prediction of multivariate data generated from K rectangle screened populations. In the standard problem, a training sample D = { ( z i , x i ) , i = 1 , , n } is available, where, for each i = 1 , , n , x i is a p × 1 rectangle screened observation vector coming from one of K populations and taking values in R p , and z i is a categorical response variable representing the population membership, so that z i = k implies that the predictor x i belongs to the k-th rectangle screened population (denoted by π k ), k = 1 , , K . Using the training sample D , the goal of the predictive discriminant analysis is to predict population membership of a new screened observation x based on the posterior probability of x belonging to π k . The posterior probability is given by:
p(z = k | D, x) ∝ p(x | D, z = k) p(z = k | D),  k = 1, …, K,  (1)
where z is the population membership of x, p(z = k | D) is the prior probability of π_k updated by the training sample D, and:
p(x | D, z = k) = ∫ p(x | Θ_k) p(Θ_k | D, z = k) dΘ_k,  (2)
where Equation (2), p(x | Θ_k) = p(x | v_0 ∈ C_q, z = k) and p(Θ_k | D, z = k), respectively, denote the predictive density, the population density of x and the posterior density of the parameters Θ_k associated with π_k. One of the first and most widely applied predictive approaches, due to [11], treats the case of unscreened and normally-distributed populations π_k with unknown parameters Θ_k = {μ_k, Σ_k}, namely π_k : N_p(μ_k, Σ_k) for k = 1, …, K. This is called a Bayesian predictive discriminant analysis with normal populations (BPDA_N), in which a multivariate Student t distribution is obtained for Equation (2).
A practical example where the predictive discriminant analysis with rectangle screened populations (the π_k's) is applicable is the discrimination between passed and failed applicants in a college admission process (the second screening process). Consider the case where college admission officers wish to set up an objective criterion (with a predictor vector x) for admitting students for matriculation; however, the admission officers must first ensure that a student with observation x has passed the first screening process. The first screening scheme may be defined by the q-dimensional region C_q of the random vector v_0 (consisting of SAT scores, high-school GPA, and so on), so that only the students who satisfy v_0 ∈ C_q can proceed to the admission process. In this case, we encounter a crucial problem in applying the normal classification of [11]; given the screening scheme v_0 ∈ C_q, the assumption of a multivariate normal population distribution for [x | z = k] =_d [x | π_k], k = 1, 2, …, K, is not valid. The work in [7,12] found that the normal classification shows a lack of robustness to departures from normality of the population distribution, and hence, the performance of the normal classification can be very misleading if used with a continuous, but non-normal or screened normal, input vector x.
Thus, the predictive density in Equation (2) has two specific features to be considered for Bayesian predictive discrimination with the rectangle screened populations, one about the prior distribution of the parameters Θ k and the other about the distributional assumption of the population model with density p ( x | Θ k ) . For the unscreened populations case, there have been a variety of studies that are concerned with the two considerations. See, for example, [11,13,14] for the choice of the prior distributions of Θ k , and see [15,16] for copious references to the literature on the predictive discriminant analysis with non-normal population models. Meanwhile, for deriving Equation (2) of the rectangle screened observation x, we need to develop a population model with density p ( x | Θ k ) that uses the screened sample information in order to maintain consistency with the underlying theory associated with the populations π k generating the screened sample. Then, we propose a Bayesian hierarchical approach to flexibly incorporate the prior knowledge about Θ k with the non-normal sample information, which is the main contribution of this paper to the literature on Bayesian predictive discriminant analysis.
The rest of this paper is organized as follows. Section 2 considers a class of screened scale mixture of normal (SSMN) population models, which well accounts for the screening scheme conducted through a q-dimensional rectangle region C q of an external scale mixture of normal vector, v 0 . Section 3 proposes a hierarchical screened scale mixture of normal (HSSMN) model to derive the predictive density Equation (2) and proposes an optimal rule for Bayesian predictive discriminant analysis (BPDA) with the SSMN populations (abbreviated as B P D A S S M N ). Approximation of the rule is studied in Section 4 by using an MCMC method applied to the HSSMN model. In Section 5, a simulation study is done to check the convergence of the MCMC method and the performance of the B P D A S S M N by making a comparison between the B P D A S S M N and the B P D A N . Finally, concluding remarks are given in Section 6.

2. The SSMN Population Distributions

Assume that the joint distribution of the respective q × 1 and p × 1 vector variables v_0 and v, associated with π_k, is F ∈ F, where:
F = { F : N_s(μ_k*, κ(η)Σ_k*), η ∼ g(η), with κ(η) > 0 and η > 0 },  (3)
k = 1, …, K, s = q + p, η is a mixing variable with pdf g(η), κ(η) is a suitably-chosen weight function, and μ_k* and Σ_k* are partitioned corresponding to the orders of v_0 and v:
v* = (v_0′, v′)′,  μ_k* = (μ_0k′, μ_k′)′,  Σ_k* = [ Σ_0k  Δ_k ; Δ_k′  Σ_k ].  (4)
Notice that F defined by Equation (3) denotes a class of scale mixture of multivariate normal (SMN) distributions (see, e.g., [17,18] for details), equivalently denoted as SMN_s(μ_k*, Σ_k*, κ(η), G) in the remainder of the paper, where G = G(η) denotes the cdf of η.
Given the joint distribution [v* | π_k] ∼ SMN_s(μ_k*, Σ_k*, κ(η), G), the SSMN distribution is defined by the following screening scheme:
[x | π_k] =_d [v | v_0 ∈ C_q(α, β), π_k] ∼ SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G),  (5)
where C_q(α, β) = {v_0 ∈ R^q | α ≤ v_0 ≤ β} is a q-dimensional rectangle screening region in the space of v_0 ∈ R^q. Here, α = (α_1, …, α_q)′, β = (β_1, …, β_q)′, and α_j < β_j for j = 1, …, q. This region contains C_q(α, ∞) and C_q(−∞, β) as special cases.
The pdf of x is given by:
f(x | μ_k*, Σ_k*, π_k) = [ ∫_0^∞ h_p(x | μ_k*, Σ_k*, κ(η)) dG(η) ] / [ ∫_0^∞ Φ̄_q(C_q(α, β); μ_0k, κ(η)Σ_0k) dG(η) ],  x ∈ R^p,  (6)
where:
h_p(x | μ_k*, Σ_k*, κ(η)) = φ_p(x; μ_k, κ(η)Σ_k) Φ̄_q(C_q(α, β); μ_{v_0k|x}, κ(η)Σ_{v_0k|x}),
μ_{v_0k|x} = μ_0k + Δ_k Σ_k^{-1}(x − μ_k) and Σ_{v_0k|x} = Σ_0k − Δ_k Σ_k^{-1} Δ_k′. Here, φ_q(·; μ, Σ) and Φ̄_q(C_q(α, β); μ, Σ), respectively, denote the pdf and the probability of the rectangle region of a random vector w ∼ N_q(μ, Σ); the latter is equivalent to Pr(w ∈ C_q(α, β)).
One particular member of the class of SSMN distributions is the rectangle-screened normal (RSN) distribution defined by Equations (5) and (6) with G(η) degenerate and κ(η) = 1. The work in [4,8] studied properties of the distribution and denoted it as the RSN_p(C_q(α, β); μ_k*, Σ_k*) distribution. Another member of the class is the rectangle-screened p-variate Student t (RSt_p) distribution considered by [8]. Its pdf is given by:
f(x | μ_k*, Σ_k*, π_k) = t_p(x | μ_k, Σ_k, ν) T̄_q(C_q(α, β); μ_{v_0k|x}, Γ_{v_0k|x}, ν + p) / T̄_q(C_q(α, β); μ_0k, Σ_0k, ν),  x ∈ R^p,  (7)
where t_p(· | a, B, c) and T̄_p(C_p; a, B, c) are the respective pdf and probability of a rectangle region C_p of the p-variate Student t distribution with location vector a, scale matrix B and degrees of freedom c, and:
Γ_{v_0k|x} = (ν + p)^{-1} [ν + (x − μ_k)′ Σ_k^{-1} (x − μ_k)] Σ_{v_0k|x}.
Similar to the RSN distribution, the density Equation (7) of [x | π_k] ∼ RSt_p(C_q(α, β); μ_k*, Σ_k*, ν) is obtained by taking κ(η) = 1/η and η ∼ Gamma(ν/2, ν/2), i.e.,
g(η) = [(ν/2)^{ν/2} / Γ(ν/2)] η^{ν/2 − 1} exp(−νη/2),  η > 0.
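To make the screened densities above concrete, the RSN special case of Equation (6) (κ(η) ≡ 1 with degenerate G) can be evaluated with standard multivariate normal routines. The following is a minimal R sketch using the mvtnorm package [25]; the function name drsn and the parameter values are illustrative only, and Δ_k is taken here as the q × p block Cov(v_0, v) of the partition in Equation (4).

library(mvtnorm)

# Illustrative parameters (p = q = 2)
mu.k     <- c(2, -1);       Sigma.k  <- matrix(c(3, 0, 0, 1), 2, 2)
mu.0k    <- c(0, 0);        Sigma.0k <- matrix(c(7, -2, -2, 4), 2, 2)
Delta.k  <- matrix(c(2, 2, 1, 0), 2, 2)         # q x p cross-covariance block Cov(v_0, v)
alpha    <- c(-0.5, -0.5);  beta     <- c(4, 4) # rectangle region C_q(alpha, beta)

# RSN density of Equation (6) with kappa(eta) = 1:
# f(x) = phi_p(x; mu_k, Sigma_k) * Pr(v_0 in C_q | x) / Pr(v_0 in C_q)
drsn <- function(x) {
  mu.cond  <- mu.0k + Delta.k %*% solve(Sigma.k) %*% (x - mu.k)     # mu_{v0k|x}
  Sig.cond <- Sigma.0k - Delta.k %*% solve(Sigma.k) %*% t(Delta.k)  # Sigma_{v0k|x}
  num <- dmvnorm(x, mean = mu.k, sigma = Sigma.k) *
         pmvnorm(lower = alpha, upper = beta, mean = c(mu.cond), sigma = Sig.cond)
  den <- pmvnorm(lower = alpha, upper = beta, mean = mu.0k, sigma = Sigma.0k)
  as.numeric(num / den)
}

drsn(c(1.5, -0.5))   # density of one screened observation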
The stochastic representations of the RSN and R S t p distributions are immediately obtained by applying the following lemma, for which detailed proof can be found in [4].
Lemma 1. Suppose [x | π_k] ∼ SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G). Then, it has the following stochastic representation in a hierarchical fashion:
[x | η, π_k] =_d μ_k + Δ_k′ Σ_0k^{-1} Z_{C_q(a_k, b_k)} + (Σ_k − Δ_k′ Σ_0k^{-1} Δ_k)^{1/2} Z_p,  (8)
η ∼ G(η) with κ(η) > 0, η > 0,  (9)
where Z_p ∼ N_p(0, κ(η)I_p) and Z_{C_q} − μ_0k =_d [Z_q | Z_q ∈ C_q(a_k, b_k)] are conditionally independent and Z_q ∼ N_q(0, κ(η)Σ_0k). Here, a_k = α − μ_0k and b_k = β − μ_0k.
Lemma 1 provides the following: (i) an intrinsic structure of the SSMN population distributions, which reveals a type of departure from the SMN law, because the distribution of [x | π_k] reduces to the SMN distribution if Δ_k = 0 (i.e., Cov(v_0, v | π_k) = 0); (ii) a convenient device for random number generation; (iii) a simple and direct construction of an HSSMN model for the BPDA with the SSMN populations, i.e., [x | π_k] ∼ SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G).

3. The HSSMN Model

3.1. The Hierarchical Model

For a Bayesian predictive discriminant analysis, suppose we have K rectangle screened populations π_k (k = 1, …, K), each specified by the SSMN_p(C_q(α, β); μ_k*, Σ_k*, κ(η), G) distribution. Let D_k = {x_k1, …, x_kn_k} be a training sample obtained from the rectangle screened population π_k, where the parameters (μ_k*, Σ_k*) are unknown. The goal of the predictive discriminant analysis is to assess the relative predictive odds ratio, or posterior probability, that a screened multivariate observation x belongs to one of the K populations π_k. As noted in Equation (6), however, the complex likelihood function of D_k prevents us from choosing reasonable priors for the model parameters and from obtaining the predictive density of x given by Equation (2). These problems are solved if we use the following hierarchical representation of the population models.
According to Lemma 1, we may rewrite the SSMN model for Equations (8) and (9) by a three-level hierarchy given by:
[x_ki | η_ki, f_ki, π_k] =_d μ_k + Λ_k f_ki + ε_ki,  ε_ki ∼ind N_p(0, κ(η_ki)Ψ_k),  i = 1, …, n_k,
f_ki ∼ind N_q(0, κ(η_ki)Σ_0k) I(f_ki ∈ C_q(a_k, b_k)),  κ(η_ki) > 0,
η_ki ∼iid G(η) with η_ki > 0,  (10)
where Λ_k = Δ_k′ Σ_0k^{-1}, Ψ_k = Σ_k − Δ_k′ Σ_0k^{-1} Δ_k, G is the scale mixing distribution of the independent η_ki's, f_ki and ε_ki are independent conditional on η_ki, and N_q(0, κ(η_ki)Σ_0k) I(f_ki ∈ C_q(a_k, b_k)) denotes a N_q(0, κ(η_ki)Σ_0k) distribution truncated to the region f_ki ∈ C_q(a_k, b_k).
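The three-level hierarchy in Equation (10) also gives a direct recipe for simulating screened observations: draw the mixing variable η_ki, draw the truncated latent vector f_ki and then draw x_ki given f_ki. The R fragment below is a minimal sketch of this for the RSt case (κ(η) = 1/η, η ∼ Gamma(ν/2, ν/2)) using the tmvtnorm package [24]; the function name rssmn and all parameter values are illustrative, not those of the paper.

library(tmvtnorm)   # rtmvnorm for the truncated normal draw
library(mvtnorm)    # rmvnorm for the normal error

# Illustrative parameters (p = q = 2); a.k = alpha - mu.0k, b.k = beta - mu.0k
Sigma.0k <- matrix(c(7, -2, -2, 4), 2, 2)
Lambda.k <- matrix(c(0.5, 0.2, 0.3, 0.1), 2, 2)   # p x q loading Lambda_k
Psi.k    <- matrix(c(1.5, -0.3, -0.3, 0.6), 2, 2)
mu.k     <- c(2, -1)
a.k <- c(-0.5, -0.5); b.k <- c(4, 4)
nu  <- 5                                          # degrees of freedom for the RSt case

rssmn <- function(n) {
  x <- matrix(NA, n, 2)
  for (i in 1:n) {
    eta   <- rgamma(1, nu / 2, rate = nu / 2)     # mixing variable, kappa(eta) = 1/eta
    kappa <- 1 / eta
    f <- c(rtmvnorm(1, mean = rep(0, 2), sigma = kappa * Sigma.0k,
                    lower = a.k, upper = b.k))    # f_ki ~ TN_q(0, kappa*Sigma_0k) on C_q(a, b)
    e <- c(rmvnorm(1, sigma = kappa * Psi.k))     # eps_ki ~ N_p(0, kappa*Psi_k)
    x[i, ] <- mu.k + Lambda.k %*% f + e
  }
  x
}

head(rssmn(5))

Setting η_ki ≡ 1 (so that κ = 1) in the same scheme yields draws from the RSN member instead.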
The first stage model in Equation (10) may be written in a compact form by defining the following vector and matrix notations,
X_k = (x_k1 − μ_k, …, x_kn_k − μ_k), F_k = (f_k1, …, f_kn_k), E_k = (ε_k1, …, ε_kn_k), η_k = (η_k1, …, η_kn_k)′.
Then, the three-level hierarchy of the model Equation (10) can be expressed as:
X_k = Λ_k F_k + E_k,  vec(E_k) ∼ N_{pn_k}(0, D(κ(η_k)) ⊗ Ψ_k),
vec(F_k) ∼ N_{qn_k}(0, D(κ(η_k)) ⊗ Σ_0k) I(f_ki ∈ C_q(a_k, b_k)),  Cov(vec(F_k), vec(E_k) | η_k) = O,
η_ki ∼iid g(η),  i = 1, …, n_k,  (11)
where A ⊗ B denotes the Kronecker product of two matrices A and B, vec(F_k) = (f_k1′, …, f_kn_k′)′, vec(E_k) = (ε_k1′, …, ε_kn_k′)′, and D(κ(η_k)) = diag{κ(η_k1), …, κ(η_kn_k)} is an n_k × n_k diagonal matrix of the scale mixing functions. Note that the hierarchical population model Equation (11) adopts a robust discriminant modeling through the use of scale mixtures of normals, such as the SMN and the truncated SMN, and thus, it enables us to avoid the anomalies generated from the non-normal sample information.
The Bayesian analysis of the model in Equation (11) begins with the specification of the prior distributions of the unknown parameters. When prior information is not available, a convenient strategy for avoiding an improper posterior distribution is to use proper priors with their hyperparameters fixed at values that reflect the flatness (or diffuseness) of the priors (i.e., limiting non-informative priors). For convenience, though not always optimally, we suppose that μ_k, μ_0k, (Λ_k, Ψ_k) and Σ_0k of the model in Equation (11) are independent a priori; the prior distributions for μ_k and μ_0k are normal; an inverse Wishart prior distribution is used for Σ_0k; and a generalized natural conjugate family (see [19]) of prior distributions is used for (Λ_k, Ψ_k), so that we adopt a normal prior density for Λ_k conditional on the matrix Ψ_k:
P(Λ_k | Ψ_k) ∝ |Ψ_k|^{−q/2} exp{ −(1/2) tr[Ψ_k^{-1}(Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′] },  Ψ_k ∼ IW_p(R_k, τ_k),
where W ∼ IW_m(V, ν) denotes the inverse Wishart distribution whose pdf IW_m(W; V, ν) is:
IW_m(W; V, ν) ∝ |W|^{−ν/2} exp{ −(1/2) tr(W^{-1}V) },  V > 0.
Note that if Λ_k = (λ_k1, …, λ_kq), λ_k ≡ vec(Λ_k) = (λ_k1′, …, λ_kq′)′ and λ_0k ≡ vec(Λ_0k), then:
tr[Ψ_k^{-1}(Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′] = (λ_k − λ_0k)′ (H_k^{-1} ⊗ Ψ_k)^{-1} (λ_k − λ_0k).
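The displayed identity is the standard relation tr(A′BAC) = vec(A)′(C′ ⊗ B)vec(A) combined with (H_k^{-1} ⊗ Ψ_k)^{-1} = H_k ⊗ Ψ_k^{-1}; it is what allows the matrix normal prior on Λ_k to be read as an ordinary pq-variate normal prior on λ_k. A quick numerical check in R, with arbitrary symmetric positive definite H and Ψ (all objects below are illustrative), is:

set.seed(1)
p <- 3; q <- 2
Psi     <- crossprod(matrix(rnorm(p * p), p))     # p x p symmetric positive definite
H       <- crossprod(matrix(rnorm(q * q), q))     # q x q symmetric positive definite
Lambda  <- matrix(rnorm(p * q), p, q)
Lambda0 <- matrix(rnorm(p * q), p, q)

lhs <- sum(diag(solve(Psi) %*% (Lambda - Lambda0) %*% H %*% t(Lambda - Lambda0)))
d   <- c(Lambda - Lambda0)                        # vec() = column-major stacking in R
rhs <- t(d) %*% solve(kronecker(solve(H), Psi)) %*% d
all.equal(lhs, c(rhs))                            # TRUE up to numerical error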
This prior elicitation of the parameters, along with the three-level hierarchical model Equation (11), produces a hierarchical screened scale mixture of normal population model, which is referred to as HSSMN( Θ ( k ) ) in the rest of this paper, where Θ ( k ) = { μ k , μ 0 k , Λ k , Ψ k , F k , Σ 0 k , η k } . The HSSMN( Θ ( k ) ) model is defined as follows.
x_ki | F_k, μ_k, Ψ_k, η_k ∼ind N_p(μ_k + Λ_k f_ki, κ(η_ki)Ψ_k),  i = 1, …, n_k,
f_ki | Ψ_k, Σ_0k, μ_0k, η_k ∼ind N_q(0, κ(η_ki)Σ_0k) I(f_ki ∈ C_q(a_k, b_k)),  i = 1, …, n_k,
μ_k ∼ N_p(θ_k, Ω_k),
μ_0k ∼ N_q(θ_0k, Ω_0k),
λ_k | Ψ_k ∼ N_pq(λ_0k, H_k^{-1} ⊗ Ψ_k),
Ψ_k ∼ IW_p(R_k, τ_k),  τ_k > 2p,
Σ_0k ∼ IW_q(Q_k, γ_k),  γ_k > 2q,
η_ki ∼ind g(η),  i = 1, …, n_k,  (12)
where λ_k ≡ vec(Λ_k), λ_0k ≡ vec(Λ_0k), and the hyperparameters (θ_k, θ_0k, Ω_k, Ω_0k, Λ_0k, H_k, R_k, τ_k, Q_k, γ_k) are fixed as appropriate quantities to reflect the flatness of the priors.
The last distributional specification is omitted in the RSN distribution case. For the HSSMN(Θ^(k)) model of the RSt_p distribution, we may set η_ki ∼ind Gamma(ν/2, ν/2) and ν ∼ Gamma(1, 0.1) I(ν > 2), a truncated gamma distribution (see, e.g., [20]). See, for example, [21,22] and the references therein for other choices of the prior distribution of ν.

3.2. Posterior Distributions

Based on the HSSMN(Θ^(k)) model structure with the likelihood and the prior distributions in Equation (12), the joint posterior distribution of Θ^(k) is given by:
p(Θ^(k) | D_k) ∝ ∏_{i=1}^{n_k} |κ(η_ki)Ψ_k|^{−1/2} exp{ −(1/2) tr[Ψ_k^{-1}(X_k − Λ_k F_k) D(κ(η_k))^{-1} (X_k − Λ_k F_k)′] }
 × |Ψ_k|^{−(q+τ_k)/2} exp{ −(1/2) tr[Ψ_k^{-1} G_k] } ∏_{i=1}^{n_k} [ φ_q(f_ki; 0, κ(η_ki)Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki)Σ_0k) ] ∏_{i=1}^{n_k} g(η_ki)
 × IW_q(Σ_0k; Q_k, γ_k) φ_p(μ_k; θ_k, Ω_k) φ_q(μ_0k; θ_0k, Ω_0k),  (13)
where:
∏_{i=1}^{n_k} |κ(η_ki)Ψ_k|^{−1/2} exp{ −(1/2) tr[Ψ_k^{-1}(X_k − Λ_k F_k) D(κ(η_k))^{-1} (X_k − Λ_k F_k)′] } ≡ ∏_{i=1}^{n_k} φ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki)Ψ_k),
G_k = (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ + R_k and the g(η_ki)'s denote the densities of the mixing variables η_ki. Note that the joint posterior in Equation (13) does not simplify to any known distributional form and is, thus, intractable for posterior inference. Instead, we derive the conditional posterior distribution of each of μ_k, μ_0k, λ_k ≡ vec(Λ_k), Σ_0k, F_k, Ψ_k and the η_ki's, which is useful for posterior inference based on Markov chain Monte Carlo (MCMC) methods. All of the full conditional posterior distributions are as follows (see the Appendix for their derivations):
(1) The full conditional distribution of μ k is a p-variate normal given by:
μ_k | Θ^(k) ∖ μ_k, D_k ∼ N_p(μ_μk, Σ_μk),  (14)
where μ_μk = Σ_μk [ Ω_k^{-1} θ_k + Σ_{i=1}^{n_k} Ψ_k^{-1}(x_ki − Λ_k f_ki)/κ(η_ki) ] and Σ_μk = [ Σ_{i=1}^{n_k} Ψ_k^{-1}/κ(η_ki) + Ω_k^{-1} ]^{-1} (an R sketch of this and the other closed-form draws is given after item (7) below).
(2) The full conditional density of μ 0 k is given by:
p(μ_0k | Θ^(k) ∖ μ_0k, D_k) ∝ φ_q(μ_0k; θ_0k, Ω_0k) ∏_{i=1}^{n_k} Φ̄_q(C_q(α, β); μ_0k, κ(η_ki)Σ_0k).  (15)
(3) The full conditional posterior distribution of λ k is given by:
λ_k | Θ^(k) ∖ λ_k, D_k ∼ N_pq(μ_λk, Σ_λk),  (16)
where:
μ_λk = vec(Λ_k*),  Λ_k* = (X_k D(κ(η_k))^{-1} F_k′ + Λ_0k H_k) Q_k^{-1},  Σ_λk = Q_k^{-1} ⊗ Ψ_k,  and  Q_k = F_k D(κ(η_k))^{-1} F_k′ + H_k.
(4) The full conditional posterior distribution of Ψ k is an inverse-Wishart distribution:
Ψ_k | Θ^(k) ∖ Ψ_k, D_k ∼ IW_p(V_k, ν_k),  ν_k > 2p,  (17)
where V_k = (X_k − Λ_k F_k) D(κ(η_k))^{-1} (X_k − Λ_k F_k)′ + (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ + R_k and ν_k = n_k + q + τ_k.
(5) The full conditional posterior distribution of f k i is the q-variate truncated normal given by:
f_ki | Θ^(k) ∖ f_ki, D_k ∼ind N_q(μ_fki, κ(η_ki)Σ_fki) I(f_ki ∈ C_q(a_k, b_k)),  i = 1, …, n_k,  (18)
where μ_fki = Σ_fki Λ_k′ Ψ_k^{-1}(x_ki − μ_k) and Σ_fki = (Σ_0k^{-1} + Λ_k′ Ψ_k^{-1} Λ_k)^{-1}.
(6) The full conditional posterior density of Σ 0 k is given by:
p(Σ_0k | Θ^(k) ∖ Σ_0k, D_k) ∝ IW_q(Σ_0k; Q_k, γ_k) ∏_{i=1}^{n_k} [ φ_q(f_ki; 0, κ(η_ki)Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki)Σ_0k) ].  (19)
(7) The full conditional posterior densities of η k i ’s are given by:
p(η_ki | Θ^(k) ∖ η_ki, D_k) ∝ κ(η_ki)^{−p/2} exp{ −z_ki′ Ψ_k^{-1} z_ki / (2κ(η_ki)) }
 × [ φ_q(f_ki; 0, κ(η_ki)Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki)Σ_0k) ] g(η_ki),  i = 1, …, n_k,  (20)
where z_ki = x_ki − μ_k − Λ_k f_ki, and the η_ki's are conditionally independent.
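For the conditionals in Equations (14), (17) and (18), which have closed forms, the Gibbs draws can be coded directly. The following minimal R sketch assumes that the arguments hold the current MCMC state and the hyperparameters; the inverse-Wishart draw additionally assumes that the paper's IW_p(V, ν) density corresponds to inverting a Wishart variate with ν − p − 1 degrees of freedom and scale V^{-1}, which is our reading of the IW density given above.

library(mvtnorm)    # rmvnorm
library(tmvtnorm)   # rtmvnorm

# Equation (14): mu_k | rest ~ N_p(mu_mu, Sigma_mu)
draw.mu <- function(x, F, kappa, Lambda, Psi, theta, Omega) {
  Psi.inv <- solve(Psi); Omega.inv <- solve(Omega)
  Sigma.mu <- solve(Psi.inv * sum(1 / kappa) + Omega.inv)
  resid <- t(x) - Lambda %*% t(F)                 # p x n_k matrix of (x_ki - Lambda_k f_ki)
  mu.mu <- Sigma.mu %*% (Omega.inv %*% theta + Psi.inv %*% (resid %*% (1 / kappa)))
  c(rmvnorm(1, mean = c(mu.mu), sigma = Sigma.mu))
}

# Equation (17): Psi_k | rest ~ IW_p(V_k, nu_k), drawn by inverting a Wishart variate
draw.psi <- function(V, nu) {
  p <- nrow(V)
  W <- rWishart(1, df = nu - p - 1, Sigma = solve(V))[, , 1]
  solve(W)
}

# Equation (18): f_ki | rest ~ q-variate truncated normal on C_q(a_k, b_k)
draw.f <- function(x.i, kappa.i, mu, Lambda, Psi, Sigma0, a, b) {
  Psi.inv <- solve(Psi)
  Sigma.f <- solve(solve(Sigma0) + t(Lambda) %*% Psi.inv %*% Lambda)
  mu.f    <- Sigma.f %*% t(Lambda) %*% Psi.inv %*% (x.i - mu)
  c(rtmvnorm(1, mean = c(mu.f), sigma = kappa.i * Sigma.f, lower = a, upper = b))
}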
Based on the above full conditional posterior distributions and the stochastic representations of the SSMN in Lemma 1, one can easily obtain Bayes estimates of the k-th SSMN population mean μ π k = E [ x | π k ] and covariance matrix Σ π k = C o v ( x | π k ) , k = 1 , , K . Specifically, the mean and covariance matrix of an observation x belonging to π k : S S M N p ( C q ( α , β ) ; μ k * , Σ k * , κ ( η ) , G ) , which are used for calculating their Bayes estimates via Rao–Blackwellization, are given by:
μ_πk = μ_k + Ω_21k′ Ω_11k^{-1} ξ_k,
Σ_πk = Ω_22k − Ω_21k′ (Ω_11k^{-1} − Ω_11k^{-1} T_k Ω_11k^{-1}) Ω_21k,  (21)
where Ω_21k = κ(η)Δ_k, Ω_11k = κ(η)Σ_0k, Ω_22k = κ(η)Σ_k,
ξ_k = ∫_{C_q(a_k, b_k)} z [ ζ_k (2π)^{q/2} |Ω_11k|^{1/2} ]^{-1} exp{ −z′Ω_11k^{-1}z/2 } dz,
ζ_k = Φ̄_q(C_q(α, β); μ_0k, Ω_11k), T_k = P_k − ξ_k ξ_k′, and:
P_k = ∫_{C_q(a_k, b_k)} z z′ [ ζ_k (2π)^{q/2} |Ω_11k|^{1/2} ]^{-1} exp{ −z′Ω_11k^{-1}z/2 } dz.
These moments in Equation (21) agree with the formulas given by [23] for the mean and covariance matrix of the non-truncated components of a general multivariate truncated distribution. Readers are referred to [24] for the R package tmvtnorm and to [25] for the R package mvtnorm, which implement the calculations of ξ_k and P_k involved in the first and second moments.
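For the RSN case (κ(η) = 1), ξ_k and T_k = P_k − ξ_kξ_k′ are simply the mean and covariance matrix of a N_q(0, Ω_11k) vector truncated to C_q(a_k, b_k), which tmvtnorm::mtmvnorm returns directly. The following minimal R sketch computes μ_πk and Σ_πk under the reconstructed form of Equation (21); the parameter values are illustrative.

library(tmvtnorm)

mu.k     <- c(2, -1)
Sigma.k  <- matrix(c(3, 0, 0, 1), 2, 2)
Sigma.0k <- matrix(c(7, -2, -2, 4), 2, 2)
Delta.k  <- matrix(c(2, 2, 1, 0), 2, 2)     # q x p block Cov(v_0, v), as in Equation (4)
a.k <- c(-0.5, -0.5); b.k <- c(4, 4)        # C_q(a_k, b_k)

Om11 <- Sigma.0k; Om21 <- Delta.k; Om22 <- Sigma.k   # kappa(eta) = 1 for the RSN case
mom  <- mtmvnorm(mean = rep(0, 2), sigma = Om11, lower = a.k, upper = b.k)
xi   <- mom$tmean                           # xi_k
Tk   <- mom$tvar                            # T_k = P_k - xi_k xi_k'

mu.pi  <- mu.k + t(Om21) %*% solve(Om11) %*% xi
Sig.pi <- Om22 - t(Om21) %*% (solve(Om11) - solve(Om11) %*% Tk %*% solve(Om11)) %*% Om21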
When the sampling information, i.e., the observed training samples, is augmented by proper prior information, the anomalies of the maximum likelihood estimate of the SSMN model, investigated by [16], disappear in the HSSMN(Θ^(k)) model. Furthermore, note that the conditional distribution of λ_k in Equation (16) is pq-dimensional; hence, its Gibbs sampling step requires inverting a matrix of order pq, which may incur a substantial computational cost in implementing the MCMC method. For large q, a Gibbs sampler based on the full conditional posterior distributions of the λ_kj, j = 1, …, p, is computationally more convenient than the Gibbs sampler based on λ_k in Equation (16), where λ_k ≡ vec(Λ_k) and Λ_k ≡ (λ_k1, …, λ_kq).
For this purpose, we define the following notation: for j = 1, …, p,
λ̃_k^(j) = (E_j ⊗ I_p) λ_k,  θ̃_k^(j) = (E_j ⊗ I_p) μ_λk,
Ω̃_k^(j) = (E_j ⊗ I_p) Σ_λk (E_j ⊗ I_p)′,  E_j = (e_j, e_1, …, e_{j−1}, e_{j+1}, …, e_q),
where e_j denotes the j-th column of I_q, namely an elementary vector with unity for its j-th element and zeros elsewhere. Furthermore, we consider the following partitions:
λ̃_k^(j) = (λ_kj′, λ_kj*′)′,  θ̃_k^(j) = (θ̃_k^(1j)′, θ̃_k^(2j)′)′,  and  Ω̃_k^(j) = [ Ω̃_k11^(j)  Ω̃_k12^(j) ; Ω̃_k21^(j)  Ω̃_k22^(j) ],
where the orders of λ_kj*, θ̃_k^(1j), Ω̃_k11^(j) and Ω̃_k21^(j) are (p − 1)q × 1, q × 1, q × q and (p − 1)q × q, respectively. Under these partitions, the conditional property of the multivariate normal distribution leads to the full conditional posterior distributions of λ_kj given by:
λ_kj | Θ^(k) ∖ λ_kj, D_k ∼ N_q(μ_λkj, Σ_λkj),  (22)
for j = 1, …, q, where:
μ_λkj = θ̃_k^(1j) + Ω̃_k12^(j) (Ω̃_k22^(j))^{-1} (λ_kj* − θ̃_k^(2j))  and  Σ_λkj = Ω̃_k11^(j) − Ω̃_k12^(j) (Ω̃_k22^(j))^{-1} Ω̃_k21^(j).
When p is large, we may further partition λ_kj into two vectors of smaller dimensions, say λ_kj = (λ_kj(1)′, λ_kj(2)′)′, and then use their full conditional normal distributions in the Gibbs sampler.
Now, the posterior sampling can be implemented by using all of the conditional posterior distributions, Equations (14)–(20). The Gibbs sampler, with Metropolis–Hastings steps embedded within it, may be used to obtain posterior samples of all of the unknown parameters Θ^(k). In the case where the pq-dimensional matrix in Equation (16) is too large to manipulate numerically, the Gibbs sampler can be modified by replacing the full conditional posterior Equation (16) with Equation (22), which is more convenient for computation. The detailed Markov chain Monte Carlo algorithm with Gibbs sampling is discussed in the next subsection.

3.3. Markov Chain Monte Carlo Sampling Scheme

It is not complicated to construct an MCMC sampling scheme working with Θ ( k ) = { μ k , μ 0 k , Λ k , Ψ k , F k , Σ 0 k , η k } , since a routine Gibbs sampler would work to generate posterior samples of ( μ k , Λ k , Ψ k , F k ) based on each of their full conditional posterior distributions obtained in Section 3.2. In the posterior sampling of μ 0 k , Σ 0 k and η k , Metropolis–Hastings within the Gibbs algorithm would be used, since their conditional posterior densities do not have explicit forms of known distributions as in Equation (15), Equation (19) and Equation (20).
Here, for simplicity, we considered the MCMC algorithm based on the HSSMN ( Θ ( k ) ) model with a known screening scheme, in which μ 0 k and Σ 0 k are assumed to be known. The extension to the general HSSMN ( Θ ( k ) ) model with unknown μ 0 k and Σ 0 k can be made without difficulty.
The MCMC algorithm starts with some initial values μ k [ 0 ] , λ k [ 0 ] , Ψ k [ 0 ] , F k [ 0 ] and η k [ 0 ] . The detailed posterior sampling steps are as follows:
  • Step 1: generate μ k by using the full conditional posterior distribution in Equation (14).
  • Step 2: generate λ k by using the full conditional posterior distribution in Equation (16).
  • Step 3: generate the inverse-Wishart random matrix Ψ_k by using the full conditional posterior distribution in Equation (17).
  • Step 4: generate the independent q-variate truncated normal random variables f_ki by using the full conditional posterior distribution in Equation (18).
  • Step 5: given the current values {μ_k, Λ_k, Ψ_k, F_k}, independently generate a candidate η_ki* from the proposal density q(η_ki* | η_ki) = g(η_ki*), as suggested by [26], and use it in a Metropolis–Hastings step. The candidate value is accepted with probability
    α(η_ki, η_ki*) = min{ p(Θ^(k) | η_ki*) / p(Θ^(k) | η_ki), 1 },  i = 1, …, n_k,
    because the target density is proportional to p(Θ^(k) | η_ki) g(η_ki) and p(Θ^(k) | η_ki) is uniformly bounded for η_ki > 0, where:
    p(Θ^(k) | η_ki) = φ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki)Ψ_k) φ_q(f_ki; 0, κ(η_ki)Σ_0k) / Φ̄_q(C_q(a_k, b_k); 0, κ(η_ki)Σ_0k)
    and g(·) is the density of the mixing variable η_ki. Note that η_k = (η_k1, …, η_kn_k). An R sketch of this step is given after this list.
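A minimal R sketch of Step 5 for the RSt case (η_ki ∼ Gamma(ν/2, ν/2), κ(η) = 1/η) is the following. Proposing from the prior, as suggested by [26], makes the acceptance ratio reduce to the ratio p(Θ^(k) | η_ki*)/p(Θ^(k) | η_ki); the objects carrying the current state (x.i, f.i, mu, Lambda, Psi, Sigma0, a, b, nu) are assumed to be available.

library(mvtnorm)

# log p(Theta^(k) | eta) = log phi_p + log phi_q - log Phi.bar_q, as in Step 5
log.lik.eta <- function(eta, x.i, f.i, mu, Lambda, Psi, Sigma0, a, b) {
  kappa <- 1 / eta
  dmvnorm(x.i, mean = c(mu + Lambda %*% f.i), sigma = kappa * Psi, log = TRUE) +
    dmvnorm(f.i, sigma = kappa * Sigma0, log = TRUE) -
    log(pmvnorm(lower = a, upper = b, sigma = kappa * Sigma0))
}

mh.step.eta <- function(eta, nu, ...) {
  eta.star <- rgamma(1, nu / 2, rate = nu / 2)    # proposal = prior g(eta)
  log.r    <- log.lik.eta(eta.star, ...) - log.lik.eta(eta, ...)
  if (log(runif(1)) < log.r) eta.star else eta    # accept with prob min(1, exp(log.r))
}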
When one conducts a posterior inference of the HSSMN ( Θ ( k ) ) model using the samples obtained from the MCMC sampling algorithm, the following points should be noted.
(i)
See, e.g., [18], for the sampling method for η_ki under various mixing distributions g(η_ki) of the SMN family, such as the multivariate t, multivariate logit, multivariate stable and multivariate exponential power models.
(ii)
Suppose the HSSMN(Θ^(k)) model involves unknown μ_0k. Then, as indicated by the full conditional posterior of μ_0k in Equation (15), the complexity of the conditional distribution prevents us from using straightforward Gibbs sampling. Instead, we may use a simple random walk Metropolis algorithm with a normal proposal density q(μ_0k* | μ_0k) = q(|μ_0k* − μ_0k|) to sample from the conditional distribution of μ_0k; that is, given the current point μ_0k, the candidate point is μ_0k* ∼ N_q(μ_0k, D), where the diagonal matrix D should be tuned so that the acceptance rate of the candidate point is around 0.25 (see, e.g., [26]).
(iii)
When the HSSMN( Θ ( k ) ) model involves unknown Σ 0 k : The MCMC sampling algorithm, using the full conditional posterior Equation (19) is not straightforward, because the conditional posterior density is unknown and complex. Instead, we may apply a Metropolized hit-and-run algorithm, described by [27], to sample from the conditional posterior of Σ 0 k .
(iv)
One can easily calculate the posterior estimate of Θ_k = (μ_k*, Σ_k*) from that of Θ^(k), because the re-parameterizing relations are Ψ_k = Σ_k − Δ_k′ Σ_0k^{-1} Δ_k and Λ_k = Δ_k′ Σ_0k^{-1}.

4. The Predictive Classification Rule

Suppose we have K populations π_k, k = 1, …, K, each specified by the HSSMN(Θ^(k)) model. For each of the populations, we have the screened training sample D_k comprised of a set of independent observations {x_ki, i = 1, …, n_k} whose population label is z_ki = k. Let x be assigned to one of the K populations, with prior probability p_k of belonging to π_k and Σ_{k=1}^K p_k = 1. Then, the predictive density of x given D under the HSSMN(Θ^(k)) model is:
p(x | D, z = k) = ∫ p(x | Θ^(k)) p(Θ^(k) | D) dΘ^(k),  k = 1, …, K,  (23)
and the posterior probability that x belongs to π k , i.e., p ( z = k | D , x ) = p ( x π k | D , x ) , is:
p(x ∈ π_k | D, x) = p(x | D, z = k) p(z = k | D) / Σ_{j=1}^K p(x | D, z = j) p(z = j | D),  k = 1, …, K,  (24)
where D = ∪_{k=1}^K D_k, p(x | Θ^(k)) is equal to Equation (6) and p(Θ^(k) | D) is the joint posterior density given in Equation (13). We see from Equation (24) that the total posterior probability of misclassifying x from π_i to π_j, i ≠ j, is defined by:
TPM(j) = Σ_{i=1, i≠j}^K p(x | D, z = i) p(z = i | D) = Σ_{ℓ=1}^K p(x | D, z = ℓ) p(z = ℓ | D) − p(x | D, z = j) p(z = j | D).  (25)
We minimize the misclassification error at this point if we choose j so as to minimize Equation (25); that is, we select the k that gives the maximum posterior probability p(x ∈ π_k | D, x) (see, e.g., Theorem 6.7.1 of [28] (p. 234)). Thus, an optimal Bayesian predictive discrimination rule that minimizes the classification error is to classify x into π_k if x ∈ R_k, where the optimal classification region is given by:
R_k : p(x | D, z = k) p(z = k | D) > p(x | D, z = j) p(z = j | D),  for all j ≠ k;  k = 1, …, K,  (26)
and p(z = k | D) is the posterior probability of population π_k given the dataset D. If the values of the p_k's are assumed known a priori, then p(z = k | D) = p_k.
Since we are unable to obtain an analytic solution of Equation (26), a numerical approach is required. Thus, we used the MCMC method of the previous section to draw samples from the posterior density of the parameters, p ( Θ ( k ) | D ) , to approximate the predictive density, Equation (23), by:
p(x | D, z = k) ≈ (N_k − M)^{-1} Σ_{t=M+1}^{N_k} p(x | Θ_k^(t)),  k = 1, …, K,  (27)
where the Θ_k^(t)'s are posterior samples generated from the MCMC process under the HSSMN(Θ^(k)) model, and M and N_k are the burn-in period and the run length, respectively.
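In practice, Equation (27) is simply a Monte Carlo average of the SSMN density over the retained posterior draws. The fragment below is a schematic R sketch; post.draws is assumed to be a list of retained parameter draws and dssmn(x, theta) a function implementing Equation (6) (for instance, the drsn sketch of Section 2); both names are placeholders.

# Approximate predictive density p(x | D, z = k) by averaging over posterior draws
pred.density <- function(x, post.draws, dssmn) {
  mean(vapply(post.draws, function(theta) dssmn(x, theta), numeric(1)))
}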
If we assume a Dirichlet prior for the p_k's, that is:
(p_1, …, p_{K−1}) ∼ Dirichlet(d_1, …, d_{K−1}; d_K)
(see, e.g., [19] (p. 143) for the distributional properties), then:
p(z = k | D) = E[p_k | D] = (d_k + n_k) / Σ_{j=1}^K (d_j + n_j),  k = 1, …, K − 1,
and p(z = K | D) = 1 − Σ_{k=1}^{K−1} p(z = k | D).
Thus, the posterior probabilities in Equation (24) and the minimum error classification region R k in Equation (26) can be generated within the MCMC scheme, which uses Equation (27) to approximate the predictive densities involved in Equation (24) and Equation (26).
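Putting Equation (24), the rule in Equation (26) and the Dirichlet posterior means together, the classification of a new x amounts to weighting the approximated predictive densities by the estimated prior probabilities and selecting the largest product. A schematic R fragment is given below, where pred.dens is a length-K vector of approximated predictive densities, n the vector of training sample sizes and d the vector of Dirichlet hyperparameters (all assumed to be available).

classify <- function(pred.dens, n, d = rep(1, length(n))) {
  prior <- (d + n) / sum(d + n)          # p(z = k | D) = E[p_k | D]
  post  <- pred.dens * prior
  post  <- post / sum(post)              # posterior probabilities of Equation (24)
  list(posterior = post, class = which.max(post))   # rule (26): assign to the maximum
}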

5. Simulation Study

This section presents the results of a simulation study examining the convergence of the MCMC algorithm and the performance of the BPDA_SSMN. Simulation of the training sample observations, model estimation by the MCMC algorithm and the comparison of classification results among three BPDA methods were implemented in R. The three methods consist of the two proposed BPDA_SSMN methods (i.e., BPDA_RSN and BPDA_RSt, for classifying RSN and RSt populations) and the BPDA_N of [11] (for classifying unscreened normal populations).

5.1. A Simulation Study: Convergence of the MCMC Algorithm

This simulation study considers inference for the HSSMN(Θ^(k)) model in a two-dimensional case, generating a training sample of one thousand observations, n_k = 1000, from each population π_k, k = 1, 2, 3. We considered the following specific choice of parameters, i.e., μ_k = (μ_k1, μ_k2)′, Σ_0k, μ_0k, C_q(α, β), λ_k = vec(Λ_k) = (λ_k1, …, λ_k4)′ and Ψ_k = Σ_k − Δ_k′ Σ_0k^{-1} Δ_k = {ψ_kij}, for generating synthetic data from π_k:
μ_k = (1 + k, −2 + k)′,  Σ_0k = [ 7 + ε_k  −2 ; −2  4 + ε_k ],  α = (−0.5, −0.5)′,  β = (4, 4)′,  μ_0k = (0, 0)′,  Δ_k = [ 2  1 ; 2  0 ],  Σ_k = [ 3 + ε_k  0 ; 0  1 + ε_k ],  and  ε_k = 0.1 × k.
Based on the above parameter values with p = q = 2, we simulated 200 sets of three training samples, each of size n_k = 1000, from the three populations π_k, k = 1, 2, 3. Two cases of screened populations were assumed, namely π_k : RSN_p(C_q(α, β); μ_k*, Σ_k*) and π_k : RSt_p(C_q(α, β); μ_k*, Σ_k*, ν = 5). The respective datasets were generated by using the stochastic representation of each population (see Lemma 1). Given a generated training sample, the corresponding population parameters were estimated by using the MCMC algorithm based on the HSSMN(Θ^(k)) model, Equation (12), for each screened population distribution π_k. We used μ_k = 0, Δ_k = I_2 and Ψ_k = I_2 as the initial values of the MCMC algorithm. To satisfy the objective Bayesian perspective considered by [29], we need to specify the hyper-parameters (θ_k, δ_k, Ω_k, Ω_0k, H_k, R_k, Q_k, τ_k, γ_k) of the HSSMN(Θ^(k)) model so as to be insensitive to changes in the priors. Thus, we assumed that we have no information about the parameters. To specify this, we adopted θ_k = 0, δ_k = 0, Ω_k = 10^3 I_p, Ω_0k = 10^3 I_q, H_k = 10^{-3} I_q, R_k = 10^{-3} I_p, Q_k = 10^{-3} I_q, τ_k = 10^{-3} + p + 1 and γ_k = 10^{-3} + q + 1 (see, e.g., [18]).
The MCMC samplers were based on 20,000 iterations as burn-in, followed by a further 20,000 iterations with a thinning interval of 10. Thus, final MCMC samples of size 2000 were obtained for each HSSMN(Θ^(k)) model. To save space, Table 1 provides posterior summaries only for the parameters of the π_1 distribution. Columns 4–9 of the table list the mean, MC error, standard error (s.e.) and three quantiles of the posterior samples obtained by repeatedly applying the MCMC method to the 200 sets of training samples of size n_1 = 1000; the remaining two columns list formal convergence test results for the MCMC algorithm. In estimating the Monte Carlo error (MC error) in Column 5, we used the batch mean method with 50 batches; see, e.g., [30] (pp. 39–40). The low values of the MC errors indicate that the variability of each estimate due to the simulation is well controlled. The table also compares the MCMC results with the true parameter values (listed in Column 3): (i) each parameter value in Column 3 is located within the credible interval (2.5% quantile, 97.5% quantile); (ii) for each parameter, the difference between its true value and the corresponding posterior mean is less than twice the standard error (s.e.). Thus, the posterior summaries, obtained by using the weakly informative priors, indicate that the MCMC method based on the HSSMN(Θ^(1)) model performs well in estimating the population parameters, regardless of the SSMN model (RSN or RSt) considered.
Table 1. Posterior summaries of 200 Markov chain Monte Carlo (MCMC) results for the π_1 models.
Model (π_1) | Parameter | True | Mean | MC Error | s.e. | 2.5% | Median | 97.5% | R_c | p-Value
RSN | μ_11 | 2.000 | 1.966 | 0.003 | 0.064 | 1.882 | 1.964 | 2.149 | 1.014 | 0.492
RSN | μ_12 | −1.000 | −0.974 | 0.002 | 0.033 | −1.023 | −0.974 | −0.903 | 1.011 | 0.164
RSN | λ_11 | 0.312 | 0.320 | 0.008 | 0.159 | 0.046 | 0.322 | 0.819 | 1.021 | 0.944
RSN | λ_12 | 0.406 | 0.407 | 0.007 | 0.164 | 0.030 | 0.417 | 0.872 | 1.018 | 0.107
RSN | λ_13 | 0.250 | 0.253 | 0.004 | 0.083 | 0.082 | 0.256 | 0.439 | 1.019 | 0.629
RSN | λ_14 | 0.125 | 0.133 | 0.004 | 0.067 | 0.003 | 0.133 | 0.408 | 1.017 | 0.761
RSN | ψ_111 | 1.968 | 2.032 | 0.005 | 0.130 | 1.743 | 2.008 | 2.265 | 1.034 | 0.634
RSN | ψ_112 | −0.625 | −0.627 | 0.002 | 0.098 | −0.821 | −0.617 | −0.405 | 1.022 | 0.778
RSN | ψ_122 | 0.500 | 0.566 | 0.001 | 0.039 | 0.465 | 0.557 | 0.638 | 1.018 | 0.445
RSt | μ_11 | 2.000 | 2.036 | 0.004 | 0.069 | 1.867 | 2.050 | 2.166 | 1.015 | 0.251
RSt | μ_12 | −1.000 | −1.042 | 0.003 | 0.036 | −1.137 | −1.054 | −0.974 | 1.012 | 0.365
RSt | λ_11 | 0.312 | 0.318 | 0.008 | 0.072 | 0.186 | 0.320 | 0.601 | 1.017 | 0.654
RSt | λ_12 | 0.406 | 0.405 | 0.006 | 0.074 | 0.262 | 0.414 | 0.562 | 1.019 | 0.712
RSt | λ_13 | 0.250 | 0.255 | 0.005 | 0.051 | 0.113 | 0.257 | 0.387 | 1.023 | 0.661
RSt | λ_14 | 0.125 | 0.136 | 0.005 | 0.055 | 0.027 | 0.133 | 0.301 | 1.019 | 0.598
RSt | ψ_111 | 1.968 | 1.906 | 0.006 | 0.108 | 1.781 | 1.996 | 2.211 | 1.023 | 0.481
RSt | ψ_112 | −0.625 | −0.620 | 0.003 | 0.101 | −0.818 | −0.615 | −0.422 | 1.021 | 0.541
RSt | ψ_122 | 0.500 | 0.459 | 0.002 | 0.044 | 0.366 | 0.457 | 0.578 | 1.016 | 0.412
Some of the trace plots from an MCMC run are provided in Figure 1. Each plot shows a parallel zone centered near the true value of the parameter of interest, with no obvious trend or periodicity. These plots and the small MC error values listed in Table 1 support the convergence of the MCMC algorithm. For a formal diagnostic check, we calculated the Brooks and Gelman diagnostic statistic R_c (the adjusted shrinkage factor introduced by [31]) using MCMC runs with three chains in parallel, each starting from different initial values. The calculated R_c value for each parameter is listed in the 10th column of Table 1; all of the R_c values are close to one, indicating convergence of the MCMC algorithm. As another formal diagnostic check, we applied the Heidelberger–Welch diagnostic tests of [32] to the single-chain MCMC runs that were used to plot Figure 1. They consist of a stationarity test and a half-width test for the MCMC run of each parameter. The 11th column of Table 1 lists the p-value of the test for the stationarity of the single Markov chain, where all of the p-values are larger than 0.1. Furthermore, all of the half-width tests, which test the convergence of the Markov chain of a single parameter, were passed. Thus, all of the diagnostic checks (formal and informal) support the convergence of the proposed MCMC algorithm, and hence, we can say that it generates MCMC samples from the marginal posterior distributions of interest (i.e., of the SSMN population parameters). Similar estimation results to those in Table 1 were obtained for the posterior summaries of the other parameters of the π_2 and π_3 distributions. According to these simulation results, the MCMC algorithm constructed in Section 3.3 provides an efficient method for estimating the SSMN distributions. To achieve this quality of the MCMC algorithm in higher dimensional cases (with large p and/or q), the diagnostic tests considered in this section should be used to monitor the convergence of the algorithm; for more details, see [30].
Figure 1. Trace plots of μ_11, μ_12, λ_11 and ψ_111 generated from the HSSMN(Θ^(1)) model of the RSt distribution with ν = 5.
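The diagnostics quoted above (the Brooks–Gelman statistic R_c and the Heidelberger–Welch tests) are implemented in the coda R package. A minimal sketch, with placeholder chains standing in for the stored posterior draws of a single parameter from three parallel runs, is:

library(coda)

# placeholder chains; in practice these are the retained MCMC draws of one parameter
chain1 <- rnorm(2000); chain2 <- rnorm(2000); chain3 <- rnorm(2000)

chains <- mcmc.list(mcmc(chain1), mcmc(chain2), mcmc(chain3))
gelman.diag(chains)        # potential scale reduction factor R_c (Brooks and Gelman [31])
heidel.diag(mcmc(chain1))  # stationarity and half-width tests (Heidelberger and Welch [32])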

5.2. A Simulation Study: Performance of the Predictive Methods

This simulation study compares the performance of three BPDA methods using training samples generated from three rectangle screened populations, π k ( k = 1 , 2 , 3 ). The three methods compared are B P D A R S N , B P D A R S t with degrees of freedom ν = 5 , and B P D A N (a standard predictive method with no screening). Two different cases of rectangle screened population distributions were used to generate the training samples. One case is the rectangle screened population π k with the R S N p ( C q ( α , β ) ; μ k * , Σ k * ) distribution. The other case is π k with the R S t p ( C q ( α , β ) ; μ k * , Σ k * , ν = 5 ) distribution in order to examine the robustness of B P D A R S t in discriminating observations from heavily-tailed empirical distributions. For each case, we obtained 200 sets of training and validation (or testing) samples of each size n k = 20 , 50 , 100 generated from the rectangle screened distribution of π k . They are denoted by D k ( i ) and V k ( i ) ( i = 1 , , 200 ) . The i-th validation sample V k ( i ) that corresponds to the training D k ( i ) sample was simply obtained by setting V k ( i ) = D k ( i - 1 ) ( i = 1 , , 200 ) , where D k ( 0 ) = D k ( 200 ) .
The parameter values of the screened population distributions of the three populations π k were given by:
μ_k* = ( 0_q′, ε(−2 + k) 1_p′ )′,  Σ_k* = [ I_q  Δ′ ; Δ  I_p ],  Δ = ρ J_{p×q},  α = a 1_q,  β = 1_q,
for p = 2, 5, q = 2 and k = 1, 2, 3. Further, we assumed that the parameters μ_0k and Σ_0k of the underlying q-dimensional screening vector v_0 and the rectangle screening region C_q(α, β) were known and as given above. Thus, we may investigate the performance of the BPDA methods by varying the correlation ρ, the dimension p of the predictor vector, the rectangle screening region and the differences among the three population means and covariance matrices, whose expressions can be found in [4]. Here, 1_r is an r × 1 summing vector whose every element is unity, and J_{p×q} denotes a p × q matrix (here q = 2) whose every odd row is equal to (1, 0) and every even row is (0, 1).
Using the training samples, we calculated the approximate predictive densities Equation (27) by the MCMC algorithm proposed in Section 3.3. In this calculation, we assumed that p k = 1 / 3 , because n 1 = n 2 = n 3 = n . Thus, the posterior probabilities in Equation (24) and the minimum error classification region R k in Equation (26) can be estimated within the MCMC scheme, which uses Equation (27) to approximate the predictive densities involved in both Equation (24) and Equation (26). Then, we estimated the classification error rates of the three BPDA methods by using the validation samples, V k ( i ) ( i = 1 , , 200 ) . To apply the B P D A R S N and B P D A R S t methods for classifying the simulated training samples, we used the optimal classification rule, which uses Equation (26), while we used the posterior odds ratio given in [11] to implement the B P D A N method. Then, we compare the classification results in terms of error rates. The error rate of each population ( E R π k ) and the total error rate (Total E R ) were estimated by:
Total ER = Σ_{k=1}^3 p_k ER_{π_k}  and  ER_{π_k} = n_k*/n_k,  k = 1, 2, 3,
where n k * is the number of misclassified observations out of n k validation sample observations from π k .
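In R, these error rates can be computed from the true and predicted labels of a validation sample as in the following sketch, where z, z.hat and prior are assumed to be the vector of true labels, the vector of predicted labels and the vector of prior probabilities p_k, respectively:

error.rates <- function(z, z.hat, prior) {
  er.k  <- sapply(sort(unique(z)), function(k) mean(z.hat[z == k] != k))  # ER_pi_k
  total <- sum(prior * er.k)                                              # Total ER
  list(per.population = er.k, total = total)
}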
For each case of the π_k distributions, the above procedure was implemented on each set of 200 validation samples to evaluate the error rates of the BPDA methods. Here, [Case 1] denotes that the training (and validation) samples were generated from π_k : RSN_p(C_q(α, β); μ_k*, Σ_k*), and [Case 2] indicates that they were generated from π_k : RSt_p(C_q(α, β); μ_k*, Σ_k*, ν = 5), k = 1, 2, 3. For each case, Table 2 compares the mean classification error rates obtained from the 200 replicated classifications using the BPDA methods. The error rates and their standard errors in Table 2 indicate the following. (i) Both the BPDA_RSN and BPDA_RSt methods work reasonably well in classifying screened observations compared to the BPDA_N method; that is, they provide better classification results than the BPDA_N, provided that the π_k's are screened by a rectangle screening scheme. (ii) The performance of the BPDA_SSMN methods improves as the correlation (ρ) between the screening variables and the predictor variables becomes larger. (iii) Comparing the error rates with respect to the values of a, we see that the BPDA_SSMN methods tend to yield better performance in the discrimination of data screened by a smaller rectangle screening region. (iv) The performance of the three BPDA methods improves as the differences among the population means increase. (v) An increase in the dimension p and the training sample size n also tends to yield better performance of the BPDA methods. (vi) As expected, the performance of the BPDA_RSN in [Case 1] is better than that of the other two methods, in that the differences between the error rate estimates exceed the corresponding two standard errors. Further, the considerable gain in the error rates over the BPDA_N demonstrates the utility of the BPDA_RSN in the discriminant analysis. (vii) As for [Case 2], the table indicates that the performance of the BPDA_RSt is better than that of the other two methods. This demonstrates the robustness of the BPDA_RSt method in discrimination with screened and heavy-tailed data.
Table 2. Classification error rates; the respective standard errors are in parentheses.
p | n | a | Method | ρ = 0.5, ε = 0.1 | ρ = 0.5, ε = 0.4 | ρ = 0.9, ε = 0.1 | ρ = 0.9, ε = 0.4
[Case 1]
2 | 20 | 0.5 | BPDA_RSN | 0.322 (0.0025) | 0.174 (0.0022) | 0.281 (0.0024) | 0.106 (0.0020)
2 | 20 | 0.5 | BPDA_RSt | 0.335 (0.0025) | 0.185 (0.0023) | 0.306 (0.0025) | 0.115 (0.0021)
2 | 20 | 0.5 | BPDA_N | 0.350 (0.0025) | 0.206 (0.0023) | 0.356 (0.0025) | 0.205 (0.0021)
2 | 20 | −0.5 | BPDA_RSN | 0.329 (0.0027) | 0.182 (0.0023) | 0.301 (0.0025) | 0.134 (0.0021)
2 | 20 | −0.5 | BPDA_RSt | 0.348 (0.0024) | 0.193 (0.0022) | 0.319 (0.0024) | 0.142 (0.0021)
2 | 20 | −0.5 | BPDA_N | 0.349 (0.0025) | 0.201 (0.0023) | 0.349 (0.0025) | 0.192 (0.0020)
2 | 100 | 0.5 | BPDA_RSN | 0.303 (0.0016) | 0.161 (0.0014) | 0.266 (0.0015) | 0.097 (0.0013)
2 | 100 | 0.5 | BPDA_RSt | 0.316 (0.0017) | 0.165 (0.0013) | 0.275 (0.0015) | 0.101 (0.0013)
2 | 100 | 0.5 | BPDA_N | 0.351 (0.0025) | 0.186 (0.0023) | 0.356 (0.0025) | 0.186 (0.0021)
2 | 100 | −0.5 | BPDA_RSN | 0.306 (0.0015) | 0.163 (0.0014) | 0.282 (0.0014) | 0.116 (0.0013)
2 | 100 | −0.5 | BPDA_RSt | 0.318 (0.0017) | 0.168 (0.0015) | 0.291 (0.0015) | 0.121 (0.0013)
2 | 100 | −0.5 | BPDA_N | 0.338 (0.0024) | 0.172 (0.0023) | 0.337 (0.0026) | 0.170 (0.0021)
5 | 20 | 0.5 | BPDA_RSN | 0.318 (0.0025) | 0.158 (0.0022) | 0.240 (0.0024) | 0.101 (0.0020)
5 | 20 | 0.5 | BPDA_RSt | 0.327 (0.0026) | 0.175 (0.0023) | 0.276 (0.0025) | 0.114 (0.0021)
5 | 20 | 0.5 | BPDA_N | 0.337 (0.0026) | 0.183 (0.0023) | 0.332 (0.0025) | 0.184 (0.0020)
5 | 20 | −0.5 | BPDA_RSN | 0.321 (0.0025) | 0.165 (0.0023) | 0.231 (0.0025) | 0.109 (0.0021)
5 | 20 | −0.5 | BPDA_RSt | 0.330 (0.0026) | 0.207 (0.0023) | 0.318 (0.0025) | 0.141 (0.0021)
5 | 20 | −0.5 | BPDA_N | 0.345 (0.0026) | 0.216 (0.0024) | 0.346 (0.0025) | 0.218 (0.0021)
5 | 100 | 0.5 | BPDA_RSN | 0.280 (0.0015) | 0.150 (0.0014) | 0.233 (0.0015) | 0.084 (0.0012)
5 | 100 | 0.5 | BPDA_RSt | 0.291 (0.0016) | 0.153 (0.0015) | 0.249 (0.0015) | 0.092 (0.0013)
5 | 100 | 0.5 | BPDA_N | 0.307 (0.0025) | 0.186 (0.0023) | 0.308 (0.0025) | 0.189 (0.0021)
5 | 100 | −0.5 | BPDA_RSN | 0.291 (0.0016) | 0.163 (0.0014) | 0.239 (0.0015) | 0.103 (0.0013)
5 | 100 | −0.5 | BPDA_RSt | 0.294 (0.0016) | 0.169 (0.0015) | 0.253 (0.0015) | 0.117 (0.0013)
5 | 100 | −0.5 | BPDA_N | 0.305 (0.0024) | 0.175 (0.0022) | 0.301 (0.0025) | 0.176 (0.0021)
[Case 2]
2 | 20 | 0.5 | BPDA_RSN | 0.351 (0.0025) | 0.189 (0.0022) | 0.310 (0.0025) | 0.114 (0.0021)
2 | 20 | 0.5 | BPDA_RSt | 0.320 (0.0024) | 0.175 (0.0023) | 0.293 (0.0024) | 0.105 (0.0020)
2 | 20 | 0.5 | BPDA_N | 0.367 (0.0026) | 0.185 (0.0023) | 0.365 (0.0024) | 0.191 (0.0020)
2 | 20 | −0.5 | BPDA_RSN | 0.349 (0.0026) | 0.192 (0.0022) | 0.317 (0.0024) | 0.149 (0.0022)
2 | 20 | −0.5 | BPDA_RSt | 0.321 (0.0023) | 0.183 (0.0021) | 0.304 (0.0023) | 0.132 (0.0021)
2 | 20 | −0.5 | BPDA_N | 0.356 (0.0025) | 0.210 (0.0023) | 0.357 (0.0025) | 0.199 (0.0020)
2 | 100 | 0.5 | BPDA_RSN | 0.313 (0.0016) | 0.164 (0.0015) | 0.273 (0.0015) | 0.098 (0.0014)
2 | 100 | 0.5 | BPDA_RSt | 0.306 (0.0015) | 0.158 (0.0013) | 0.265 (0.0014) | 0.091 (0.0012)
2 | 100 | 0.5 | BPDA_N | 0.346 (0.0023) | 0.179 (0.0022) | 0.341 (0.0024) | 0.175 (0.0022)
2 | 100 | −0.5 | BPDA_RSN | 0.321 (0.0015) | 0.170 (0.0014) | 0.287 (0.0015) | 0.119 (0.0015)
2 | 100 | −0.5 | BPDA_RSt | 0.310 (0.0014) | 0.164 (0.0013) | 0.281 (0.0013) | 0.112 (0.0013)
2 | 100 | −0.5 | BPDA_N | 0.329 (0.0025) | 0.181 (0.0025) | 0.327 (0.0027) | 0.176 (0.0022)
5 | 20 | 0.5 | BPDA_RSN | 0.329 (0.0024) | 0.181 (0.0024) | 0.281 (0.0023) | 0.119 (0.0021)
5 | 20 | 0.5 | BPDA_RSt | 0.317 (0.0023) | 0.164 (0.0020) | 0.265 (0.0021) | 0.094 (0.0020)
5 | 20 | 0.5 | BPDA_N | 0.340 (0.0027) | 0.196 (0.0024) | 0.314 (0.0026) | 0.152 (0.0022)
5 | 20 | −0.5 | BPDA_RSN | 0.342 (0.0025) | 0.205 (0.0024) | 0.332 (0.0024) | 0.194 (0.0024)
5 | 20 | −0.5 | BPDA_RSt | 0.328 (0.0022) | 0.171 (0.0022) | 0.275 (0.0022) | 0.118 (0.0021)
5 | 20 | −0.5 | BPDA_N | 0.351 (0.0026) | 0.224 (0.0025) | 0.329 (0.0025) | 0.175 (0.0025)
5 | 100 | 0.5 | BPDA_RSN | 0.284 (0.0016) | 0.155 (0.0018) | 0.283 (0.0016) | 0.154 (0.0013)
5 | 100 | 0.5 | BPDA_RSt | 0.271 (0.0014) | 0.149 (0.0014) | 0.238 (0.0014) | 0.086 (0.0011)
5 | 100 | 0.5 | BPDA_N | 0.294 (0.0026) | 0.192 (0.0024) | 0.274 (0.0026) | 0.161 (0.0024)
5 | 100 | −0.5 | BPDA_RSN | 0.289 (0.0016) | 0.177 (0.0015) | 0.288 (0.0016) | 0.175 (0.0013)
5 | 100 | −0.5 | BPDA_RSt | 0.278 (0.0013) | 0.162 (0.0013) | 0.231 (0.0014) | 0.107 (0.0011)
5 | 100 | −0.5 | BPDA_N | 0.312 (0.0025) | 0.178 (0.0025) | 0.270 (0.0026) | 0.141 (0.0022)

6. Conclusions

In this paper, we proposed an optimal predictive method (BPDA) for the discriminant analysis of multidimensional screened data. In order to incorporate the prior information about a screening mechanism flexibly in the analysis, we introduced the SSMN models. Then, we provided the HSSMN ( Θ ( k ) ) model for Bayesian inference of the SSMN populations, where the screened data were generated. Based on the HSSMN ( Θ ( k ) ) model, posterior distributions of Θ ( k ) were derived, and the calculation of the optimal predictive classification rule was discussed by using an efficient MCMC method. Numerical studies with simulated screened observations were given to illustrate the convergence of the MCMC method and the usefulness of the BPDA.
The methodological results of the Bayesian estimation procedure proposed in this paper can be extended to other multivariate linear models that incorporate non-normal errors, a general covariance matrix and truncated random covariates. For example, the seemingly unrelated regression (SUR) model and the factor analysis model (see, e.g., [19]) can be treated in the same framework as the proposed HSSMN(Θ^(k)) model in Equation (12). The former is a special case of the HSSMN(Θ^(k)) model in which the Z_k's are observable as predictors; therefore, when the regression errors are non-normal, it would be plausible to apply the proposed approach with the HSSMN(Θ^(k)) model to obtain a robust SUR model. The latter is a natural extension of the oblique factor analysis model to the case with non-normal measurement errors. The HSSMN(Θ^(k)) model can also be extended to accommodate missing values, as done in other models by [33,34]. We hope to address these issues in the near future.

Acknowledgments

The research of Hea-Jung Kim was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013R1A2A2A01004790).

Author Contributions

The author developed a Bayesian predictive method for the discriminant analysis of screened data. For the multivariate technique, the author introduced a predictive discrimination method with the SSMN populations ( B P D A S S M N ) and provided a Bayesian estimation methodology, which is suited to the B P D A S S M N . The methodology consists of constructing a hierarchical model for the SSMN populations (HSSMN ( Θ ( k ) ) ) and using an efficient MCMC algorithm to estimate the SSMN models, as well as an optimal rule for the B P D A S S M N .

Conflicts of Interest

The author declares no conflict of interest.

Appendix

Derivations of the full conditional posterior distributions
(1)
The full conditional posterior density of μ k given μ 0 k , λ k , Ψ k , F k , Σ 0 k , η k and D k is proportional to:
∏_{i=1}^{n_k} φ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki)Ψ_k) φ_p(μ_k; θ_k, Ω_k) ∝ exp{ −(1/2)(μ_k − μ_μk)′ Σ_μk^{-1} (μ_k − μ_μk) },
which is a kernel of the N_p(μ_μk, Σ_μk) distribution.
(2)
It is obvious from the joint posterior density in Equation (13).
(3)
It is straightforward to see from Equation (13) that the full conditional posterior density of λ k is given by:
p(Λ_k | Θ^(k) ∖ Λ_k, D_k) ∝ exp{ −(1/2) tr[Ψ_k^{-1} V_k] } ∝ exp{ −(1/2) tr[Ψ_k^{-1}(Λ_k − Λ_k*) Q_k (Λ_k − Λ_k*)′] } ∝ exp{ −(1/2)(λ_k − μ_λk)′ Σ_λk^{-1} (λ_k − μ_λk) }.
This is a kernel of N_pq(μ_λk, Σ_λk), where V_k = (X_k − Λ_k F_k) D(κ(η_k))^{-1} (X_k − Λ_k F_k)′ + (Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′ + R_k and ν_k = n_k + q + τ_k.
(4)
We see from Equation (13) that the full conditional posterior density of Ψ k is given by:
p(Ψ_k | Θ^(k) ∖ Ψ_k, D_k) ∝ |Ψ_k|^{−(n_k + q)/2} exp{ −(1/2) tr[Ψ_k^{-1}(X_k − Λ_k F_k) D(κ(η_k))^{-1} (X_k − Λ_k F_k)′] } × exp{ −(1/2) tr[Ψ_k^{-1}(Λ_k − Λ_0k) H_k (Λ_k − Λ_0k)′] } × IW_p(Ψ_k; R_k, τ_k) ∝ |Ψ_k|^{−(n_k + τ_k + q)/2} exp{ −(1/2) tr[Ψ_k^{-1} V_k] }.
This is a kernel of IW_p(V_k, ν_k).
(5)
We see, from Equation (13), that the full conditional posterior densities of f k i ’s are independent, and each density is given by:
p(f_ki | Θ^(k) ∖ f_ki, D_k) ∝ φ_q(f_ki; 0, κ(η_ki)Σ_0k) φ_p(x_ki; μ_k + Λ_k f_ki, κ(η_ki)Ψ_k) I(f_ki ∈ C_q(a_k, b_k)) ∝ exp{ −(1/(2κ(η_ki))) [ f_ki′(Σ_0k^{-1} + Λ_k′ Ψ_k^{-1} Λ_k) f_ki − 2 f_ki′ Λ_k′ Ψ_k^{-1}(x_ki − μ_k) ] } I(f_ki ∈ C_q(a_k, b_k)) ∝ exp{ −(1/(2κ(η_ki))) (f_ki − μ_fki)′ Σ_fki^{-1} (f_ki − μ_fki) } I(f_ki ∈ C_q(a_k, b_k)),
which is a kernel of the q-variate truncated normal N_q(μ_fki, κ(η_ki)Σ_fki) I(f_ki ∈ C_q(a_k, b_k)).
(6)
It is obvious from the joint posterior density in Equation (13).
(7)
It is obvious from the joint posterior density in Equation (13).

References

  1. Catsiapis, G.; Robinson, C. Sample selection bias with multiple selection rules: An application to student aid grants. J. Econom. 1982, 18, 351–368. [Google Scholar] [CrossRef]
  2. Mohanty, M.S. Determination of participation decision, hiring decision, and wages in a double selection framework: Male-female wage differentials in the U.S. labor market revisited. Contemp. Econ. Policy 2001, 19, 197–212. [Google Scholar] [CrossRef]
  3. Kim, H.J. A class of weighted multivariate normal distributions and its properties. J. Multivar. Anal. 2008, 99, 1758–1771. [Google Scholar] [CrossRef]
  4. Kim, H.J.; Kim, H.-M. A class of rectangle-screened multivariate normal distributions and its applications. Statistics 2015, 49, 878–899. [Google Scholar] [CrossRef]
  5. Lin, T.I.; Ho, H.J.; Chen, C.L. Analysis of multivariate skew normal models with incomplete data. J. Multivar. Anal. 2009, 100, 2337–2351. [Google Scholar] [CrossRef]
  6. Arellano-Valle, R.B.; Branco, M.D.; Genton, M.G. A unified view of skewed distributions arising from selections. J. Can. Stat. 2006, 34, 581–601. [Google Scholar]
  7. Kim, H.J. Classification of a screened data into one of two normal populations perturbed by a screening scheme. J. Multivar. Anal. 2011, 102, 1361–1373. [Google Scholar] [CrossRef]
  8. Kim, H.J. A best linear threshold classification with scale mixture of skew normal populations. Comput. Stat. 2015, 30, 1–28. [Google Scholar] [CrossRef]
  9. Marchenko, Y.V.; Genton, M.G. A Heckman selection-t model. J. Am. Stat. Assoc. 2012, 107, 304–315. [Google Scholar] [CrossRef]
  10. Sahu, S.K.; Dey, D.K.; Branco, M.D. A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 2003, 31, 129–150. [Google Scholar] [CrossRef]
  11. Geisser, S. Posterior odds for multivariate normal classifications. J. R. Stat. Soc. B 1964, 26, 69–76. [Google Scholar]
  12. Lachenbruch, P.A.; Sneeringer, C.; Revo, L.T. Robustness of the linear and quadratic discriminant function to certain types of non-normality. Commun. Stat. 1973, 1, 39–57. [Google Scholar] [CrossRef]
  13. Wang, Y.; Chen, H.; Zeng, D.; Mauro, C.; Duan, N.; Shear, M.K. Auxiliary marker-assisted classification in the absence of class identifiers. J. Am. Stat. Assoc. 2013, 108, 553–565. [Google Scholar] [CrossRef] [PubMed]
  14. Webb, A. Statistical Pattern Recognition; Wiley: New York, NY, USA, 2002. [Google Scholar]
  15. Aitchison, J.; Habbema, J.D.F.; Key, J.W. A critical comparison of two methods of statistical discrimination. Appl. Stat. 1977, 26, 15–25. [Google Scholar] [CrossRef]
  16. Azzalini, A.; Capitanio, A. Statistical applications of the multivariate skew-normal distribution. J. R. Stat. Soc. B 1999, 65, 367–389. [Google Scholar] [CrossRef]
  17. Branco, M.D. A general class of multivariate skew-elliptical distributions. J. Multivar. Anal. 2001, 79, 99–113. [Google Scholar]
  18. Chen, M-H.; Dey, D.K. Bayesian modeling of correlated binary responses via scale mixture of multivariate normal link functions. Indian J. Stat. 1998, 60, 322–343. [Google Scholar]
  19. Press, S.J. Applied Multivariate Analysis, 2nd ed.; Dover: New York, NY, USA, 2005. [Google Scholar]
  20. Reza-Zadkarami, M.; Rowhani, M. Application of skew-normal in classification of satellite image. J. Data Sci. 2010, 8, 597–606. [Google Scholar]
  21. Wang, W.L.; Fan, T.H. Bayesian analysis of multivariate t linear mixed models using a combination of IBF and Gibbs samplers. J. Multivar. Anal. 2012, 105, 300–310. [Google Scholar] [CrossRef]
  22. Wang, W.L.; Lin, T.I. Bayesian analysis of multivariate t linear mixed models with missing responses at random. J. Stat. Comput. Simul. 2015, 85. [Google Scholar] [CrossRef]
  23. Johnson, N.L.; Kotz, S. Distribution in Statistics: Continuous Multivariate Distributions; Wiley: New York, NY, USA, 1972. [Google Scholar]
  24. Wilhelm, S.; Manjunath, B.G. tmvtnorm: Truncated multivariate normal distribution and student t distribution. R J. 2010, 1, 25–29. [Google Scholar]
  25. Genz, A.; Bretz, F. Computation of Multivariate Normal and t Probabilities; Springer: New York, NY, USA, 2009. [Google Scholar]
  26. Chib, S.; Greenberg, E. Understanding the Metropolis-Hastings algorithm. Am. Stat. 1995, 49, 327–335. [Google Scholar]
  27. Chen, M.-H.; Schmeiser, B.W. Performance of the Gibbs, hit-and-run, and Metropolis samplers. J. Comput. Gr. Stat. 1993, 2, 251–272. [Google Scholar] [CrossRef]
  28. Anderson, T.W. Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: Hoboken, NJ, USA, 2003. [Google Scholar]
  29. Edwards, W.; Lindman, H.; Savage, L.J. Bayesian statistical inference for psychological research. Psychol. Rev. 1963, 70, 192–242. [Google Scholar] [CrossRef]
  30. Ntzoufras, I. Bayesian Modeling Using WinBUGS; Wiley: New York, NY, USA, 2009. [Google Scholar]
  31. Brooks, S.; Gelman, A. Alternative methods for monitoring convergence of iterative simulations. J. Comput. Gr. Stat. 1998, 7, 434–455. [Google Scholar]
  32. Heidelberger, P.; Welch, P. Simulation run length control in the presence of an initial transient. Oper. Res. 1983, 31, 1109–1144. [Google Scholar] [CrossRef]
  33. Lin, T.I. Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Comput. Stat. Data Anal. 2014, 71, 183–195. [Google Scholar] [CrossRef]
  34. Lin, T.I.; Ho, H.J.; Chen, C.L. Analysis of multivariate skew normal models with incomplete data. J. Multivar. Anal. 2009, 100, 2337–2351. [Google Scholar] [CrossRef]
