A Bayesian Predictive Discriminant Analysis with Screened Data

In applications of discriminant analysis, a situation sometimes arises in which individual measurements are screened by a multidimensional screening scheme. For this situation, we consider discriminant analysis with screened populations from a Bayesian viewpoint and propose an optimal predictive rule for the analysis. To establish a flexible method for incorporating prior information about the screening mechanism, we propose a hierarchical screened scale mixture of normal (HSSMN) model, which allows flexible modeling of the screened observations. A Markov chain Monte Carlo (MCMC) method, using the Gibbs sampler with Metropolis–Hastings steps, is used to perform Bayesian inference on the HSSMN model and to approximate the optimal predictive rule. A simulation study demonstrates the performance of the proposed predictive discrimination procedure.


Introduction
The topic of analyzing multivariate screened data has received a great deal of attention over the last few decades. In the standard multivariate problem, the analysis of data generated from a p-dimensional screened random vector x =_d [v | v_0 ∈ C_q] is the issue of interest, where the p × 1 random vector v and the q × 1 random vector v_0 (called the screening vector) are jointly distributed with a nonzero correlation matrix, Corr(v, v_0) ≠ 0. Thus, we observe x only when the unobservable screening vector v_0 belongs to a known subset C_q of its space R^q, such that 0 < P(v_0 ∈ C_q) < 1. That is, x is subject to a screening scheme, or hidden truncation (or simply truncation if v = v_0). Model parameters underlying the joint distribution of v and v_0 are then estimated from the screened data (i.e., observations of x) using the conditional density f(x | v_0 ∈ C_q).
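The screening mechanism above can be mimicked by rejection sampling: draw (v_0, v) jointly and keep v only when v_0 falls in the rectangle. The following sketch illustrates this under purely illustrative assumptions (a joint normal with a single nonzero cross-correlation of 0.5 and the rectangle C_q = (0, ∞)²); it is not the paper's model, only the screening idea.

```python
import numpy as np

# Illustrative sketch: generate screened observations x =d [v | v0 in C_q]
# by rejection sampling from a joint normal for (v0, v). Dimensions, means,
# covariance, and rectangle bounds below are made-up assumptions.
rng = np.random.default_rng(0)
q, p = 2, 2                      # screening and observed dimensions
mu = np.zeros(q + p)
Sigma = np.eye(q + p)
Sigma[0, 2] = Sigma[2, 0] = 0.5  # nonzero Corr(v, v0): screening matters
alpha = np.array([0.0, 0.0])     # lower bounds of the rectangle C_q
beta = np.array([np.inf, np.inf])

def draw_screened(n):
    """Keep v only when the (unobserved) v0 falls in the rectangle C_q."""
    out = []
    while len(out) < n:
        w = rng.multivariate_normal(mu, Sigma, size=4 * n)
        v0, v = w[:, :q], w[:, q:]
        keep = np.all((v0 > alpha) & (v0 < beta), axis=1)
        out.extend(v[keep])
    return np.array(out[:n])

x = draw_screened(500)
# Screening through a positively correlated v0 shifts the mean of the
# first coordinate of v upward, away from its unscreened mean of 0.
print(x.mean(axis=0))
```

The skewing effect mentioned in the next paragraph is visible here: the screened first coordinate is no longer centered at zero, even though the unscreened v is.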
Screening of the sample (or sample selection) often arises in practice as a result of controls on the observability of the outcome of interest. For example, suppose a dataset consists of the Otis IQ test scores (the values of x) of the freshmen of a college. These students were screened in the college admission process, which examines whether their prior school grade point average (GPA) and Scholastic Aptitude Test (SAT) scores (i.e., the screening values, denoted by v_0) are satisfactory. The true value of the screening vector v_0 for each student may not be available due to college regulations. The observations available are the IQ values of x: the screened data. For applications with real screened data, one can refer to the student aid grants data given by [1] and the U.S. labor market data given by [2]. A variety of methods have been suggested for analyzing such screened data. See, e.g., [3-6] for various distributions for modeling and analyzing screened data; see [7,8] for estimative classification analysis with screened data; and see, e.g., [1,2,9,10], for regression analysis with screened response data.
The majority of existing methods rely on the assumption that v and v_0 are jointly multivariate normal and that the screened observation vector x is subject to a univariate screening scheme defined by an open interval C_q with q = 1. In many practical situations, however, the screened data are generated from a non-normal joint distribution of v and v_0, with a multivariate screening scheme defined by a q-dimensional (q > 1) rectangular region C_q of v_0. In this case, a difficulty in applications with the screened data is that the empirical distribution of the screened data is skewed and its parametric model involves a complex density; hence, standard methods of analysis cannot be used. See [4,6] for conditional densities f(x | v_0 ∈ C_q) useful for fitting rectangle-screened data generated from a non-normal joint distribution of v and v_0. In this article, we develop yet another multivariate technique applicable to the analysis of rectangle-screened data: we construct a Bayesian predictive discrimination procedure for such data. More precisely, we consider a Bayesian multivariate technique for the sorting, grouping and prediction of multivariate data generated from K rectangle-screened populations. In the standard problem, a training sample D = {(z_i, x_i), i = 1, ..., n} is available, where, for each i = 1, ..., n, x_i is a p × 1 rectangle-screened observation vector coming from one of the K populations and taking values in R^p, and z_i is a categorical response variable representing the population membership, so that z_i = k implies that the predictor x_i belongs to the k-th rectangle-screened population (denoted by π_k), k = 1, ..., K.
Using the training sample D, the goal of predictive discriminant analysis is to predict the population membership of a new screened observation x based on the posterior probability of x belonging to π_k. The posterior probability is given by p(z = k | D, x) ∝ p(x | D, z = k) p(z = k | D), where z is the population membership of x and p(z = k | D) is the prior probability of π_k updated by the training sample D. Here, p(x | D, z = k) = ∫ p(x | Θ_k) p(Θ_k | D) dΘ_k denotes the predictive density, where p(x | Θ_k) is the probability density of x and p(Θ_k | D) is the posterior density of the parameters Θ_k associated with π_k. One of the first and most widely applied predictive approaches, due to [11], treats unscreened, normally-distributed populations π_k with unknown parameters. This is called Bayesian predictive discriminant analysis with normal populations (BPDA_N), in which a multivariate Student t distribution is obtained for Equation (2).
A practical example where predictive discriminant analysis with rectangle-screened populations (π_k's) is applicable is the discrimination between passed and failed pairs of applicants in a college admission process (the second screening process). Consider the case where college admission officers wish to set up an objective criterion (with a predictor vector x) for admitting students for matriculation; however, the admission officers must first ensure that a student with observation x has passed the first screening process. The first screening scheme may be defined by a q-dimensional region C_q of the random vector v_0 (consisting of SAT scores, high-school GPA, and so on), so that only students who satisfy v_0 ∈ C_q can proceed to the admission process. In this case, we encounter a crucial problem in applying the normal classification of [11]: given the screening scheme v_0 ∈ C_q, the assumption of a multivariate normal population distribution for x is no longer tenable. The works [7,12] found that the normal classification lacks robustness to departures from normality of the population distribution; hence, its performance can be very misleading if used with a continuous but non-normal or screened normal input vector x.
Thus, the predictive density in Equation (2) has two specific features to be considered for Bayesian predictive discrimination with rectangle-screened populations: one concerns the prior distribution of the parameters Θ_k, and the other concerns the distributional assumption of the population model with density p(x | Θ_k). For the unscreened case, a variety of studies address these two considerations. See, for example, [11,13,14] for the choice of the prior distributions of Θ_k, and see [15,16] for copious references to the literature on predictive discriminant analysis with non-normal population models. Meanwhile, to derive Equation (2) for a rectangle-screened observation x, we need a population model with density p(x | Θ_k) that uses the screened sample information, in order to maintain consistency with the underlying theory of the populations π_k generating the screened sample. We therefore propose a Bayesian hierarchical approach that flexibly incorporates prior knowledge about Θ_k with the non-normal sample information; this is the main contribution of this paper to the literature on Bayesian predictive discriminant analysis.
The rest of this paper is organized as follows. Section 2 considers a class of screened scale mixture of normal (SSMN) population models, which accounts for the screening scheme conducted through a q-dimensional rectangular region C_q of an external scale mixture of normal vector v_0. Section 3 proposes a hierarchical screened scale mixture of normal (HSSMN) model to derive the predictive density in Equation (2) and proposes an optimal rule for Bayesian predictive discriminant analysis (BPDA) with the SSMN populations (abbreviated as BPDA_SSMN). Approximation of the rule is studied in Section 4 using an MCMC method applied to the HSSMN model. In Section 5, a simulation study is conducted to check the convergence of the MCMC method and the performance of the BPDA_SSMN by comparing it with the BPDA_N. Finally, concluding remarks are given in Section 6.

The SSMN Population Distributions
Assume that the joint distribution of the respective q × 1 and p × 1 vector variables v_0 and v associated with π_k is F ∈ F, where F is the class of distributions whose density is a scale mixture of s-variate normal densities, f(w) = ∫ φ_s(w | μ*_k, κ(η)Σ*_k) dG(η), with κ(η) > 0 and η > 0, k = 1, ..., K, and s = q + p. Here, η is a mixing variable with pdf g(η), κ(η) is a suitably chosen weight function, and μ*_k and Σ*_k are partitioned corresponding to the orders of v_0 and v. Notice that F defined by Equation (3) is a class of scale mixtures of multivariate normal (SMN) distributions (see, e.g., [17,18] for details), denoted as SMN_s(μ*_k, Σ*_k, κ(η), G) in the remainder of the paper, where G = G(η) denotes the cdf of η.
Given the joint distribution F, the SSMN distribution is defined by the screening scheme x =_d [v | v_0 ∈ C_q(α, β)], where C_q(α, β) = {v_0 ∈ R^q : α_j < v_0j ≤ β_j, j = 1, ..., q}, with α = (α_1, ..., α_q)^T, β = (β_1, ..., β_q)^T and α_j < β_j for j = 1, ..., q. This region contains C_q(α, ∞) and C_q(−∞, β) as special cases.
One particular member of the class of SSMN distributions is the rectangle-screened normal (RSN) distribution, defined by Equations (5) and (6) with G(η) degenerate and κ(η) = 1. The work in [4,8] studied properties of this distribution and denoted it as RSN_p(C_q(α, β); μ*_k, Σ*_k). Another member of the class is the rectangle-screened p-variate Student t distribution (RSt_p) considered by [8]. Its pdf is given in Equation (7), where t_p(· | a, B, c) and T_p(C_p; a, B, c) denote, respectively, the pdf and the probability of a rectangle region C_p of the p-variate Student t distribution with location vector a, scale matrix B and degrees of freedom c. The stochastic representations of the RSN and RSt_p distributions are immediately obtained by applying the following lemma, whose detailed proof can be found in [4].
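The Student t member of the class rests on the standard scale-mixture-of-normals device: with κ(η) = 1/η and η ∼ Gamma(ν/2, ν/2), the mixture of N(μ, Σ/η) laws is exactly t_ν. A minimal sketch, using illustrative dimensions and parameters (not the paper's), verifies this numerically via the marginal variance ν/(ν − 2):

```python
import numpy as np

# Sketch of the scale-mixture device behind the SSMN class: a p-variate
# Student t is a scale mixture of normals with kappa(eta) = 1/eta and
# eta ~ Gamma(nu/2, nu/2). All values below are illustrative assumptions.
rng = np.random.default_rng(1)
p, nu, n = 3, 5, 200_000
mu, Sigma = np.zeros(p), np.eye(p)

eta = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)  # E[eta] = 1
z = rng.multivariate_normal(mu, Sigma, size=n)
x = mu + z / np.sqrt(eta)[:, None]                   # x | eta ~ N(mu, Sigma/eta)

# Marginally x is t_nu: its variance is nu/(nu - 2) = 5/3 per coordinate.
print(x.var(axis=0))
```

Screening such draws through a rectangle on a correlated companion vector, as in the previous sketch, would yield rectangle-screened RSt_p-type samples.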
Then, x has the stochastic representation, in a hierarchical fashion, given in Lemma 1. Lemma 1 provides the following: (i) an intrinsic structure of the SSMN population distributions, which reveals the type of departure from the SMN law induced by the screening; (ii) the representation provides a convenient device for random number generation; (iii) it leads to a simple and direct construction of an HSSMN model for the BPDA with the SSMN populations.

The Hierarchical Model
For Bayesian predictive discriminant analysis, suppose we have K rectangle-screened populations π_k (k = 1, ..., K), each specified by an SSMN_p(C_q(α, β); μ*_k, Σ*_k, κ(η), G) distribution. The predictive discriminant analysis assesses the relative predictive odds ratio, or posterior probability, that a screened multivariate observation x belongs to one of the K populations π_k. As noted in Equation (6), however, the complex likelihood function of D_k prevents us from choosing reasonable priors for the model parameters and from obtaining the predictive density of x given by Equation (2). These problems are solved by using the following hierarchical representation of the population models.
According to Lemma 1, we may rewrite the SSMN model of Equations (8) and (9) as a three-level hierarchy, given by Equation (10), where G is the scale mixing distribution of the independent η_ki's, and f_ki and ε_ki are independent conditional on η_ki. The first-stage model in Equation (10) may be written in compact form by defining suitable vector and matrix notations; the three-level hierarchy of the model in Equation (10) can then be expressed as Equation (11), where A ⊗ B denotes the Kronecker product of two matrices A and B. Note that the hierarchical population model in Equation (11) adopts robust discriminant modeling through the use of scale mixtures of normals, such as the SMN and the truncated SMN, and thus enables us to avoid the anomalies generated by non-normal sample information.
The Bayesian analysis of the model in Equation (11) begins with the specification of prior distributions for the unknown parameters. When prior information is not available, a convenient strategy for avoiding an improper posterior distribution is to use proper priors with hyperparameters fixed at values that reflect the flatness (or diffuseness) of the priors (i.e., limiting non-informative priors). For convenience, though not always optimally, we suppose that μ_k, μ_0k, (Λ_k, Ψ_k) and Σ_0k in the model of Equation (11) are independent a priori; the prior distributions for μ_k and μ_0k are normal; an inverse Wishart prior distribution is assumed for Σ_0k; and a generalized natural conjugate family (see [19]) of prior distributions is assumed for (Λ_k, Ψ_k), so that we adopt a normal prior density for Λ_k conditional on the matrix Ψ_k. Here, W ∼ IW_m(V, ν) denotes the inverse Wishart distribution with pdf IW_m(W; V, ν). Note that if Λ_k = (λ_k1, ..., λ_kq), then λ_k ≡ vec(Λ_k) = (λ_k1^T, ..., λ_kq^T)^T and λ_0k ≡ vec(Λ_0k). This prior elicitation of the parameters, along with the three-level hierarchical model in Equation (11), produces a hierarchical screened scale mixture of normal population model, referred to as HSSMN(Θ(k)) in the rest of this paper. The HSSMN(Θ(k)) model is defined as follows.
The hyperparameters are fixed at appropriate values that reflect the flatness of the priors.

Posterior Distributions
Based on the HSSMN(Θ(k)) model structure, with the likelihood and prior distributions in Equation (12), the joint posterior distribution of Θ(k) is given by Equation (13), where the g(η_ki)'s denote the densities of the mixing variables η_ki. Note that the joint posterior in Equation (13) does not simplify to an analytic form with a known density and is thus intractable for posterior inference. Instead, we derive each conditional posterior distribution of μ_k, μ_0k, λ_k ≡ vec(Λ_k), Σ_0k, F_k, Ψ_k and the η_ki's, which are useful for posterior inference based on Markov chain Monte Carlo (MCMC) methods. All of the full conditional posterior distributions are as follows (see the Appendix for their derivations): (1) The full conditional distribution of μ_k is a p-variate normal, given by Equation (14).
(2) The full conditional density of μ_0k is given by Equation (15). (3) The full conditional posterior distribution of λ_k is given by Equation (16). (4) The full conditional posterior distribution of Ψ_k is an inverse-Wishart distribution, given by Equation (17). (5) The full conditional posterior distribution of f_ki is the q-variate truncated normal given by Equation (18). (6) The full conditional posterior density of Σ_0k is given by Equation (19). (7) The full conditional posterior densities of the η_ki's are given by Equation (20), where z_ki = x_ki − μ_k − Λ_k f_ki and the η_ki's are independent. Based on the above full conditional posterior distributions and the stochastic representations of the SSMN in Lemma 1, one can easily obtain Bayes estimates of the k-th SSMN population mean and covariance matrix. Specifically, the mean and covariance matrix of an observation x belonging to π_k: SSMN_p(C_q(α, β); μ*_k, Σ*_k, κ(η), G), which are used for calculating the Bayes estimates via Rao–Blackwellization, are given by Equation (21). We see that the moments in Equation (21) agree with the formula for the mean and covariance matrix of the untruncated marginal distribution of a general multivariate truncated distribution given by [23]. Readers are referred to [24] (the R package tmvtnorm) and [25] (the R package mvtnorm) for implementing the calculations of ξ_k and P_k involved in the first and second moments.
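The paper computes the rectangle probabilities entering these moments with the R packages mvtnorm/tmvtnorm. As a hedged Python analogue (an illustrative bivariate setup, not the paper's parameters), the rectangle probability P(v_0 ∈ C_q(α, β)) can be assembled from the joint normal cdf by inclusion-exclusion:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative assumption: a bivariate screening vector with correlation 0.3
# and the rectangle C_q = (-1, 1] x (-1, 1].
mu0 = np.zeros(2)
Sigma0 = np.array([[1.0, 0.3], [0.3, 1.0]])
alpha, beta = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

mvn = multivariate_normal(mean=mu0, cov=Sigma0)

def rect_prob(lo, hi):
    # P(lo < v0 <= hi) via inclusion-exclusion on the joint cdf
    return (mvn.cdf(hi) - mvn.cdf([lo[0], hi[1]])
            - mvn.cdf([hi[0], lo[1]]) + mvn.cdf(lo))

p_rect = rect_prob(alpha, beta)
print(p_rect)
```

The same quantity is what tmvtnorm/mvtnorm return in R; for q > 2 one would use a dedicated rectangle-probability routine rather than explicit inclusion-exclusion over all 2^q corners.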
When the sampling information, i.e., the observed training samples, is augmented with proper prior information, the anomalies of the maximum likelihood estimate of the SSMN model investigated by [16] disappear in the HSSMN(Θ(k)) model. Furthermore, note that the conditional distribution of λ_k in Equation (16) is pq-dimensional; hence, its Gibbs sampling step requires the inverse of a matrix of order pq, which may incur a high computational cost in implementing the MCMC method. For large q, a Gibbs sampler based on the full conditional posterior distributions of the λ_kj, j = 1, ..., q, is more computationally convenient than the Gibbs sampler using λ_k in Equation (16), where λ_k ≡ vec(Λ_k) and Λ_k ≡ (λ_k1, ..., λ_kq).
For this purpose, we define the following notation: for j = 1, ..., q, let (e_1, ..., e_{j−1}, e_{j+1}, ..., e_q) collect the columns of I_q other than the j-th, where e_j denotes the j-th column of I_q, namely an elementary vector with unity as its j-th element and zeros elsewhere. Furthermore, we consider partitions of λ_k, θ_k and Ω_k conformable with the separation of λ_kj from the remaining columns, where the orders of λ*_kj, θ_k(1j), Ω_k11(j) and Ω_k21(j) are (p − 1)q × 1, q × 1, q × q and (p − 1)q × q, respectively. Under these partitions, the conditional property of the multivariate normal distribution leads to the full conditional posterior distributions of the λ_kj, given by Equation (22) for j = 1, ..., q. When p is large, we may further partition λ_kj into two vectors of smaller dimension, say λ_kj = (λ_kj(1)^T, λ_kj(2)^T)^T, and then use their full conditional normal distributions in the Gibbs sampler. The posterior sampling can now be implemented by using the full conditional posterior Equations (14)-(20). The Gibbs sampler, and the Metropolis–Hastings algorithm within the Gibbs sampler, may be used to obtain posterior samples of all of the unknown parameters Θ(k). Note that when the pq-dimensional matrix is too large to manipulate computationally, the Gibbs sampler can be modified by replacing the full conditional posterior in Equation (16) with Equation (22); the modified Gibbs sampler based on Equation (22) is more convenient for numerical computation than the first one using Equation (16). The detailed Markov chain Monte Carlo algorithm with Gibbs sampling is discussed in the next subsection.

Markov Chain Monte Carlo Sampling Scheme
It is not complicated to construct an MCMC sampling scheme working with Θ(k) = {μ_k, μ_0k, Λ_k, Ψ_k, F_k, Σ_0k, η_k}, since a routine Gibbs sampler generates posterior samples of (μ_k, Λ_k, Ψ_k, F_k) from each of their full conditional posterior distributions obtained in Section 3.2. In the posterior sampling of μ_0k, Σ_0k and η_k, Metropolis–Hastings steps within the Gibbs algorithm are used, since their conditional posterior densities in Equations (15), (19) and (20) do not have the explicit forms of known distributions.
Here, for simplicity, we considered the MCMC algorithm based on the HSSMN(Θ(k)) model with a known screening scheme, in which µ 0k and Σ 0k are assumed to be known.The extension to the general HSSMN(Θ(k)) model with unknown µ 0k and Σ 0k can be made without difficulty.
The MCMC algorithm starts with some initial values of the parameters. The detailed posterior sampling steps are as follows:
Step 1: generate μ_k by using the full conditional posterior distribution in Equation (14).
Step 2: generate λ_k by using the full conditional posterior distribution in Equation (16).
Step 3: generate the inverse-Wishart random matrix Ψ_k by using the full conditional posterior distribution in Equation (17).
Step 4: generate independent q-variate truncated normal random variables f_ki by using the full conditional posterior distribution in Equation (18).
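The steps above cycle through full conditionals, discarding a burn-in before collecting draws. The following is only a hedged sketch of that mechanic, not the HSSMN sampler itself: a minimal two-block Gibbs sampler for a toy univariate normal model with unknown mean μ and precision τ, under illustrative conjugate priors μ ∼ N(0, 100) and τ ∼ Gamma(0.1, 0.1).

```python
import numpy as np

# Toy Gibbs loop with the same shape as Steps 1-4: alternate draws from
# full conditionals, discard a burn-in, keep the rest. Priors and data
# are illustrative assumptions, not the paper's HSSMN model.
rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=200)
n, ybar = len(data), data.mean()

M, N = 500, 5000          # burn-in period and run length, as in the paper's notation
mu, tau = 0.0, 1.0
draws = []
for t in range(M + N):
    # full conditional of mu | tau, data (normal-normal conjugacy)
    prec = 1 / 100 + n * tau
    mu = rng.normal((n * tau * ybar) / prec, np.sqrt(1 / prec))
    # full conditional of tau | mu, data (gamma conjugacy; numpy uses scale = 1/rate)
    sse = np.sum((data - mu) ** 2)
    tau = rng.gamma(0.1 + n / 2, 1 / (0.1 + sse / 2))
    if t >= M:
        draws.append((mu, tau))

mu_hat = np.mean([d[0] for d in draws])
print(mu_hat)   # close to the true mean 2.0
```

In the HSSMN sampler, the μ_0k, Σ_0k and η_ki updates would additionally require the Metropolis–Hastings steps described below, since their conditionals are not of known form.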
When one conducts a posterior inference of the HSSMN (Θ(k)) model using the samples obtained from the MCMC sampling algorithm, the following points should be noted.
(i) See, e.g., [18] for sampling methods for η_ki from various mixing distributions g(η_ki) of the SMN family, such as the multivariate t, multivariate logit, multivariate stable and multivariate exponential power models. (ii) Suppose the HSSMN(Θ(k)) model involves unknown μ_0k. Then, as indicated by the full conditional posterior of μ_0k in Equation (15), the complexity of the conditional distribution prevents straightforward Gibbs sampling. Instead, we may use a simple random walk Metropolis algorithm with a symmetric normal proposal density q(μ*_0k | μ_0k) = q(|μ*_0k − μ_0k|) to sample from the conditional distribution; that is, given the current point μ_0k, the candidate point is μ*_0k ∼ N_q(μ_0k, D), where the diagonal matrix D should be tuned so that the acceptance rate of the candidate point is around 0.25 (see, e.g., [26]). (iii) When the HSSMN(Θ(k)) model involves unknown Σ_0k, the MCMC sampling algorithm using the full conditional posterior in Equation (19) is not straightforward, because the conditional posterior density is of unknown and complex form. Instead, we may apply a Metropolized hit-and-run algorithm, described by [27], to sample from the conditional posterior of Σ_0k. (iv) One can easily calculate the posterior estimate of Θ_k = (μ*_k, Σ*_k) from that of Θ(k) by using the re-parameterizing relations.

The Predictive Classification Rule
Suppose we have K populations π_k, k = 1, ..., K, each specified by the HSSMN(Θ(k)) model. For each population, we have a screened training sample D_k comprised of a set of independent observations {x_ki, i = 1, ..., n_k} whose population label is z_ki = k. Let x be assigned to one of the K populations, with prior probability p_k of belonging to π_k, where Σ_{k=1}^K p_k = 1. Then, the predictive density of x given D under the HSSMN(Θ(k)) model is given by Equation (23), and the posterior probability that x belongs to π_k, i.e., p(z = k | D, x), is given by Equation (24), where D = ∪_{k=1}^K D_k, p(x | Θ(k)) is equal to Equation (6) and p(Θ(k) | D) is the joint posterior density given in Equation (13). We see from Equation (24) that the total posterior probability of misclassifying x from π_i to π_j, i ≠ j, is defined by Equation (25). The misclassification error is minimized if we choose j so as to minimize Equation (25); that is, we select the k that gives the maximum posterior probability p(x ∈ π_k | D, x) (see, e.g., Theorem 6.7.1 of [28], p. 234). Thus, an optimal Bayesian predictive discrimination rule that minimizes the classification error is to classify x into π_k if x ∈ R_k, where the optimal classification region R_k is given by Equation (26). Since we are unable to obtain an analytic solution of Equation (26), a numerical approach is required. Thus, we use the MCMC method of the previous section to draw samples from the posterior density of the parameters, p(Θ(k) | D), and approximate the predictive density in Equation (23) by Equation (27), where the Θ_k^t's are posterior samples generated from the MCMC process under the HSSMN(Θ(k)) model, and M and N_k are the burn-in period and run length, respectively.
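An Equation (27)-style approximation simply averages the sampling density over posterior draws and then classifies x to the class with the largest prior-weighted predictive density. A minimal sketch, assuming two univariate Gaussian classes with known unit variance and fabricated "posterior draws" of the class means (all illustrative, not the HSSMN posterior):

```python
import numpy as np
from scipy.stats import norm

# Illustrative assumption: pretend posterior samples of each class mean,
# concentrated near 0 (class 1) and 3 (class 2); unit variance known.
rng = np.random.default_rng(4)
post_mu = {1: rng.normal(0.0, 0.1, size=1000),
           2: rng.normal(3.0, 0.1, size=1000)}
priors = {1: 0.5, 2: 0.5}

def predictive_density(x, k):
    # (1/T) * sum_t p(x | theta_t): Monte Carlo average over posterior draws
    return norm.pdf(x, loc=post_mu[k], scale=1.0).mean()

def classify(x):
    # optimal rule: maximize the prior-weighted predictive density
    return max(priors, key=lambda k: priors[k] * predictive_density(x, k))

print(classify(0.2), classify(2.9))   # expect classes 1 and 2
```

The classification regions R_k are exactly the sets where one class's weighted predictive density dominates all others, which is what `classify` evaluates pointwise.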
If we assume Dirichlet priors for the p_k (see, e.g., [19], p. 143, for the distributional properties), then the posterior probabilities p(z = k | D) are available in closed form. Thus, the posterior probabilities in Equation (24) and the minimum-error classification region R_k in Equation (26) can be generated within the MCMC scheme, which uses Equation (27) to approximate the predictive densities involved in Equations (24) and (26).
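The closed form here is the standard Dirichlet-multinomial update: with p ∼ Dirichlet(a_1, ..., a_K) and n_k training cases observed in class k, the posterior class probability is (a_k + n_k) / (Σ_j a_j + n). A small sketch with illustrative hyperparameters and counts:

```python
# Standard Dirichlet-multinomial update for the class priors p_k:
# posterior P(z = k | D) = (a_k + n_k) / (sum_j a_j + sum_j n_j).
def posterior_class_probs(a, counts):
    total = sum(a) + sum(counts)
    return [(ak + nk) / total for ak, nk in zip(a, counts)]

# Illustrative assumption: symmetric Dirichlet(1, 1, 1) prior and
# training counts 20/50/30 across the three classes.
print(posterior_class_probs([1, 1, 1], [20, 50, 30]))
```

With equal sample sizes n_1 = ... = n_K and a symmetric prior, this reduces to equal class probabilities, consistent with the p_k = 1/K choice used later in the simulation study.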

Simulation Study
This section presents the results of a simulation study showing the convergence of the MCMC algorithm and the performance of the BPDA_SSMN. Simulation of the training sample observations, model estimation by the MCMC algorithm and a comparison of classification results among three BPDA methods were implemented in R. The three methods consist of two proposed BPDA_SSMN methods (i.e., BPDA_RSN and BPDA_RSt, for classifying RSN and RSt populations) and the BPDA_N of [11] (for classifying unscreened normal populations). Some of the trace plots from an MCMC run are provided in Figure 1. Each plot exhibits a parallel zone centered near the true parameter value of interest, with no obvious trend or periodicity. These plots and the small MC error values listed in Table 1 support the convergence of the MCMC algorithm. For a formal diagnostic check, we calculated the Brooks and Gelman diagnostic statistic R_c (the adjusted shrinkage factor introduced by [31]) using MCMC runs with three chains in parallel, each starting from different initial values. The calculated R_c value for each parameter is listed in the 10th column of Table 1. Table 1 shows that all of the R_c values are close to one, indicating convergence of the MCMC algorithm. For another formal diagnostic check, we applied the Heidelberger-Welch diagnostic tests of [32] to the single-chain MCMC runs used to plot Figure 1. These consist of a stationarity test and a half-width test for the MCMC runs of each parameter. The 11th column of Table 1 lists the p-value of the test for stationarity of the single Markov chain, where all of the p-values are larger than 0.1. Furthermore, all of the half-width tests, testing the convergence of the Markov chain of a single parameter, were passed. Thus, all of the diagnostic checks (formal and informal) support the convergence of the proposed MCMC algorithm, and hence, we can say that it generates an MCMC sample
that comes from the marginal posterior distributions of interest (i.e., the SSMN population parameters). Similar estimation results to those in Table 1 apply to the posterior summaries of the other parameters of the π_2 and π_3 distributions. According to these simulation results, we can say that the MCMC algorithm constructed in Section 3.3 provides an efficient method for estimating the SSMN distributions. To achieve this quality of the MCMC algorithm in higher-dimensional cases (with large p and/or q values), the diagnostic tests considered in this section should be used to monitor the convergence of the algorithm; for more details, see [30].

A Simulation Study: Performance of the Predictive Methods
This simulation study compares the performance of the three BPDA methods using training samples generated from three rectangle-screened populations π_k (k = 1, 2, 3). The three methods compared are BPDA_RSN, BPDA_RSt with degrees of freedom ν = 5, and BPDA_N (a standard predictive method with no screening). Two different cases of rectangle-screened population distributions were used to generate the training samples. One case is the rectangle-screened population π_k with the RSN_p(C_q(α, β); μ*_k, Σ*_k) distribution. The other case is π_k with the RSt_p(C_q(α, β); μ*_k, Σ*_k, ν = 5) distribution, used to examine the robustness of BPDA_RSt in discriminating observations from heavy-tailed empirical distributions. For each case, we obtained 200 sets of training and validation (or testing) samples of each size n_k = 20, 50, 100, generated from the rectangle-screened distribution of π_k; they are denoted by D_k(i) and V_k(i), i = 1, ..., 200. The i-th validation sample V_k(i) corresponding to the training sample D_k(i) was simply obtained by setting V_k(i) = D_k(i − 1), i = 1, ..., 200. The parameter values of the screened population distributions of the three populations π_k were specified for p = 2, 5, q = 2 and k = 1, 2, 3.
Further, we assumed that the parameters μ_0k and Σ_0k of the underlying q-dimensional screening vector v_0 and the rectangle screening region C_q(α, β) were known, as given above. Thus, we may investigate the performance of the BPDA methods by varying the values of the correlation ρ, the dimension p of the predictor vector, the rectangle-screened region and the differences among the three population means and covariance matrices, whose expressions can be found in [4]. Here, 1_r is an r × 1 summing vector whose every element is unity, and J_{p×q} denotes a p × q matrix (with q = 2) whose every odd row is equal to (1, 0) and every even row is (0, 1). Using the training samples, we calculated the approximate predictive densities in Equation (27) by the MCMC algorithm proposed in Section 3.3. In this calculation, we assumed that p_k = 1/3, because n_1 = n_2 = n_3 = n. Thus, the posterior probabilities in Equation (24) and the minimum-error classification region R_k in Equation (26) can be estimated within the MCMC scheme, which uses Equation (27) to approximate the predictive densities involved in both Equations (24) and (26). Then, we estimated the classification error rates of the three BPDA methods by using the validation samples V_k(i), i = 1, ..., 200. To apply the BPDA_RSN and BPDA_RSt methods to classifying the simulated samples, we used the optimal classification rule based on Equation (26), while we used the posterior odds ratio given in [11] to implement the BPDA_N method. We then compared the classification results in terms of error rates. The error rate of each population (ER_{π_k}) and the total error rate (TotalER) were estimated by ER_{π_k} = n*_k / n_k and TotalER = Σ_k n*_k / Σ_k n_k, where n*_k is the number of misclassified observations out of the n_k validation sample observations from π_k. For each case of π_k distributions, the above procedure was implemented on each set of 200 validation samples to evaluate the error rates of the BPDA methods. Here, [Case 1] denotes that the training (and validation) samples were generated from π_k: RSN_p(C_q(α, β); μ*_k, Σ*_k), and [Case 2] indicates that they were generated from π_k: RSt_p(C_q(α, β); μ*_k, Σ*_k, ν = 5). For each case, Table 2 compares the mean classification error rates obtained from the 200 replicated classifications using the BPDA methods. The error rates and their standard errors in Table 2 indicate the following. (i) Both the BPDA_RSN and BPDA_RSt methods work reasonably well in classifying screened observations, compared to the BPDA_N method. This implies that, in BPDA, they provide better classification results than the BPDA_N, provided that the π_k's are screened by a rectangle screening scheme. (ii) The performance of the BPDA_SSMN methods improves as the correlation (ρ) between the screening variables and the predictor variables becomes larger. (iii) Comparing the error rates with respect to the values of α, we see that the BPDA_SSMN methods tend to yield better performance in the discrimination of data screened by a small rectangle-screened region. (iv) The performance of all three BPDA methods improves as the differences among the means increase. (v) An increase in the dimension p and the training sample size n also tends to yield better performance of the BPDA methods. (vi) As expected, the performance of the BPDA_RSN in [Case 1] is better than that of the other two methods, since the estimates of its error rates are not covered by the corresponding two standard errors. Further, the considerable gain in error rates over the BPDA_N demonstrates the utility of the BPDA_RSN in discriminant analysis. (vii) As for [Case 2], the table indicates that the performance of the BPDA_RSt is better than that of the two other methods. This demonstrates the robustness of the BPDA_RSt method in discrimination with screened and heavy-tailed data.
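The error-rate estimates used above reduce to simple counts: ER_{π_k} is the fraction of validation cases from π_k that the rule misclassifies, and TotalER pools the counts across classes. A small sketch with made-up labels (purely illustrative, not the simulation's data):

```python
# Per-class and total error rates from true and predicted labels.
# ER_k = n*_k / n_k and TotalER = sum_k n*_k / sum_k n_k, where n*_k is
# the number of misclassified validation observations from class k.
def error_rates(true_labels, predicted_labels, K):
    n_star = [0] * K   # misclassified counts per class
    n = [0] * K        # validation sample sizes per class
    for z, zhat in zip(true_labels, predicted_labels):
        n[z] += 1
        if zhat != z:
            n_star[z] += 1
    per_class = [n_star[k] / n[k] for k in range(K)]
    total = sum(n_star) / sum(n)
    return per_class, total

# Illustrative labels: two validation cases per class, two misclassified.
per_class, total = error_rates([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
print(per_class, total)   # [0.5, 0.0, 0.5] and 1/3
```

Averaging these rates over the 200 replicated validation sets gives the mean error rates reported in Table 2.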

Conclusions
In this paper, we proposed an optimal predictive method (BPDA) for the discriminant analysis of multidimensional screened data. To incorporate prior information about the screening mechanism flexibly in the analysis, we introduced the SSMN models. We then provided the HSSMN(Θ(k)) model for Bayesian inference on the SSMN populations from which the screened data were generated. Based on the HSSMN(Θ(k)) model, the posterior distributions of Θ(k) were derived, and the calculation of the optimal predictive classification rule was carried out by an efficient MCMC method. Numerical studies with simulated screened observations illustrated the convergence of the MCMC method and the usefulness of the BPDA.
The methodological results of the Bayesian estimation procedure proposed in this paper can be extended to other multivariate linear models that incorporate non-normal errors, a general covariance matrix and truncated random covariates. For example, the seemingly unrelated regression (SUR) model and the factor analysis model (see, e.g., [19]) can be cast in the same framework as the proposed HSSMN(Θ(k)) model in Equation (12). The former is a special case of the HSSMN(Θ(k)) model in which the Z_k's are observable as predictors. Therefore, when the regression errors are non-normal, it would be plausible to apply the proposed approach using the HSSMN(Θ(k)) model to obtain a robust SUR model, whereas the latter is a natural extension of the oblique factor analysis model to the case of non-normal measurement errors. The HSSMN(Θ(k)) model can also be extended to accommodate missing values, as done in other models by [33,34]. We hope to address these issues in the near future.

R_k : p(x | D, z = k) p(z = k | D) > p(x | D, z = j) p(z = j | D), for all j ≠ k; k = 1, ..., K, (26)
where p(z = k | D) is the posterior probability of population π_k given the dataset D. If we assume the values of the p_k's are known a priori, then p(z = k | D) = p_k.

Table 1 .
Posterior summaries of 200 Markov chain Monte Carlo (MCMC) results for the π 1 models.

Table 2 .
Classification error rates: the respective standard errors are in parentheses.