Article

A Bayes Inference for Ordinal Response with Latent Variable Approach

Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX 79968, USA
* Author to whom correspondence should be addressed.
Current address: 500 W University Ave., El Paso, TX 79968, USA.
Stats 2019, 2(2), 321-331; https://doi.org/10.3390/stats2020023
Submission received: 18 May 2019 / Revised: 8 June 2019 / Accepted: 14 June 2019 / Published: 16 June 2019

Abstract

In this paper, we propose a Bayesian model for the analysis of categorical data with ordered outcomes. The method takes a latent variable approach with an informative prior, transformed from a Dirichlet distribution, for the boundary parameters. A simulation study is carried out to assess the performance of the method under various data structures. Our method yields higher predictive accuracy than conventional classification procedures. Real data are analyzed to demonstrate the efficiency of the proposed method.

1. Introduction

Over the past few decades, modeling and prediction of ordinal outcomes have become essential in various fields, especially in the social and economic sciences, where naturally ordered data commonly appear. For example, socioeconomic status is typically broken into three levels (high, middle, and low) to describe the three places into which a family or an individual may fall. The level of education consists of high school, bachelor's, master's, and doctoral degrees. These levels or classes can be viewed as ordinal variables, but with no scale or magnitude available between categories. Numerous methods for analyzing such ordinal data have been introduced and discussed by many researchers, [1,2,3] to name a few. One well-known method is polytomous ordinal logistic regression (POLR), or the cumulative logit model, initially proposed by [4] and later called the proportional odds model [2], since the same proportionality constant applies to all cumulative logits. For a multinomial response variable $Z$ with $J$ possible ordered categorical outcomes and an associated $p$-dimensional vector of covariates $\mathbf{x}$, the cumulative probability of $Z$ given $\mathbf{x}$ is:
$$P(Z \le j \mid \mathbf{x}) = \frac{\exp(\alpha_j + \mathbf{x}'\boldsymbol{\beta})}{1 + \exp(\alpha_j + \mathbf{x}'\boldsymbol{\beta})}, \quad j = 1, 2, \ldots, J-1, \tag{1}$$
or the cumulative logit form as:
$$\log \frac{P(Z \le j \mid \mathbf{x})}{P(Z > j \mid \mathbf{x})} = \alpha_j + \mathbf{x}'\boldsymbol{\beta}, \quad j = 1, 2, \ldots, J-1, \tag{2}$$
where $\alpha_j$ is an unknown intercept parameter associated with the $j$th category and $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)'$ is the vector of effect coefficients common across categories. The POLR models the cumulative probabilities $P(Z \le j)$ rather than the category-specific probabilities $P(Z = j)$ as in nominal logistic regression. Applying a nominal analysis to ordinal data leads to underestimation of the variation and a considerable loss of information [1]. Other relevant approaches for ordinal outcomes include the adjacent-category logit and sequential logit models. A number of researchers have proposed inference procedures for logistic regression and related methods, from classical approaches [2,5] to Bayesian perspectives [6,7]. A comprehensive review of the analysis of ordered categorical data can be found in [1]. Another commonly-adopted way of modeling ordinal data is via an underlying continuous latent variable: the observed ordinal response is regarded as a crude measurement of a continuous variable falling into an interval on the real line. The work in [3] applied a Bayesian latent variable model to investigate the effect of a binary treatment on an ordinal outcome of interest. A fully-Bayesian method for modeling polychotomous ordinal categories was developed in [6] using the data augmentation approach. Some issues arise with these approaches, one of which is the estimation of the cutpoint parameters defining the interval boundaries. Incorporating a vague prior on these parameters, the work in [6] proposed a probit model in a Bayesian framework, which converges slowly for large sample sizes because of inefficient sampling of the cutoff point parameters. The work in [7] suggested a hybrid Gibbs/Metropolis–Hastings (MH) sampling scheme that updates these parameters jointly with the other parameters. Although this approach reduces the high autocorrelations in the sampling, computing the acceptance probability in the MH sampler may require joint cumulative probabilities of multivariate distributions, which is generally intractable. To avoid these difficulties, the work in [8] proposed a probit model whose latent variables follow a mixture model, which can characterize the ordinality of the data without estimating these parameters.
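For concreteness, the following minimal sketch (in Python, with illustrative parameter values of our own; nothing here is taken from the paper) evaluates the cumulative and category-specific probabilities defined by the proportional odds model in (1) and (2):

```python
import numpy as np

def cumulative_probs(x, alphas, beta):
    """P(Z <= j | x) for j = 1, ..., J-1 under the proportional odds model."""
    eta = np.asarray(alphas) + x @ beta          # alpha_j + x'beta for each j
    return 1.0 / (1.0 + np.exp(-eta))            # inverse logit

def category_probs(x, alphas, beta):
    """P(Z = j | x): differences of consecutive cumulative probabilities."""
    cum = np.concatenate(([0.0], cumulative_probs(x, alphas, beta), [1.0]))
    return np.diff(cum)

# illustrative values: J = 3 categories, p = 2 covariates
alphas = np.array([-1.0, 1.0])                   # alpha_1 < alpha_2
beta = np.array([0.8, -0.5])
print(category_probs(np.array([1.0, 2.0]), alphas, beta))  # sums to 1
```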
In this paper, we propose informative priors for the parameters of the probit model, in particular a prior for the boundaries associated with the category responses based on a Dirichlet distribution via a parameter transformation. With appropriate hyperparameter values, the resulting posterior distributions yield a sampling algorithm with fast convergence and efficient estimation. The rest of the article is arranged as follows. Section 2 presents our Bayesian inference procedure for the probit model with the latent variable approach. Subsequently, we carry out simulation studies to investigate the performance of the proposed method in Section 3. For illustrative purposes, two real datasets are analyzed in Section 4, followed by some concluding remarks in Section 5.

2. Bayesian Method for the Probit Model

We first briefly introduce the probit model with a latent variable and then present a Bayesian procedure for the analysis. Let $(\mathbf{Z}, \mathbf{X})$ denote the observed data, where the $n$-vector $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_n)'$ contains the $n$ ordered categorical outcomes and $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)'$ is the $n \times p$ matrix of covariates, with $\mathbf{x}_i$ being the $i$th covariate vector, whose first element is 1. Each response $Z_i$ takes one of the $J$ values $1, \ldots, J$ and is associated with the covariate vector $\mathbf{x}_i$ through a latent continuous variable $y_i$ in the following linear regression model:
$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n, \tag{3}$$
where $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)'$ is a $p \times 1$ vector of regression coefficients and $\sigma^2$ is set to 1 to make the model identifiable. The correspondence between the observed outcome $z_i$ and the latent variable $y_i$ is defined by:
$$z_i = j \quad \text{if} \quad \delta_{j-1} < y_i \le \delta_j, \quad j = 1, \ldots, J, \tag{4}$$
where the boundaries $\delta_j$ are unknown and $-\infty = \delta_0 < \delta_1 < \cdots < \delta_{J-1} < \delta_J = \infty$.
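To make the generative mechanism concrete, here is a short sketch (with assumed values for $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$; these are not the paper's experimental settings) that simulates ordinal outcomes from this latent variable model:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, J = 90, 4, 3
beta = np.array([0.5, 1.0, -1.0, 0.8])           # assumed coefficients
delta = np.array([-np.inf, -0.5, 0.7, np.inf])   # -inf = delta_0 < ... < delta_J = inf

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column is 1
y = X @ beta + rng.normal(size=n)                # latent variable, sigma^2 = 1
z = np.searchsorted(delta[1:-1], y) + 1          # z_i = j iff delta_{j-1} < y_i <= delta_j
```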

2.1. Prior Specification

We chose independent priors for the parameters $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$. First, we specify a conjugate, multivariate normal, prior for the regression coefficients:
$$\boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \Sigma_0), \tag{5}$$
with the $p$-vector mean $\boldsymbol{\beta}_0$ and $p \times p$ variance-covariance matrix $\Sigma_0$ being hyperparameters. To set a prior for the boundaries $\boldsymbol{\delta} = (\delta_1, \delta_2, \ldots, \delta_{J-1})$, first suppose that there is a continuous distribution function $F(\cdot)$ whose domain lies in $(-\infty, \infty)$, for example a normal distribution function, such that the coverage probability of each interval is $p_j = P(\delta_{j-1} < \Delta \le \delta_j) = F(\delta_j) - F(\delta_{j-1})$, $j = 1, 2, \ldots, J$, so that $\sum_{j=1}^{J} p_j = 1$. It follows that:
$$\begin{aligned} p_1 &= F(\delta_1) & \delta_1 &= F^{-1}(p_1) \\ p_2 &= F(\delta_2) - F(\delta_1) & \delta_2 &= F^{-1}(p_1 + p_2) \\ &\;\;\vdots & &\;\;\vdots \\ p_{J-1} &= F(\delta_{J-1}) - F(\delta_{J-2}) \qquad & \delta_{J-1} &= F^{-1}(p_1 + p_2 + \cdots + p_{J-1}). \end{aligned} \tag{6}$$
Second, a Dirichlet prior distribution is placed on $(p_1, p_2, \ldots, p_J)$, that is, the prior density is $\pi(p_1, p_2, \ldots, p_J \mid \boldsymbol{\gamma}) = \frac{1}{B(\boldsymbol{\gamma})} \prod_{j=1}^{J} p_j^{\gamma_j - 1}$, with positive hyperparameters $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_J)$ and the Beta coefficient $B(\boldsymbol{\gamma}) = \prod_{j=1}^{J} \Gamma(\gamma_j) / \Gamma(\sum_{j=1}^{J} \gamma_j)$, where $\Gamma(\cdot)$ is the gamma function. Thus, by the transformation in (6), the prior of $\boldsymbol{\delta} = (\delta_1, \delta_2, \ldots, \delta_{J-1})$ becomes:
$$\pi(\delta_1, \delta_2, \ldots, \delta_{J-1} \mid F, \boldsymbol{\gamma}) = \frac{1}{B(\boldsymbol{\gamma})} \prod_{j=1}^{J} \left[ F(\delta_j) - F(\delta_{j-1}) \right]^{\gamma_j - 1} \prod_{j=1}^{J-1} f(\delta_j), \tag{7}$$
where $f(\cdot)$ is the density function of $F(\cdot)$. Clearly, expression (7) can be regarded as the joint density of the order statistics $(\delta_1, \delta_2, \ldots, \delta_{J-1})$, with rank indices determined by $(\gamma_1, \gamma_2, \ldots, \gamma_J)$, drawn from the distribution function $F(\cdot)$ (see, for example, [9]).
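The transformation is straightforward to implement. The sketch below (hyperparameter values assumed, anticipating the normal choice of $F$ described in Section 2.3) draws coverage probabilities from a Dirichlet distribution and maps their cumulative sums through $F^{-1}$ to obtain an ordered set of boundaries:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
gamma = np.array([30.0, 30.0, 30.0])   # hyperparameters, e.g., category counts
sigma0 = 50.0                          # scale of F = N(0, sigma0^2), see Section 2.3

p = rng.dirichlet(gamma)               # (p_1, ..., p_J), sums to 1
delta = norm.ppf(np.cumsum(p)[:-1], scale=sigma0)   # delta_j = F^{-1}(p_1 + ... + p_j)
# cumulative sums are increasing, so delta_1 < ... < delta_{J-1} automatically
```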

2.2. Posterior Inference

The prior beliefs are then updated with information from the data to lead to the following joint posterior distribution:
$$\pi(\boldsymbol{\beta}, \boldsymbol{\delta} \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z}) \propto L(\boldsymbol{\beta}, \boldsymbol{\delta} \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z})\, \pi(\boldsymbol{\beta})\, \pi(\boldsymbol{\delta}), \tag{8}$$
where $\mathbf{Y} = (y_1, y_2, \ldots, y_n)'$ is the vector of latent variables and the likelihood function is $L(\boldsymbol{\beta}, \boldsymbol{\delta} \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \prod_{i=1}^{n} f_Y(y_i \mid \mathbf{x}_i, \boldsymbol{\beta})\, I(\delta_{j-1} < y_i \le \delta_j, z_i = j)$, with $f_Y(\cdot)$ being the density function of $N(\mathbf{x}_i'\boldsymbol{\beta}, 1)$ and $I(\cdot)$ the indicator function. Then, the conditional posteriors are:
$$\pi(\boldsymbol{\beta} \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\delta}) \propto L(\boldsymbol{\beta}, \boldsymbol{\delta} \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z})\, \pi(\boldsymbol{\beta}), \ \text{resulting in}\ \boldsymbol{\beta} \mid (\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\delta}) \sim N(\tilde{\boldsymbol{\beta}}, \tilde{\Sigma}), \tag{9}$$
$$\pi(\boldsymbol{\delta} \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\beta}) \propto \pi(\boldsymbol{\delta}) \times \prod_{i=1}^{n} I(\delta_{j-1} < y_i \le \delta_j, z_i = j), \tag{10}$$
with posterior mean vector $\tilde{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X} + \Sigma_0^{-1})^{-1}(\mathbf{X}'\mathbf{Y} + \Sigma_0^{-1}\boldsymbol{\beta}_0)$ and variance-covariance matrix $\tilde{\Sigma} = (\mathbf{X}'\mathbf{X} + \Sigma_0^{-1})^{-1}$. The conditional posterior of $\boldsymbol{\delta}$ in (10) is a truncated joint density of the order statistics $(\delta_1, \delta_2, \ldots, \delta_{J-1})$. To draw its posterior samples conveniently, we explore the conditional posterior distribution of each component $\delta_j$. Let $\boldsymbol{\delta}_{(j)}$ be the vector $\boldsymbol{\delta}$ without the $j$th element, that is, $\boldsymbol{\delta}_{(j)} = (\delta_1, \ldots, \delta_{j-1}, \delta_{j+1}, \ldots, \delta_{J-1})$; it follows that the conditional posterior of $\delta_j$ is:
$$\pi(\delta_j \mid \mathbf{X}, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\delta}_{(j)}) \propto \left[ F(\delta_j) - F(\delta_{j-1}) \right]^{\gamma_j - 1} \left[ F(\delta_{j+1}) - F(\delta_j) \right]^{\gamma_{j+1} - 1} f(\delta_j) \times I(c_{j,1} < \delta_j < c_{j,2}), \quad j = 1, 2, \ldots, J-1, \tag{11}$$
where $c_{j,1} = \max\{ y_i,\, i = 1, 2, \ldots, n : z_i = j \}$, $c_{j,2} = \min\{ y_i,\, i = 1, 2, \ldots, n : z_i = j+1 \}$, $j = 1, 2, \ldots, J-1$, and $F(\delta_0) = 0$, $F(\delta_J) = 1$. Therefore, conditionally, $\delta_j$ is a random variable whose transform $F(\delta_j)$ follows a Beta distribution $\text{Beta}(\gamma_j, \gamma_{j+1})$, shifted by $F(\delta_{j-1})$ and scaled by $[F(\delta_{j+1}) - F(\delta_{j-1})]$, truncated to the interval $[F(c_{j,1}), F(c_{j,2})]$, or equivalently:
$$\frac{F(\delta_j) - F(\delta_{j-1})}{F(\delta_{j+1}) - F(\delta_{j-1})} \,\Big|\, (\delta_{j-1}, \delta_{j+1}) \sim \text{Beta}(\gamma_j, \gamma_{j+1}) \ \text{truncated at} \ \left[ \frac{F(c_{j,1}) - F(\delta_{j-1})}{F(\delta_{j+1}) - F(\delta_{j-1})},\ \frac{F(c_{j,2}) - F(\delta_{j-1})}{F(\delta_{j+1}) - F(\delta_{j-1})} \right]. \tag{12}$$
The method that we propose here is closely related to the approach presented in [10] for multinomial probit models. In this context, however, the correspondence between $Z_i$ and $Y_i$ uses different boundaries that account for the natural ordering of the outcome. We performed posterior inference using Markov chain Monte Carlo (MCMC) techniques. Specifically, a Gibbs sampler based on the above conditional posteriors was adopted: starting from a set of initial parameter values, the following steps were repeated $M$ times, where, given the values of $\mathbf{Y}^{(k)}$, $\boldsymbol{\beta}^{(k)}$, and $\boldsymbol{\delta}^{(k)}$ at the $k$th iteration, the $(k+1)$th iteration proceeds as follows (a code sketch of one full sweep is given after this list):
(1)
Update the latent vector $\mathbf{Y}$ from its posterior distribution given $(\boldsymbol{\beta}, \boldsymbol{\delta}, \mathbf{X}, \mathbf{Z})$, each element of which is a truncated normal under the constraints defined in Equation (4):
$$Y_i^{(k+1)} \mid (\boldsymbol{\beta}^{(k)}, \boldsymbol{\delta}^{(k)}, \mathbf{X}, \mathbf{Z}) \sim N(\mathbf{x}_i'\boldsymbol{\beta}^{(k)}, 1), \quad \delta_{j-1}^{(k)} < Y_i^{(k+1)} \le \delta_j^{(k)} \ \text{if} \ Z_i = j, \quad j = 1, \ldots, J, \ i = 1, 2, \ldots, n. \tag{13}$$
(2)
Update the regression coefficients $\boldsymbol{\beta}$ from their posterior distribution in Equation (9) under the updated $\mathbf{Y}^{(k+1)}$.
(3)
Update the boundary parameters $\delta_j$ from their posterior densities given in Equation (12), with $c_{j,1}$ and $c_{j,2}$ evaluated at $\mathbf{Y}^{(k+1)}$. Specifically, first draw $d_j^{(k+1)}$ from the truncated Beta in Equation (12), and then set $\delta_j^{(k+1)} = F^{-1}(u_j^{(k+1)})$, where $u_j^{(k+1)} = F(\delta_{j-1}^{(k+1)}) + [F(\delta_{j+1}^{(k)}) - F(\delta_{j-1}^{(k+1)})]\, d_j^{(k+1)}$, $j = 1, 2, \ldots, J-1$.
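The following is a minimal sketch of one full Gibbs sweep, assuming the hyperparameter choices of Section 2.3 ($\boldsymbol{\beta}_0 = \mathbf{0}$, $\Sigma_0 = c\mathbf{I}$, $F$ a $N(0, \sigma_0^2)$ distribution function); the function and variable names are ours. The cutpoint update uses inverse-cdf sampling from the truncated Beta in Equation (12).

```python
import numpy as np
from scipy.stats import truncnorm, norm
from scipy.stats import beta as beta_dist

def gibbs_sweep(b, delta, X, z, gamma, c=10.0, sigma0=50.0, rng=None):
    """One Gibbs sweep; delta has length J+1 with -inf/+inf endpoints.
    Assumes every category 1..J appears at least once in z."""
    rng = rng if rng is not None else np.random.default_rng()
    n, p = X.shape
    J = len(delta) - 1
    F = lambda d: norm.cdf(d, scale=sigma0)          # transformed distribution F
    Finv = lambda u: norm.ppf(u, scale=sigma0)

    # Step 1: latent Y_i ~ N(x_i'b, 1) truncated to (delta_{z_i-1}, delta_{z_i}]
    mu = X @ b
    y = mu + truncnorm.rvs(delta[z - 1] - mu, delta[z] - mu, random_state=rng)

    # Step 2: beta ~ N(beta_tilde, Sigma_tilde) with beta_0 = 0, Sigma_0 = c I
    Sig = np.linalg.inv(X.T @ X + np.eye(p) / c)
    b = rng.multivariate_normal(Sig @ (X.T @ y), Sig)

    # Step 3: each cutpoint via the truncated Beta of Equation (12)
    for j in range(1, J):
        c1, c2 = y[z == j].max(), y[z == j + 1].min()
        a0, a1 = F(delta[j - 1]), F(delta[j + 1])
        B = beta_dist(gamma[j - 1], gamma[j])        # Beta(gamma_j, gamma_{j+1})
        lo, hi = B.cdf((F(c1) - a0) / (a1 - a0)), B.cdf((F(c2) - a0) / (a1 - a0))
        d = B.ppf(rng.uniform(lo, hi))               # inverse-cdf truncated draw
        delta[j] = Finv(a0 + (a1 - a0) * d)
    return y, b, delta
```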

2.3. Hyperparameter Settings

The prior on $\boldsymbol{\beta}$ depends on the mean $\boldsymbol{\beta}_0$ and covariance matrix $\Sigma_0$; the work in [11] discussed the relative merits and drawbacks of different specifications. Here, we set $\boldsymbol{\beta}_0 = \mathbf{0}$ and $\Sigma_0 = c\mathbf{I}$, which is easier to calibrate. The parameter $c$ regulates the amount of shrinkage in the model. In general, we want to avoid very small values of $c$, which cause too much regularization, and large values, which can induce nonlinear shrinkage as a result of Lindley's paradox [12]. In the context of probit models for classification into nominal groups, the work in [10] provided guidelines on how to choose this hyperparameter value, and we used similar guidelines here. In practice, values of $c$ that provide good mixing of the MCMC sampler, with 25–50% distinct visited models, are appropriate [13]. An informative prior for $\boldsymbol{\delta}$ in (7) can be specified by setting each component $\gamma_j$ of $\boldsymbol{\gamma}$ to the observed count of the $j$th category among the $z_i$'s. Alternatively, a diffuse prior is obtained by setting all $\gamma_j = 1$, expressing no prior belief about $\delta_j$, namely a uniform distribution. For the transformed distribution function $F(\cdot)$, we chose a normal distribution with zero mean and a large scale $\sigma_0$ (for example, $\sigma_0 = 50$) to cover a fairly wide range of $(-\infty, \infty)$, so that $F(\delta) = \Phi(\delta / \sigma_0)$, with $\Phi(\cdot)$ the cumulative distribution function of the standard normal.
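In code, these choices amount to a few lines; the sketch below (with illustrative data) sets the informative $\boldsymbol{\gamma}$ from category counts and defines $F$:

```python
import numpy as np
from scipy.stats import norm

z = np.repeat([1, 2, 3], 30)          # illustrative ordinal responses
gamma = np.bincount(z)[1:]            # informative prior: category counts -> [30, 30, 30]
# gamma = np.ones(3)                  # diffuse alternative: uniform over boundaries

sigma0 = 50.0
F = lambda d: norm.cdf(d / sigma0)    # F(delta) = Phi(delta / sigma0)
```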

2.4. Posterior Prediction

The MCMC procedure results in a list of sampled $\mathbf{Y}$, $\boldsymbol{\beta}$, and $\boldsymbol{\delta}$ vectors. In order to draw posterior inference, we first need to impute the latent vector $\mathbf{Y}$, which can be viewed as missing data. Let $\hat{\mathbf{Y}}$, $\hat{\boldsymbol{\beta}}$, and $\hat{\boldsymbol{\delta}}$ be the estimates obtained by averaging over the sampled $\mathbf{Y}$, $\boldsymbol{\beta}$, and $\boldsymbol{\delta}$ vectors, respectively.
Inference on class prediction can be done in various ways. If a future vector of covariates $\mathbf{x}_f$ is available for validation, the least squares prediction based on a single value of $\boldsymbol{\beta}$ can be computed as:
$$\hat{y}_f = \mathbf{x}_f'\hat{\boldsymbol{\beta}} \quad \text{or} \quad \hat{y}_f = \mathbf{x}_f'\hat{\tilde{\boldsymbol{\beta}}}, \tag{14}$$
where the posterior mean is $\hat{\tilde{\boldsymbol{\beta}}} = (\mathbf{X}'\mathbf{X} + \Sigma_0^{-1})^{-1} \mathbf{X}'\hat{\mathbf{Y}}$. Alternatively, we can use Bayesian model averaging over the posterior samples $\boldsymbol{\beta}^{(k)}$ of a posteriori likely models to estimate $y_f$ as:
$$\hat{y}_f = \sum_{k=1}^{M} \mathbf{x}_f'\boldsymbol{\beta}^{(k)}\, \pi(\boldsymbol{\beta}^{(k)} \mid \mathbf{X}, \hat{\mathbf{Y}}, \mathbf{Z}, \hat{\boldsymbol{\delta}}). \tag{15}$$
The ordered categorical outcomes can then be predicted using the boundary correspondence:
$$\hat{z}_f = j \quad \text{if} \quad \hat{\delta}_{j-1} < \hat{y}_f \le \hat{\delta}_j, \quad j = 1, 2, \ldots, J. \tag{16}$$
In addition, since $y_f \sim N(\mathbf{x}_f'\boldsymbol{\beta}, 1)$, the prediction probability that it falls in each class can be computed through model averaging over the posterior samples $\boldsymbol{\beta}^{(k)}$:
$$P(Z_f = j) \approx \sum_{k=1}^{M} P(\hat{\delta}_{j-1} < Y_f \le \hat{\delta}_j \mid \boldsymbol{\beta}^{(k)})\, \pi(\boldsymbol{\beta}^{(k)} \mid \mathbf{X}, \hat{\mathbf{Y}}, \mathbf{Z}, \hat{\boldsymbol{\delta}}) = \sum_{k=1}^{M} \left[ \Phi\big(\hat{\delta}_j - \mathbf{x}_f'\boldsymbol{\beta}^{(k)}\big) - \Phi\big(\hat{\delta}_{j-1} - \mathbf{x}_f'\boldsymbol{\beta}^{(k)}\big) \right] \pi(\boldsymbol{\beta}^{(k)} \mid \mathbf{X}, \hat{\mathbf{Y}}, \mathbf{Z}, \hat{\boldsymbol{\delta}}), \tag{17}$$
where Φ ( · ) is the distribution function of the standard normal distribution. The class membership can then be predicted by the mode of the predictive distribution:
$$\hat{z}_f = \operatorname*{argmax}_{1 \le j \le J} P(Z_f = j). \tag{18}$$
Furthermore, a less variable estimate can be obtained as the nearest integer to the membership average over the predictive probabilities,
$$\hat{z}_f = [\mu_Z], \quad \text{where} \ \mu_Z = \sum_{j=1}^{J} j\, P(Z_f = j) \tag{19}$$
and $[\cdot]$ denotes rounding to the nearest integer.
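A sketch of the three prediction rules, given posterior draws (the weights `pi_k` stand in for the posterior probabilities $\pi(\boldsymbol{\beta}^{(k)} \mid \cdot)$ used in the model averaging; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def predict_class(x_f, beta_draws, pi_k, delta_hat):
    """Predictions by Eqs. (16), (18), (19); delta_hat has -inf/+inf endpoints."""
    eta = beta_draws @ x_f                               # x_f' beta^(k), one per draw

    # Eq. (16): boundary rule applied to the model-averaged latent prediction
    y_f = np.sum(pi_k * eta)
    z_boundary = np.searchsorted(delta_hat[1:-1], y_f) + 1

    # Eq. (17): model-averaged predictive class probabilities (J-vector)
    probs = (norm.cdf(delta_hat[1:, None] - eta) -
             norm.cdf(delta_hat[:-1, None] - eta)) @ pi_k

    z_mode = np.argmax(probs) + 1                        # Eq. (18): modal class
    z_round = int(np.rint(np.arange(1, probs.size + 1) @ probs))  # Eq. (19)
    return z_boundary, z_mode, z_round
```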

3. Simulation Study

We conducted a simulation study to assess the performance of the proposed Bayesian method. The simulated datasets were drawn from three four-dimensional multivariate normal distributions with means $\boldsymbol{\mu}_1 = [3, 2, 4, 1]'$, $\boldsymbol{\mu}_2 = [3, 2, 4, 1]'$, $\boldsymbol{\mu}_3 = [3, 2, 4, 1]'$ and equicorrelated variance-covariance matrices $\Sigma = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{1}\mathbf{1}']$, where $\sigma = 2$ and the increasingly-ordered correlations $\rho = 0.1, 0.5, 0.9$ for the three structures correspond to the ordered responses $z = 1, 2, 3$, respectively; $\mathbf{I}$ is the identity matrix and $\mathbf{1}$ is the vector of ones. We simulated 30 observations $\mathbf{x} = (x_1, x_2, x_3, x_4)'$ from each multivariate normal, for a total sample size of 90. The structure of the data may be an essential aspect of efficient classification. In order not to draw erroneous conclusions from the predicted error rates, we critically examined the nature of the data to explore the variation existing among groups. In a model for predicting ordinal outcomes, groups with small within-group variation and well-separated locations are naturally easier to classify, that is, to predict correctly into which category a particular observation falls. In this vein, we visualized the data structure using the tool for displaying data concentration in [14], which proposed the $p$-dimensional ellipsoid $E_d$ of size ("radius") $d$, defined as the set of all points $\mathbf{X}$ in a contour:
$$E_d(\bar{\mathbf{X}}, \mathbf{S}) = \{ \mathbf{X} : (\mathbf{X} - \bar{\mathbf{X}})'\mathbf{S}^{-1}(\mathbf{X} - \bar{\mathbf{X}}) \le d^2 \}. \tag{20}$$
Clearly, $E_d$ corresponds to the set of points whose Mahalanobis distances $D^2 = (\mathbf{X} - \bar{\mathbf{X}})'\mathbf{S}^{-1}(\mathbf{X} - \bar{\mathbf{X}})$, with sample covariance matrix $\mathbf{S}$ and sample centroid $\bar{\mathbf{X}} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)'$, are less than or equal to $d^2$. For multivariate normal variables, the data ellipsoid approximates a contour of constant density of their joint distribution, and $D^2$ asymptotically follows the chi-squared distribution with $p$ degrees of freedom, $\chi^2_p$. The work in [15] elaborated further on the properties of data ellipsoids and their use in a wide variety of problems and applications. Taking $d^2 = \chi^2_{0.05, 2} = 5.99$, the two-dimensional pairwise ellipses are shown in Figure 1, where each ellipse encloses approximately 95% of the data points under normal theory. The plot indicates some overlap between Groups 1 and 2, while Group 3 was relatively well-separated from the other two. Thus, we expected a somewhat high classification error rate, with most misclassified cases probably occurring between Groups 1 and 2.
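For reference, a short sketch of the simulation design and the Mahalanobis-distance computation behind the ellipses (group means transcribed as printed above; the random seed and coordinate pair are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, rhos = 2.0, [0.1, 0.5, 0.9]
mu = np.array([3.0, 2.0, 4.0, 1.0])
groups = []
for rho in rhos:                                   # one covariance per ordered class
    Sigma = sigma**2 * ((1 - rho) * np.eye(4) + rho * np.ones((4, 4)))
    groups.append(rng.multivariate_normal(mu, Sigma, size=30))

# fraction of one group inside the 95% ellipse, d^2 = chi^2_{0.05,2} = 5.99
P = groups[0][:, :2]                               # one pair of coordinates
dev = P - P.mean(axis=0)
d2 = np.einsum('ij,jk,ik->i', dev, np.linalg.inv(np.cov(P, rowvar=False)), dev)
print((d2 <= 5.99).mean())                         # roughly 0.95 in large samples
```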
Next, we performed the analysis with our method. To obtain accurate results and avoid over-fitting, we applied the Bayesian model to the simulated data (training data) and generated a test set with 300 observations for each category for validation. We ran four MCMC chains with widely different starting values for 10,000 iterations each and discarded the first 2000 as burn-in to eliminate dependence on the starting points. We also considered several hyperparameter values for the covariance of the regression coefficients, $\Sigma_0 = c\mathbf{I}$, with $c$ ranging between 5 and 20; the effect on the overall results was minimal, and here we report the results for $c = 10$. An informative prior for the boundary parameters was specified by setting all components of $\boldsymbol{\gamma}$ to $\gamma_j = 30$, the count of the $j$th category among the ordinal responses. We note that, despite the widely different starting values, there was good agreement between the results. The misclassification rates on the test data for the three prediction approaches, Equation (16) (with $y_f$ estimated by Equation (14)), Equation (18), and Equation (19), are tabulated in Table 1. The classification results showed that about 270 subjects in total were misclassified, among which 90 of the 300 observations in Group 1 were misclassified into Group 2, while 100 observations in Group 2 were incorrectly assigned to Group 1. The outcome validated our earlier judgment from the pairwise ellipses in Figure 1, where Groups 1 and 2 shared quite a bit of common area. As shown in Table 1, the three types of predictions yielded approximately the same error rates, while the polytomous ordinal logistic regression (POLR) model resulted in a higher classification error rate.
Finally, for comparison purposes, we analyzed the data using common classification methods, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors (KNN), and support vector machines (SVM), which build multi-class classifiers without taking the natural ordering of the response into account. For KNN, we considered neighborhood sizes k from 2 to 8 and report the results for k = 3, which gave the lowest overall misclassification rate. All these approaches produced higher error rates, as summarized in Table 1.

4. Real Data Analysis

We applied our model for discriminant analysis on two real datasets. The first was the well-known Iris flower data, and the second comprised measurements of male Egyptian skulls. The purpose of adopting these two datasets was to examine how the model predictions were affected by different data structures and measurement variations among groups.

4.1. Iris Data

We used this dataset as a benchmark for the analysis. It was introduced by the British statistician Ronald Fisher [16]. It is also sometimes called Anderson's Iris data [17], as it was collected to quantify the morphological variation of Iris flowers of the three related species pictured in Figure 2. The dataset consists of 50 samples from each of three species (Iris setosa, Iris versicolor, and Iris virginica), with four measured features for each sample: the length and width of the sepals and petals, in centimeters. These are generally viewed as nominal category data and have been analyzed in a vast literature, such as [18,19] and many others. Here, we instead treat the three species as ordinal outcomes, ranked by the magnitude of measurement variations. The total variances of the four morphological measurements for each species were 0.3292 (setosa), 0.6248 (versicolor), and 0.8883 (virginica). These numbers, to some extent, represent the size and spread of the flowers, which is consistent with the images in Figure 2, where setosa has the smallest sepals and petals, versicolor larger, and virginica the largest.
These findings can also be observed from the visual display of the ellipses for all pairs of measurements shown in Figure 3, where one may further notice that, except for a little overlap between the versicolor and virginica groups, the three species were overall well-separated; in particular, Iris setosa stood far apart from the other two species. To obtain an accurate classification error rate, we applied a cross-validation approach, partitioning the whole dataset into a training set containing 120 observations (40 per category) and a test set containing 30 observations (10 per category). The partitioning was repeated five times until all the samples from each species were exhausted. The small classification error rates produced by our method with cross-validated prediction, along with POLR regression, are summarized in Table 2: all the setosa cases were classified correctly by both methods, two (respectively three) versicolor cases were misclassified as virginica by the Bayesian (respectively POLR) method, and one virginica case was incorrectly assigned to versicolor by the POLR model. Other common classification methods, which treat the data as nominal, yielded slightly larger error rates, as shown in Table 2. To gauge the reduction in error rate achieved by each prediction approach, we also computed the error rate of the null model, in which no covariates are used, that is, $1 - \max_j(RF_j)$, where $RF_j$ is the relative frequency of the $j$th category in the dataset, as listed in Table 2.
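The five-fold, stratified partition described above can be sketched as follows (the index arithmetic assumes the usual ordering of the Iris data, with 50 consecutive rows per species):

```python
import numpy as np

rng = np.random.default_rng(4)
# shuffle the 50 indices of each species separately
by_species = [rng.permutation(np.arange(s * 50, (s + 1) * 50)) for s in range(3)]
# fold f holds out 10 samples per species (30 total); train on the other 120
folds = [np.concatenate([idx[f * 10:(f + 1) * 10] for idx in by_species])
         for f in range(5)]
```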

4.2. Skull Data

The dataset was obtained from the R package "HSAUR". The data consist of four physical measurements, in millimeters, of 30 male Egyptian skulls from each of five epochs (periods) [20]: Period 1 (4000 BC), Period 2 (3300 BC), Period 3 (1850 BC), Period 4 (200 BC), and Period 5 (150 AD). The measurements were the maximal breadth (mb), basibregmatic height (bh), basialveolar length (bl), and nasal height (nh) of each skull. Figure 4 gives a labeled image illustrating these four measurements on a typical skull. The original researchers attributed changes in skull measurements to the passage of time: systematic changes over time could indicate interbreeding among migrant populations (or the influence of other factors) [21]. Our interest in this analysis, however, lies in how well the period can be predicted from these measurements.
Figure 5 displays the ellipses for all pairs of measurements. It clearly shows that the five groups of data overlap substantially, so that one can hardly distinguish one class from the others, and a high classification error rate is to be expected. For our Bayesian method, an MCMC chain with 10,000 iterations was run, and the first half was discarded to eliminate dependence on the starting points. Several hyperparameter values were considered for the covariance of the regression coefficients, with $c$ ranging between 5 and 15; different values resulted in similar classifications. The error rates of cross-validated prediction are listed in Table 2, where, not surprisingly, the error rates produced by our method were high (83, 82, and 80 skulls were misclassified by the three prediction approaches, respectively, while 92 skulls were incorrectly assigned by POLR), validating our earlier judgment. In contrast to the Iris data, the skull dataset provided an example where the measurements tend to cluster around a common centroid and overlap one another, with little evidence of separation in location. The classification results for these two structurally different datasets attest to the reliability of our model for exploration and prediction with ordinal outcome data.
We also note that all the other common procedures yielded even higher error rates than our method, as summarized in Table 2. From all these classification outcomes, we may reasonably infer that the measurements of these male Egyptian skulls did not change significantly across the given time periods.

5. Concluding Remarks

We have proposed a Bayesian approach for predicting ordinal outcomes. By introducing latent variables underlying the ordinal outcomes, the problem reduces to a linear model setting. While MCMC techniques are generally computationally intensive, with an informative prior placed on the boundary parameters, our MCMC algorithm is practical to implement and converges quickly. The simulation study demonstrated the efficient and impressive performance of the proposed method, and the applications to two real datasets illustrated that our approach provides an efficient, reliable, and precise analysis for ordinal categorical response data.

Author Contributions

N.S.: proposal and development of the methodology, theoretical derivation, paper organization and writing. B.O.D.: code writing and running, paper writing.

Funding

This research was funded by NSF CMMI-0654417 and NIMHD-2G12MD007592.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agresti, A. Analysis of Ordinal Categorical Data, 2nd ed.; Wiley: Hoboken, NJ, USA, 2010. [Google Scholar]
  2. McCullagh, P. Regression models for ordinal data. J. R. Stat. Soc. Ser. B 1980, 42, 109–142. [Google Scholar] [CrossRef]
  3. Sirisrisakulchai, J.; Sriboonchitta, S. Causal effect for ordinal outcomes from observational data: Bayesian approach. Thai J. Math. 2016, (Special Issue on Applied Mathematics: Bayesian Econometrics), 63–70. [Google Scholar]
  4. Walker, S.H.; Duncan, D.B. Estimation of the probability of an event as a function of several independent variables. Biometrika 1967, 54, 167–179. [Google Scholar] [CrossRef] [PubMed]
  5. Harrell, F.E. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, 2nd ed.; Springer: New York, NY, USA, 2001. [Google Scholar]
  6. Albert, J.H.; Chib, S. Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 1993, 88, 669–679. [Google Scholar] [CrossRef]
  7. Cowles, M.; Carlin, B.; Connet, J. Bayesian tobit modeling of longitudinal ordinal clinical trial compliance data with nonignorable missingness. J. Am. Stat. Assoc. 1996, 91, 86–98. [Google Scholar] [CrossRef]
  8. Zhou, X. Bayesian Inference for Ordinal Data. Ph.D. Thesis, Rice University, Houston, TX, USA, 2006. [Google Scholar]
  9. David, H.A.; Nagaraja, H.N. Order Statistics, 3rd ed.; Wiley: Hoboken, NJ, USA, 2003. [Google Scholar]
  10. Sha, N.; Vannucci, M.; Tadesse, M.G.; Brown, P.J.; Dragoni, I.; Davies, N.; Roberts, T.; Contestabile, A.; Salmon, M.; Buckley, C.; et al. Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics 2004, 60, 812–819. [Google Scholar] [CrossRef] [PubMed]
  11. Brown, P.J.; Vannucci, M.; Fearn, T. Bayes model averaging with selection of regressors. J. R. Stat. Soc. Ser. B 2002, 64, 519–536. [Google Scholar] [CrossRef]
  12. Lindley, D.V. A statistical paradox. Biometrika 1957, 44, 187–192. [Google Scholar] [CrossRef]
  13. Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis, 2nd ed.; Chapman & Hall: London, UK, 2004. [Google Scholar]
  14. Dempster, A.P. Elements of Continuous Multivariate Analysis; Addison-Wesley Publisher Co.: Reading, MA, USA, 1969. [Google Scholar]
  15. Friendly, M.; Monette, G.; Fox, J. Elliptical insights: Understanding statistical methods through elliptical geometry. Stat. Sci. 2013, 28, 1–39. [Google Scholar] [CrossRef]
  16. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  17. Anderson, E. The irises of the Gaspé Peninsula. Bull. Am. Iris Soc. 1935, 59, 2–5. [Google Scholar]
  18. Dy, J.G.; Brodley, C.E. Feature selection for unsupervised learning. J. Mach. Learn. Res. 2004, 5, 845. [Google Scholar]
  19. Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis, 6th ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2007. [Google Scholar]
  20. Thomson, A.; Randall-MacIver, D. The Ancient Races of the Thebaid: Being an Anthropometrical Study of the Inhabitants of Upper Egypt from the Earliest Prehistoric Times to the Mohammedan Conquest, Based upon the Examination of Over 1500 Crania; Clarendon Press: Oxford, UK, 1905. [Google Scholar]
  21. Hand, D.J.; Lunn, A.D.; McConway, K.J.; Ostrowski, E. A Handbook of Small Datasets; Chapman and Hall/CRC: London, UK, 1994. [Google Scholar]
Figure 1. Simulated data: visualization of data variation for the three groups.
Figure 2. Three species of Iris flower.
Figure 3. Iris data: ellipse structure of three species.
Figure 4. A labeled male Egyptian skull.
Figure 5. Skull data: ellipse structure of five groups.
Table 1. Simulated data: test data prediction misclassification rates. POLR, polytomous ordinal logistic regression; QDA, quadratic discriminant analysis.

  Method                  Error rate
  Bayesian (Boundary)     0.308
  Bayesian (Probability)  0.307
  Bayesian (Average)      0.294
  POLR                    0.348
  LDA                     0.362
  QDA                     0.357
  KNN                     0.356
  SVM                     0.371
Table 2. Real data: cross-validated prediction misclassification rates.

  Method                  Iris    Skull
  Bayesian (Boundary)     0.013   0.553
  Bayesian (Probability)  0.013   0.546
  Bayesian (Average)      0.013   0.533
  POLR                    0.026   0.607
  LDA                     0.036   0.646
  QDA                     0.034   0.633
  KNN                     0.035   0.653
  SVM                     0.033   0.653
  Null model              0.667   0.800
