2.1. Ordinal Bayesian Models and R Syntax
We previously described four penalized cumulative logit Bayesian models that can be fit when the covariate space is high-dimensional [18]. These are a regression-based variable inclusion indicator ordinal model, a LASSO ordinal model, a normal spike-and-slab ordinal model, and a double exponential spike-and-slab ordinal model. To introduce our penalized cumulative logit Bayesian models, we let $Y_i$ represent the ordinal responses for $i = 1, \ldots, n$ subjects, which can take on one of $k = 1, \ldots, K$ ordinal response levels, with $K$ representing the number of ordinal levels. Let $\mathbf{x}_i = (x_{i1}, \ldots, x_{iP})^\top$ represent the vector of covariates for subject $i$, where $P$ represents the number of predictors. When assuming proportional odds, the effect of each covariate is constant across all ordinal response levels, such that the slopes for the ordinal responses are parallel. For each ordinal level $k = 1, \ldots, K-1$, let $\alpha_k$ denote the threshold, and let $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_P)^\top$ denote a vector of unknown regression coefficients. The cumulative logit model is
$$\operatorname{logit}\left[P(Y_i \le k \mid \mathbf{x}_i)\right] = \alpha_k + \mathbf{x}_i^\top \boldsymbol{\beta}, \quad k = 1, \ldots, K-1,$$
where $P(Y_i \le k \mid \mathbf{x}_i)$ is the cumulative probability of the event $Y_i \le k$ given $\mathbf{x}_i$. The thresholds differentiate between the $K$ ordinal levels and must satisfy the constraint $\alpha_1 < \alpha_2 < \cdots < \alpha_{K-1}$.
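As a worked step, the individual category probabilities follow from the cumulative probabilities by differencing. The block below is a sketch that adopts the boundary conventions $P(Y_i \le 0 \mid \mathbf{x}_i) = 0$ and $P(Y_i \le K \mid \mathbf{x}_i) = 1$; these conventions and the symbols $Y_i$, $\mathbf{x}_i$, and $K$ (response, covariate vector, and number of levels) are stated here as assumptions for illustration.

```latex
\begin{align*}
P(Y_i = k \mid \mathbf{x}_i)
  &= P(Y_i \le k \mid \mathbf{x}_i) - P(Y_i \le k-1 \mid \mathbf{x}_i),
  \qquad k = 1, \ldots, K, \\
\sum_{k=1}^{K} P(Y_i = k \mid \mathbf{x}_i)
  &= P(Y_i \le K \mid \mathbf{x}_i) - P(Y_i \le 0 \mid \mathbf{x}_i) = 1 .
\end{align*}
```

Each difference must be non-negative, so the cumulative probabilities have to be non-decreasing in $k$. Under proportional odds, ordered thresholds are what guarantee this.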
Herein, we describe our ordinalbayes package, which enhances the functionality of the runjags package by providing functions specific to fitting these four penalized ordinal Bayesian models and extracting results of interest. We also provide an overview of each model. Tables summarizing the package functions and syntax appear in Appendix C.
The primary function for model fitting in the ordinalbayes package is ordinalbayes. The function arguments are
function (formula, data, x = NULL, subset, center = TRUE, scale = TRUE,
a = 0.1, b = 0.1, model = "regressvi", gamma.ind = "fixed",
pi.fixed = 0.05, c.gamma = NULL, d.gamma = NULL, alpha.var = 10,
sigma2.0 = NULL, sigma2.1 = NULL, coerce.var = 10, lambda0 = NULL,
adaptSteps = 5000, burnInSteps = 5000, nChains = 3, numSavedSteps = 9999,
thinSteps = 3, parallel = TRUE, seed = NULL, quiet = FALSE)
This function accepts a model formula that specifies the ordinal response on the left-hand side and any unpenalized predictor variable(s) on the right-hand side. Unpenalized predictors are variables, such as age, that are included in the model without shrinking their corresponding parameter estimates. When unpenalized predictors are included as covariates in the model, the user can specify the variance associated with the corresponding model parameters through coerce.var (default 10). If no unpenalized predictor variables are included, the model formula should be of the form y ~ 1, where y is the ordinal response and 1 represents the intercept. The user can subset the data.frame prior to model fitting, for example, subset = (race == "white"). To specify the penalized covariates in the model, the user should pass to the x parameter the columns of the data.frame that contain the relevant covariates. By default, the penalized covariates are centered (center = TRUE) and scaled (scale = TRUE).
Selected parameters are initialized prior to updating through MCMC. For one chain, the $K-1$ ordinal thresholds, $\alpha_1, \ldots, \alpha_{K-1}$, are initialized to the logit of the cumulative response probabilities, which is equivalent to the estimated thresholds in an intercept-only model. For multiple chains, initial values for the $\alpha_k$ terms for chains beyond the first chain are sampled from a Normal(0, 0.5) distribution and then sorted to impose the order restriction. Within the MCMC, the $\alpha_k$ terms are sampled from a Normal(0, $\sigma^2_\alpha$) distribution, and users can adjust the variance $\sigma^2_\alpha$ by specifying alpha.var (default 10, such that the precision is 0.10). All penalized coefficients ($\beta_j$ for $j = 1, \ldots, P$) are initialized to zero.
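The initialization described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the toy response y, the object names, and the choice of five penalized coefficients are hypothetical, and the 0.5 in Normal(0, 0.5) is treated as a standard deviation because the text does not say whether it is a variance, standard deviation, or precision.

```r
# Toy ordinal response with K = 4 levels (hypothetical data for illustration).
y <- factor(rep(c("I", "II", "III", "IV"), times = c(10, 20, 30, 40)),
            levels = c("I", "II", "III", "IV"), ordered = TRUE)
K <- nlevels(y)

# First chain: logit of the cumulative response proportions.
# The last cumulative proportion equals 1 and is dropped, leaving K - 1 values.
cum_prop <- cumsum(prop.table(table(y)))[-K]
alpha_init_chain1 <- qlogis(cum_prop)

# Later chains: draw K - 1 values and sort them to respect the ordering.
# ASSUMPTION: 0.5 is used as a standard deviation here.
set.seed(1)
alpha_init_chain2 <- sort(rnorm(K - 1, mean = 0, sd = 0.5))

# Penalized coefficients start at zero (P = 5 is arbitrary).
beta_init <- rep(0, 5)

print(round(alpha_init_chain1, 3))
```

Because the cumulative proportions are increasing and the logit is monotone, the first-chain values satisfy the order restriction without sorting.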
Other relevant parameters common to all model types include: nChains, the number of parallel chains for the model (default 3); adaptSteps, the number of iterations for adaptation (default 5000); burnInSteps, the number of burn-in iterations of the Markov chain (default 5000); numSavedSteps, the number of saved steps per chain (default 9999); and thinSteps, the thinning interval for monitors (default 3). Provided the user will be running the model on a machine with multiple processors, the computational speed can be improved by running the chains in parallel by specifying parallel = TRUE. When parallel = TRUE, runjags executes the MCMC sampling using nChains parallel processors. To ensure the user can obtain reproducible results, seed accepts an integer that is used to set the random seed. The output from JAGS can be suppressed by specifying quiet = TRUE. The user can fit one of four available Bayesian models. A list of the parameters the user can set for all four models is provided in Table A1. Following Section 2.1.1, which describes applying ordinalbayes to Bioconductor objects, each of the four models is described along with the relevant arguments that must be specified by the user. A list of the parameters the user needs to set for each specific model is provided in Table A2.
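To make the calling convention concrete, the following sketch simulates a small dataset and shows how the arguments described above fit together. The data, the column names (stage, age, gene1, ...), and the run_fit switch are hypothetical and introduced only for illustration; the argument names are those listed in the function signature. The fit itself is switched off by default because it requires JAGS and the runjags and ordinalbayes packages to be installed.

```r
# Simulated data (hypothetical): 60 subjects, 1 unpenalized covariate,
# 20 candidate features to be penalized.
set.seed(42)
n <- 60
P <- 20
x <- matrix(rnorm(n * P), n, P, dimnames = list(NULL, paste0("gene", 1:P)))
latent <- 1.5 * x[, 1] - 1.5 * x[, 2] + rlogis(n)
dat <- data.frame(
  stage = cut(latent, breaks = c(-Inf, -1, 1, Inf),
              labels = c("low", "mid", "high"), ordered_result = TRUE),
  age = round(rnorm(n, 55, 10)),
  x
)

# Set run_fit to TRUE on a machine where JAGS and ordinalbayes are available.
run_fit <- FALSE
if (run_fit && requireNamespace("ordinalbayes", quietly = TRUE)) {
  fit <- ordinalbayes::ordinalbayes(
    stage ~ age,                      # unpenalized predictor on the right-hand side
    data = dat,
    x = dat[, paste0("gene", 1:P)],   # penalized covariates
    model = "regressvi", gamma.ind = "fixed", pi.fixed = 0.05,
    nChains = 3, adaptSteps = 5000, burnInSteps = 5000,
    numSavedSteps = 9999, thinSteps = 3,
    parallel = FALSE, seed = 26
  )
  print(fit)
}
```

If every covariate were to be penalized, the formula in this sketch would become stage ~ 1.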
2.1.1. Use with Bioconductor Objects: SummarizedExperiment and ExpressionSet
When analyzing data processed using the DESeq2 Bioconductor package, the genomic feature object is of class DESeqTransform, which is a SummarizedExperiment; therefore, the phenotypic data are accessed using the colData extractor function. When analyzing data processed using packages that structure the genomic feature object as a Biobase ExpressionSet, the phenotypic data are accessed using the pData extractor function. Therefore, in the ordinalbayes call, data should be either a colData() or pData() call to the genomic feature object. As before, the ordinalbayes function accepts a model formula that specifies the ordinal response on the left-hand side and any unpenalized predictor variable(s) from the phenotypic dataset on the right-hand side. If no unpenalized predictor variables are included, the model formula should be of the form y ~ 1, where y is the ordinal response and 1 represents the intercept.
When specifying the penalized covariates in the model, the user should pass to the x parameter the appropriate call for extracting the genomic feature data from the object. For SummarizedExperiment objects, the genomic features to be penalized are accessed using the assay() extractor function. For ExpressionSet objects, the genomic features to be penalized are accessed using the exprs() extractor function. The user can also pass a matrix to x; however, the user needs to carefully verify that the observations in the x matrix are appropriately aligned to the phenotypic data. Note that data and x must have the same number of rows. Because assay() and exprs() return matrices with features in rows and samples in columns, the transpose of the assay() or exprs() result should be supplied to x.
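The orientation and alignment requirement can be illustrated without Bioconductor by using a plain matrix as a stand-in for the output of assay() or exprs(). The objects feat and pheno below are hypothetical; only the transpose and the row matching are the point of the sketch.

```r
# Stand-in for assay(se) or exprs(eset): features in rows, samples in columns.
set.seed(7)
feat <- matrix(rnorm(4 * 6), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("S", 1:6)))

# Stand-in for the phenotypic data: one row per sample, stored in a different order.
pheno <- data.frame(
  stage = factor(c("I", "II", "III", "I", "II", "III"),
                 levels = c("I", "II", "III"), ordered = TRUE),
  row.names = paste0("S", c(3, 1, 2, 6, 5, 4))
)

# Transpose so samples are rows, then reorder the rows to match the phenotype rows.
x <- t(feat)[rownames(pheno), , drop = FALSE]

# Guard against silent misalignment before fitting.
stopifnot(identical(rownames(x), rownames(pheno)))
dim(x)
```

For a SummarizedExperiment or ExpressionSet, the sample order of the feature matrix and the phenotypic data typically already agrees, so the transpose alone is enough; the reordering step matters when x is a separately stored matrix.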
2.2. Regression-Based Variable Inclusion Indicator Ordinal Model
By default, the model that is fit is the regression-based variable inclusion indicator Bayesian model, specified by model = "regressvi". This model takes the form
$$\operatorname{logit}\left[P(Y_i \le k \mid \mathbf{x}_i)\right] = \alpha_k + \sum_{j=1}^{P} \gamma_j \beta_j x_{ij}$$
and assumes the penalized coefficients $\beta_j$ are from a Laplace (or double exponential) distribution with parameter $\lambda$ and that $\lambda$ is from a Gamma distribution with parameters a and b. Based on our extensive simulations [19], model performance is not affected by choices of a and b, so we provide defaults of 0.1 for both. The variable inclusion indicator $\gamma_j$ is assumed to follow a Bernoulli distribution with parameter $\pi_j$. The user can use either a fixed constant prior (default) or a random prior for $\pi_j$. When using a fixed constant prior, the user must specify gamma.ind = "fixed" and set pi.fixed to some constant in the (0, 1) interval (default is 0.05). Alternatively, a random prior for $\pi_j$ is achieved by specifying both gamma.ind = "random" and parameter values (c.gamma and d.gamma) for the Beta distribution. Values of c.gamma and d.gamma should be selected such that the mean of the Beta distribution for the variable inclusion indicators corresponds to the anticipated proportion of covariates truly associated with the ordinal response, given by $c/(c+d)$, while considering that the variance is given by $cd/[(c+d)^2(c+d+1)]$, where $c$ and $d$ denote c.gamma and d.gamma. If unpenalized predictors are included in the model, their coefficients, $\zeta$, are assigned a zero-mean normal prior with variance coerce.var.
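One way to choose c.gamma and d.gamma is to fix a target mean and variance for the Beta prior and solve the two moment equations. The helper below, beta_hyper, is a hypothetical function written for this illustration and is not part of the package; it uses only the standard Beta mean $c/(c+d)$ and variance $cd/[(c+d)^2(c+d+1)]$.

```r
# Solve for Beta(c, d) parameters from a target mean m and variance v.
# With s = c + d: mean = c / s and variance = m * (1 - m) / (s + 1).
beta_hyper <- function(m, v) {
  stopifnot(m > 0, m < 1, v > 0, v < m * (1 - m))
  s <- m * (1 - m) / v - 1
  c(c.gamma = m * s, d.gamma = (1 - m) * s)
}

# Example: expect about 5% of covariates to be associated with the response,
# with prior variance 0.001 around that proportion (illustrative values).
hp <- beta_hyper(m = 0.05, v = 0.001)
print(hp)

# Check the moments implied by the solution.
implied_mean <- hp[["c.gamma"]] / sum(hp)
implied_var  <- prod(hp) / (sum(hp)^2 * (sum(hp) + 1))
```

A smaller target variance yields larger c.gamma and d.gamma, that is, a prior concentrated more tightly around the anticipated proportion.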
2.6. Other Package Functions
The ordinalbayes function yields an object of class ordinalbayes. Generic functions have been tailored to extract meaningful results from the resulting MCMC chain. The print function returns several summaries from the MCMC output for each parameter monitored, including the lower limit of the 95% highest posterior density (HPD) credible interval (Lower95), the median value (Median), the upper limit of the 95% HPD credible interval (Upper95), the mean value (Mean), the sample standard deviation (SD), the mode of the variable (Mode), the Monte Carlo standard error (MCerr), the percent of SD due to MCMC (MC%ofSD), the effective sample size (SSeff), the autocorrelation at a lag of 30 (AC.30), and the potential scale reduction factor (psrf). The plot function provides a trace of the sampled output and optionally the density estimate for each variable in the chain. This function additionally adds the appropriate beta and gamma labels for each penalized variable name.
When identifying important covariates, the regression-based variable inclusion indicator, normal spike-and-slab, and double exponential spike-and-slab Bayesian ordinal models all incorporate a variable inclusion indicator, $\gamma_j$, in the model. Variable selection can be based on whether the posterior mean of $\gamma_j$ exceeds a pre-specified threshold. Alternatively, we can use the Bayes factor to test the hypotheses $H_0: \gamma_j = 0$ versus $H_1: \gamma_j = 1$, where the null hypothesis is rejected for feature $j$ if the Bayes factor exceeds a pre-specified threshold. For the LASSO, normal spike-and-slab, and double exponential spike-and-slab Bayesian ordinal models, the Bayes factor can be used to test an interval null hypothesis $H_0: |\beta_j| \le \epsilon$ versus $H_1: |\beta_j| > \epsilon$, where $\epsilon$ is a small positive value that is close to 0. For the regression-based variable inclusion indicator Bayesian ordinal model, the Bayes factor can be used to test $H_0: |\gamma_j \beta_j| \le \epsilon$ versus $H_1: |\gamma_j \beta_j| > \epsilon$. Note that for the Bayesian LASSO, no variable inclusion indicators are incorporated, so variable selection can only be performed using the Bayes factor for $\beta_j$.
The summary function requires an ordinalbayes object, and the user can specify epsilon (default 0.1) for testing the interval null hypothesis. The output from summary is a list containing the following components: alphamatrix, the MCMC output for the threshold parameters; betamatrix, the MCMC output for the penalized parameters; zetamatrix, the MCMC output for the unpenalized parameters (if included); gammamatrix, the MCMC output for the variable inclusion parameters (not available when model = "lasso"); gammamean, the posterior mean of the variable inclusion indicators (not available when model = "lasso"); gamma.BayesFactor, the Bayes factor for the variable inclusion indicators (not available when model = "lasso"); Beta.BayesFactor, the Bayes factor for the penalized parameters; and lambdamatrix, the MCMC output for the penalty parameter (not available when model = "normalss").
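The two selection rules for the inclusion indicators can be sketched with a toy chain. The code uses the textbook definition of a Bayes factor as the ratio of posterior odds to prior odds; whether the package computes gamma.BayesFactor in exactly this way is an assumption. The draws, the fixed prior probability, and the cutoffs of 0.5 and 5 are hypothetical values chosen for illustration.

```r
# Hypothetical MCMC draws of two inclusion indicators (rows = saved steps).
gamma_draws <- cbind(gamma1 = rep(c(1, 0), times = c(60, 40)),
                     gamma2 = rep(c(1, 0), times = c(5, 95)))
pi_fixed <- 0.05                      # prior inclusion probability

post_mean  <- colMeans(gamma_draws)   # posterior inclusion probability
post_odds  <- post_mean / (1 - post_mean)
prior_odds <- pi_fixed / (1 - pi_fixed)
bf <- post_odds / prior_odds          # Bayes factor in favor of inclusion

# Rule 1: posterior mean above a cutoff. Rule 2: Bayes factor above a cutoff.
selected_by_mean <- names(post_mean)[post_mean > 0.5]
selected_by_bf   <- names(bf)[bf > 5]

print(round(bf, 2))
```

In this toy chain the second indicator is included in 5% of the draws, which matches the prior, so its Bayes factor is 1 and the data provide no evidence for or against inclusion.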
The coef function also accepts an ordinalbayes object and returns a summary of the posterior distribution of the penalized parameter estimates and variable inclusion indicators, computed with the function supplied to method (default method = mean).
The predict function accepts an ordinalbayes object and optionally allows the user to specify new data for the unpenalized predictors and the penalized predictors through neww and newx, respectively. If neww and newx are not supplied, the original data are used for prediction. The model.select parameter allows the user to obtain model predictions through one of three methods. When model.select = "average" (default), the mean coefficient values over the MCMC chain are used to estimate fitted probabilities; the predicted ordinal response is the level attaining the maximum fitted probability. When model.select = "median", the median coefficient values over the MCMC chain are used to estimate fitted probabilities; the predicted ordinal response is again the level attaining the maximum fitted probability. When model.select = "max.predicted.class", each step in the chain is used to calculate fitted probabilities and a predicted ordinal response, and the final predicted ordinal response is the level that is most frequently predicted across steps. The function fitted is synonymous with predict.
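The three prediction strategies can be mimicked on a toy chain in base R. This sketch is not the package's predict method: the helper class_probs, the simulated draws, and the sign convention logit P(Y <= k | x) = alpha_k + x'beta are assumptions made for illustration, and unpenalized predictors are omitted.

```r
# Fitted class probabilities under a cumulative logit model.
# ASSUMPTION: logit P(Y <= k | x) = alpha_k + x %*% beta.
class_probs <- function(alpha, beta, x) {
  eta <- as.vector(x %*% beta)                 # one linear predictor per subject
  cum <- cbind(plogis(outer(eta, alpha, "+")), 1)  # cumulative probs, last is 1
  cbind(cum[, 1], cum[, -1, drop = FALSE] - cum[, -ncol(cum), drop = FALSE])
}

# Hypothetical chain: S saved steps, K = 3 levels, P = 3 penalized covariates.
set.seed(1)
S <- 200; P <- 3; n <- 5
alpha_draws <- t(apply(matrix(rnorm(S * 2, c(-1, 1), 0.2), S, 2, byrow = TRUE), 1, sort))
beta_draws  <- matrix(rnorm(S * P, c(1, 0, -1), 0.1), S, P, byrow = TRUE)
x <- matrix(rnorm(n * P), n, P)
K <- ncol(alpha_draws) + 1

# "average": plug in posterior means, then take the most probable level.
p_avg    <- class_probs(colMeans(alpha_draws), colMeans(beta_draws), x)
pred_avg <- max.col(p_avg, ties.method = "first")

# "median": plug in posterior medians instead.
p_med    <- class_probs(apply(alpha_draws, 2, median), apply(beta_draws, 2, median), x)
pred_med <- max.col(p_med, ties.method = "first")

# "max.predicted.class": predict at every step, then take the most frequent level.
votes <- sapply(seq_len(S), function(s) {
  max.col(class_probs(alpha_draws[s, ], beta_draws[s, ], x), ties.method = "first")
})
pred_vote <- apply(votes, 1, function(v) which.max(tabulate(v, nbins = K)))
```

The first two strategies summarize the chain before predicting, whereas the third predicts at every step and summarizes afterwards, so the three can disagree for subjects whose fitted probabilities are close across levels.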