ordinalbayes: Fitting Ordinal Bayesian Regression Models to High-Dimensional Data Using R

The stage of cancer is a discrete ordinal response that indicates the aggressiveness of disease and is often used by physicians to determine the type and intensity of treatment to be administered. For example, the FIGO stage in cervical cancer is based on the size and depth of the tumor as well as the level of spread. It may be of clinical relevance to identify molecular features from high-throughput genomic assays that are associated with the stage of cervical cancer to elucidate pathways related to tumor aggressiveness, identify improved molecular features that may be useful for staging, and identify therapeutic targets. High-throughput RNA-Seq data and corresponding clinical data (including stage) for cervical cancer patients have been made available through The Cancer Genome Atlas Project (TCGA). We recently described penalized Bayesian ordinal response models that can be used for variable selection for over-parameterized datasets, such as the TCGA-CESC dataset. Herein, we describe our ordinalbayes R package, available from the Comprehensive R Archive Network (CRAN), which enhances the runjags R package by enabling users to easily fit cumulative logit models when the outcome is ordinal and the number of predictors exceeds the sample size, P > N, such as for TCGA and other high-throughput genomic data. We demonstrate the use of this package by applying it to the TCGA cervical cancer dataset. Our ordinalbayes package can be used to fit models to high-dimensional datasets, and it effectively performs variable selection.


Introduction
Despite the advent of HPV vaccinations and effective screening programs, globally, cervical cancer is the fourth most common cancer among women [1]. The estimated number of new cases in 2020 is 604,127 with 341,831 deaths [2]. The stage of cervical cancer, as outlined in the International Federation of Gynecology and Obstetrics (FIGO) guidelines, is based on physical examinations, endoscopic procedures, and imaging. Specifically, the FIGO stage is based on the size and depth of the tumor as well as the level of spread [3]. It is important that the stage, a discrete ordinal response, be correct as it is used to guide treatment planning, counsel patients with respect to prognosis, and to determine whether the patient meets eligibility criteria for available clinical trials or other research studies [4,5]. Unfortunately, there is still debate as to whether surgical or non-invasive radiological modalities for identifying parametrial and lymph node involvement is preferred when staging a patient [4]. Thus, it is clinically relevant to identify molecular features from high-throughput genomic assays that are associated with the stage of cervical cancer to elucidate pathways related to tumor aggressiveness, identify improved molecular features that may be useful for staging, and identify therapeutic targets.
Penalized frequentist models have been widely applied when analyzing high-dimensional data. Such models were initially described for linear [6] and logistic [7] regression and subsequently for ordinal response models [8][9][10]. However, when applying penalized frequentist models, the penalty parameter, or vector of parameters in the case of elastic net, must be selected by the analyst. As a result, the coefficient estimates from the resulting model are conditional on that penalty parameter. For that reason, penalized Bayesian models were developed for the linear [11][12][13][14] and logistic [15][16][17] regression settings. We also recently described penalized Bayesian models for the ordinal response setting [18] and demonstrated that our penalized Bayesian cumulative logit model has improved variable selection performance when compared to penalized frequentist cumulative logit models [19].
Herein, we describe our ordinalbayes R package, which enhances the runjags R package [20] by enabling users to easily fit penalized Bayesian cumulative logit models. The ordinalbayes function can be used to fit LASSO, normal spike-and-slab, double exponential spike-and-slab, and regression-based variable inclusion indicator Bayesian models. Variable selection can be performed using the Bayes factor or using the posterior distributions of the variable inclusion indicators directly. In the following sections, we describe our implementation and describe the syntax required for each of our Bayesian models. We then illustrate the functions in the ordinalbayes R package using two examples where we were interested in identifying transcripts important to predicting the FIGO stage in cervical cancer patients using high-throughput gene expression data. A small example is provided in Appendix A.

Ordinal Bayesian Models and R Syntax
We previously described four penalized cumulative logit Bayesian models that can be fit when the covariate space is high-dimensional [18]. This includes a regression-based variable inclusion indicator ordinal model, a LASSO ordinal model, a normal spike-and-slab ordinal model, and a double exponential spike-and-slab ordinal model. To introduce our penalized cumulative logit Bayesian models, we let Y 1 , … ,Y n represent the ordinal responses for n subjects, which can take on one of 1, … , K ordinal response levels, with K representing the number of ordinal levels. Let x i = (x i1 , x i2 , … , x iP )′ represent the vector of covariates for subject i, where P represents the number of predictors. When assuming proportional odds, the effect of each covariate is constant across all ordinal response levels such that the slope for the ordinal responses are parallel. For each ordinal level k = 1, 2, … , K − 1, let β = (β 1 , β 2 … , β P )′ denote a vector of unknown regression coefficients. The cumulative logit model where Pr(Y i ≤ k|x i ) is the cumulative probability of the event Y i ≤ k given x i . The thresholds differentiate between the K ordinal levels and must satisfy the constraint −∞ = α 0 < α 1 <α 2 < ··· < α K−1 <α K = ∞.
Herein, we describe our ordinalbayes package that enhances the functionality of the runjags package by providing functions specific to fitting these four penalized ordinal Bayesian models and extracting results of interest. We also provide an overview of each model. Tables summarizing the package functions and syntax appears in Appendix C.
The primary function for model fitting in the ordinalbayes package is ordinalbayes. The function arguments are This function accepts a model formula that specifies the ordinal response on the left-hand side of the equation and any unpenalized predictor variable(s) on the right-hand side of the equation. Unpenalized predictors are variables such as age that we include in the model without applying any shrinkage of their corresponding parameter estimates. When unpenalized predictors are included as covariates in the model, the user can specify the variance associated with the corresponding model parameters (default coerce.var = 10). If no unpenalized predictor variables are included, the model formula should be y ~ 1 (representing the intercept). The user can subset the data.frame prior to model fitting, for example, subset=(race =="white"). To specify the penalized covariates in the model, the user should pass the data.frame to the x parameter, indicating the relevant columns of covariates. By default, the penalized covariates are centered (center = TRUE) and scaled (scale = TRUE).
The selected parameters are initialized prior to updating through MCMC. For one chain, the k − 1 ordinal thresholds, α k , are initialized to the logit of the cumulative response probabilities, which is equivalent to the estimated k − 1 thresholds in an intercept-only For multiple chains, initial values for the α k terms for chains beyond the first chain are sampled from a Normal(0, 0.5) distribution and then sorted to impose the α 1 < ··· <α k−1 order restriction. Within the MCMC, the α k terms are sampled from a Normal 0, σ α k 2 , and users can adjust the variance by specifying alpha.var (default 10 such that the precision is 0.10). All penalized coefficients (β j for j = 1, … , P) are initialized to zero.
Other relevant parameters common to all model types include: nChains, the number of parallel chains for the model (default 3); adaptSteps, the number of iterations for adaptation (default 5000); burnInSteps, the number of iterations of the Markov chain to run (default 5000); numSavedSteps, the number of saved steps per chain (default 9999); and thinSteps, the thinning interval for monitors (default 3). Provided the user will be running the model on a machine with multiple processors, the computational speed can be improved by running the chains in parallel by specifying parallel = TRUE. When parallel = TRUE, runjags executes the MCMC sampling using nChains parallel processors. To ensure the user can obtain reproducible results, seed accepts an integer that is used to set the random seed. The output from JAGS can be suppressed by specifying quiet = TRUE. The user can fit one of four available Bayesian models. A list of the parameters the user can set for all four models is provided in Table A1. Following Section 2.1.1, which describes applying ordinalbayes to Bioconductor objects, each of the four models is described along with the relevant arguments that must be specified by the user. A list of the parameters the user needs to set for each specific model is provided in Table A2.

Use with Bioconductor Objects: SummarizedExperiment and
ExpressionSet-When analyzing data processed using the DESeq2 Bioconductor package, the genomic feature object is of class DESeqTransform, which is a SummarizedExperiment, and therefore, the phenotypic data are accessed using the colData extractor function. When analyzing data processed using packages that structure the genomic feature object as a Biobase ExpressionSet, the phenotypic data are accessed using the pData extractor function. Therefore, in the ordinalbayes call, data should be either a colData() or pData() call to the genomic feature object. Again, the ordinalbayes function accepts a model formula that specifies the ordinal response on the left-hand side of the equation and any unpenalized predictor variable(s) from the phenotypic dataset on the right-hand side of the equation. If no unpenalized predictor variables are included, the model formula should be y ~ 1 (representing the intercept).
When specifying the penalized covariates in the model, the user should pass to the x parameter the appropriate call for extracting the genomic feature data from the object. For SummarizedExperiment objects, the genomic features to be penalized are accessed using the assay() extractor function. For ExpressionSet objects, the genomic features to be penalized are accessed using the exprs() extractor function. The user can also pass a matrix to x; however, the user needs to carefully verify that the observations in the x matrix are appropriately aligned to the phenotypic data. Note that the number of rows in both data and x should be the same, such that the transpose of assay or exprs should be supplied to x.

Regression-Based Variable Inclusion Indicator Ordinal Model
By default, the model that is fit is the regression-based variable inclusion indicator Bayesian model, specified by model = "regressvi". This model takes the form γ j Bernoulli π j , for j = 1, …, p π j = t or π j Beta(c, d), for j = 1, …, p and assumes the penalized coefficients are from a Laplace (or double exponential) distribution with parameter λ and that λ is from a Gamma distribution with parameters a and b. Based on our extensive simulations [19], model performance is not affected by choices of a and b, so we provide defaults of 0.1 for both. The variable inclusion indicator γ j is assumed to follow a Bernoulli distribution with parameter π j . The user can use either a fixed constant prior (default) or a random prior. When using a fixed constant prior, the user must specify both gamma.ind="fixed" and set pi.fixed to some constant in the (0, 1) interval (default is 0.05). Alternatively, a random prior for π j is acheived by specifying both gamma.ind="random" and parameter values (c.gamma and d.gamma) for the Beta distribution. Values of c.gamma and d.gamma should be selected such that the mean of the Beta distribution for the variable inclusion indicators corresponds to the anticipated proportion of covariates truly associated with the ordinal response, given by c/(c + d), while considering that the variance is given by Archer et al. Page 5 If unpenalized coefficients are included in the model, their coefficients are ζ ~ Normal 0, σ coerce 2 .

Lasso Ordinal Model
The LASSO Bayesian ordinal model can be fit by specifying model="lasso". This model assumes the penalized coefficients β j for j = 1, … , P are from independent Laplace (or double exponential) distributions with parameter λ and that λ is from a Gamma distribution with parameters a and b.
As previously mentioned, model performance is not affected by choices of a and b, so we provide defaults of 0.1 for both. If unpenalized coefficients are included in the model, their coefficients are ζ Normal 0, σ coerce 2 .
When fitting this model, the user is required to specify the variance for the spike σ 0 2 by setting sigma2.0 to a small positive value (e.g., 0.01) and variance for the slab σ 1 2 by setting sigma2.1 to a large positive value (e.g., 10). As with the regression-based variable inclusion indicator Bayesian model, the variable inclusion indicator γ j is assumed to follow a Bernoulli distribution with parameter π j . The user can use either a fixed constant prior (default) or a random prior. When using a fixed constant prior, the user must specify both gamma.ind="fixed" and set pi.fixed to some constant in the (0, 1) interval (default is 0.05). Alternatively, a random prior for π j is acheived by specifying both gamma.ind="random" and parameter values (

Double Exponential Spike-and-Slab Ordinal Model
The double exponential spike-and-slab ordinal model can be fit by specifying model="dess" and is given by γ j Bernoulli π j , for j = 1, …, p π j = t or π j Beta(c, d), for j = 1, …, p When fitting this model the user is required to specify the parameter for the spike (λ 0 ) using lambda0, which should be a large positive value (e.g., 20), while the slab is taken to be a double exponential distribution with parameter λ where that λ is from a Gamma distribution with parameters a and b. As with the regression-based variable inclusion indicator and Normal spike-and-slab models, the variable inclusion indicator γ j is assumed to follow a Bernoulli distribution with parameter π j . The user can use either a fixed constant prior (default) or a random prior. When using a fixed constant prior, the user must specify both gamma.ind="fixed" and set pi.fixed to some constant in the (0, 1) interval (default is 0.05). Alternatively, a random prior for π j is achieved by specifying both gamma.ind="random" and parameter values (c.gamma and d.gamma) for the Beta distribution. If unpenalized coefficients are included in the model, their coefficients are ζ Normal 0, σ coerce 2 .

Other Package Functions
The ordinalbayes function yields an object of class ordinalbayes. Generic functions have been specifically tailored to extract meaningful results from the resulting MCMC chain. The print function returns several summaries from the MCMC output for each parameter monitored, including: the 95th lower confidence limit for the highest posterior density (HPD) credible interval (Lower95), the median value (Median), the 95th upper confidence limit for the HPD credible interval (Upper95), the mean value (Mean), the sample standard deviation (SD), the mode of the variable (Mode), the Monte Carlo standard error (MCerr,) percent of SD due to MCMC (MC%ofSD), effective sample size (SSeff), autocorrelation at a lag of 30 (AC.30), and the potential scale reduction factor (psrf). The plot function provides a trace of the sampled output and optionally the density estimate for each variable in the chain. This function additionally adds the appropriate beta and gamma labels for each penalized variable name.
When identifying important covariates, the regression-based variable inclusion indicator, normal spike-and-slab, and double exponential spike-and-slab Bayesian ordinal models all incorporate a variable inclusion indicator, γ j , in the model. Variable selection can be based on whether the posterior mean of γ j exceeds a pre-specified threshold. Alternatively, we can use the Bayes factor to test the hypotheses H 0j : γ j = 0 versus H aj : γ j = 1, where the null hypothesis is rejected for feature j if the Bayes factor exceeds a pre-specified threshold. For the LASSO, normal spike-and-slab, and double exponential spike-and-slab Bayesian ordinal models, the Bayes factor can be used to test an interval null hypothesis H 0j : |β j | ≤ ϵ versus H aj : |β j | > ϵ, where ϵ is a small positive value that is close to 0.
For the regression-based variable inclusion indicator Bayesian ordinal model, the Bayes factor can be used to test H 0j : |γ j β j | ≤ ϵ versus H aj : |γ j β j | > ϵ. Note that for the Bayesian LASSO, no variable inclusion indicators are incorporated, so variable selection can only be performed using the Bayes factor for β. The summary function requires an ordinalbayes object, and the user can specify epsilon (default 0.1) for testing the null hypothesis that H 0j : |β j | ≤ ϵ. The output from summary is a list containing the following components: alphamatrix, the MCMC output for the threshold parameters; betamatrix, the MCMC output for the penalized parameters; zetamatrix, The MCMC output for the unpenalized parameters (if included); gammamatrix, the MCMC output for the variable inclusion parameters (not available when model = "lasso"); gammamean, the posterior mean of the variable inclusion indicators (not available when model = "lasso"); gamma.BayesFactor, Bayes factor for the variable inclusion indicators (not available when model = "lasso"); Beta.BayesFactor, Bayes factor for the penalized parameters; and lambdamatrix, the MCMC output for the penalty parameter (not available when model="normalss"). The coef function also accepts an ordinalbayes object and returns a function (default is method=mean) of the posterior distribution of the penalized parameter estimates and variable inclusion indicators.
The predict function accepts an ordinalbayes object and optionally allows the user to specify new data for unpenalized predictors and the penalized predictors by invoking neww = and newx =, respectively. If neww and newx are not supplied, the original data are used for prediction. The model.select parameter allows the user to obtain model predictions through one of three different methods. When model.select = "average" (default), the mean coefficient values over the MCMC chain are used to estimate fitted probabilities; the predicted response is attaining the maximum fitted probability. When model.select = "median", the median coefficient values over the MCMC chain are used to estimate fitted probabilities; the predicted ordinal response is attaining the maximum fitted probability. When model.select = "max.predicted.class", each step in the chain is used to calculate fitted probabilities and the ordinal response, then the final predicted ordinal response is taken as that ordinal response level that is most frequently predicted. The function fitted is synonymous with predict.

Analysis of Cervical Cancer Dataset
We downloaded the transcript-level HTSeq count data for the 309 subjects from the The Cancer Genome Atlas Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (TCGA-CESC) project [21] having transcriptome profiling performed using the TCGAbiolinks Bioconductor package [22]. We then restricted attention to the 253 cervical cancer subjects with a primary diagnosis of squamous cell carcinoma. Subsequently, we removed one subject whose sample was FFPE preserved, one subject with metastatic disease, two subjects who contributed only solid normal tissue, and seven subjects lacking FIGO stage. This left 242 subjects in Stage I (N = 124), II (N = 61), and III-IV (N = 57). Using the DESeq2 Bioconductor package [23], we performed differential expression analysis using the stage as the independent predictor in the negative binomial model. We then applied the regularized log transformation to robustly transform the count data to a log 2 scale to stabilize the variance and then filtered the resulting dataset to retain transcripts that had a mean expression > 0.5 and FDR< 0.10 from the stage I versus stages III/IV contrast.
We fit a regression-based variable inclusion indicator Bayesian ordinal model using a Beta(0.01, 0.19) hyperprior for the π j using the runjags package to run three parallel chains with 5000 burn-in, 5000 tuning steps, and thinned to keep every third step in the sampling process to reduce auto-correlation in our posterior samples, and we kept 9999 saved steps per chain. Convergence was assessed using Gelman and Rubin's potential scale reduction factor (PSRF).

Results
There were 1137 transcripts that were differentially expressed at a Benjamini-Hochberg FDR< 0.05 and 2009 transcripts that were differentially expressed at a Benjamini-Hochberg FDR< 0.10 when examining the contrast between stage I and stages III/IV. These 2009 transcripts were retained for Bayesian modeling. Forty transcripts had a Bayes factor > 4 when testing H 0j : |γ j β j | ≤ 0.1 versus H aj : |γ j β j | > 0.1. Forty-one transcripts had a Bayes factor > 4 when testing H 0j : γ j = 0 versus H aj : γ j = 1 (Table 1). Notably, the features were the same with the exception that Bayes factor testing γ j = 0 additionally identified ENSG00000115548 (Gene symbol KDM3A).
Many genes listed in Table 1 are relevant to cervical cancer, related cancers of the female reproductive system, or cancer in general. For example, in a tissue-based study, CAPN6 was not detected in normal cervical squamous epithelium, but its expression was observed in low-grade and increased further in high-grade squamous cervical intraepithelial lesions [24]. KDM3A is an epigenetic regulator that has been found to be highly expressed in cervical cancer tissues and involved in cervical cancer progression [25]. P4HA1 was included in a five-gene signature to predict cervical cancer prognosis [26]. A previous study suggested that CMTM5 is a tumor suppressor that is frequently methylated and thus loses function in cancer [27], including cervical cancer [28]. RAB6C has been shown to be aberrantly methylated in cervical cancer compared to normal tissues [29]. ALDH7A1 was among 30 genes that demonstrated a dose-response pattern with NNK, a tobacco carcinogen, in cervical cancer samples [30], implicating tobacco may be a causative factor in cervical cancer development in addition to HPV infection.
Other genes, while not yet described in cervical cancer, have been found to be prognostic in ovarian cancer (RGS11 [31], CHAD and CBLN2 [32], NETO1 [33], HSPE1 [34], and BIRC6, which Lnc-TTC27-9 is intronic to [35]). The expression of SH3BP5 is reduced in ovarian cancer samples compared to normal tissue and that silencing of Sab protein expression may lead to chemo-resistance [36]. The expression of SNTN has high discriminatory power to differentiate between normal tissue, serous borderline ovarian tumors, and serous ovarian carcinoma [37]. IL22RA2 is highly expressed in various tissues, including those in the female reproductive system [38]. With respect to genes associated with other cancers, NXT2 was among 12 genes used to define prognostic risk groups in melanoma [39]. A review article described that the aberrant expression of HS3ST3B1 is observed in many cancers, and the authors posited that HS3ST3B1 may act as a tumorpromoting enzyme [40]. The expression of KRT85 was found to be associated with overall survival in subjects with colon cancer [41].
When using the fitted model using the 2009 transcripts, only 16.9% of subjects were misclassified, with all misclassifications in Stage II. However, when fitting a parsimonious model including only the 41 transcripts in Table 1, the misclassification rate decreased to 11.6%. For evaluating the effectiveness of this multi-category classification, we evaluated the hypervolume under ROC manifold [42,43], which was 0.865 (95% CI: 0.800, 0.914) for the 41 transcript model, indicating good discrimination among the three stages.

Discussion
The ordinalbayes package is based on runjags and enables the user to easily fit penalized ordinal Bayesian cumulative logit models to high-dimensional datasets. The package includes methods for monitoring the mixing of chains (plot) and convergence (print). It also includes a summary function that permits the user to estimate the Bayes factor for testing an interval null hypothesis for β j and for testing the null that γ j = 0 to assist the user with variable selection. The coef function uses the posterior distribution to return summary estimates of the penalized β j and the γ j indicators. The predict (or equivalently, fitted) function can be used to obtain the estimated ordinal response probabilities as well as the predicted ordinal response level for each observation.
When applied to The Cancer Genome Atlas cervical cancer dataset, predictive performance was excellent. When restricting attention to only the 41 transcripts with a Bayes factor > 4, predictive performance yielded an overall misclassification error of 11.6%, though the misclassification error increased from 0% for Stage I and III/VI in the full model to 3.2% and 14.0%, respectively, in the reduced model. Interestingly, transcripts that were identified have known associations with cervical cancer, cancers of the female reproductive system, and other cancer in general. The syntax we used to analyze this dataset appears in the Appendix B.

Funding:
This research was funded by the National Cancer Institute of the National Institutes of Health under Award Number R03CA245771. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Availability Statement:
The TCGA-CESC high-throughput gene expression data can be downloaded using the TCGAbiolinks Bioconductor package.

Appendix A. Example
The ordinalbayes R package is available from CRAN and github (https://github.com/ kelliejarcher/ordinalbayes), where the latter includes installation instructions and the example code in this Appendix. For a toy example, subset data of the cervical cancer data are stored in the data.frame named cesc. This data.frame includes age_at_index, cigarettes_per_day, race, Stage, and expression of 41 transcripts. The regressionbased variable inclusion model with random prior to π can be fit after loading the ordinalbayes R package using the syntax: library("ordinalbayes") data("cesc") fit <-ordinalbayes(Stage~1, data = cesc, including the psrf to assess model convergence. Please note that to foster reproducibility of our output, we set the random seed. Subsequent runs using different seeds will produce different results due to the random nature of the MCMC sampling. To summarize the fitted model object, To obtain the γ estimates we used the following code: To obtain model predictions, phat<-predict(fit)

Appendix B. Reproducing CESC Results Using Bioconductor Objects
The data used in this example are stored in the finalSet object. Because this object was derived using the DESeq2 BioConductor package, we load it first. Please note that due to the use of the default parameters for the number of saved steps per chain (9999) and the large size of this dataset, the model took 3.2 days to run on a 13 inch MacBook Pro with four cores and 16GB RAM. For those interested in running examples using this package, a smaller version of these data, reducedSet, which includes the 41 transcripts, may be used instead. Alternatively, parameters related to the number of steps can be reduced.
The regression-based variable inclusion model with random prior to π was fit after loading the ordinalbayes R using the syntax: x=t(assay(finalSet)), model="regressvi", gamma.ind="random", c.gamma=0. 01, d.gamma=0.19, seed=26) Again note that to foster reproducibility of our output, we set the random seed. Subsequent runs using different seeds will produce different results due to the random nature of the MCMC sampling.
We then summarize the fitted model object and to identify which transcripts had a Bayes factor > 4 when testing H 0j : |γ j β j | ≤ 0. To obtain the γ estimates, we used the following code: x=t(assay(reducedSet)), model="regressvi", gamma.ind="random", c.gamma=100, d.gamma=1, seed=26) Because we were using gamma.ind="random", we changed the parameter values for the variable inclusion indicator hyperprior to c.gamma=100, d.gamma=1 ensure virtually all transcripts would be included in each model. If fitting a model using gamma.ind="fixed", the hyperprior pi.fixed=0.99 would accomplish the same thing. This smaller model only took 9.1 min to complete.
phat.reduced<-predict(fitted.regressvi.reduced) The mcca R package additionally includes a plot function that can be used to explore the plot of the three-dimensional ROC surface. The variance for the spike (set to some small positive value, e.g., 0.01)

Abbreviations
The following abbreviations are used in this manuscript: