Comparing the MCMC Efficiency of JAGS and Stan for the Multi-Level Intercept-Only Model in the Covariance- and Mean-Based and Classic Parametrization

Abstract: Bayesian MCMC is a widely used model estimation technique, and software from the BUGS family, such as JAGS, has been popular for over two decades. Recently, Stan entered the market with promises of higher efficiency fueled by more advanced and sophisticated algorithms. With this study, we want to contribute empirical results to the discussion about the sampling efficiency of JAGS and Stan. We conducted three simulation studies in which we varied the number of warmup iterations, the prior informativeness, and the sample sizes, and employed the multi-level intercept-only model in the covariance- and mean-based and in the classic parametrization. The target outcome was MCMC efficiency measured as effective sample size per second (ESS/s). Based on our specific (and limited) study setup, we found that (1) MCMC efficiency is much higher for the covariance- and mean-based parametrization than for the classic parametrization, (2) Stan clearly outperforms JAGS when the covariance- and mean-based parametrization is used, and (3) JAGS clearly outperforms Stan when the classic parametrization is used.


Introduction
Bayesian statistics is gaining popularity in many disciplines and is used for many different purposes, for instance, to include previous knowledge, to estimate otherwise intractable models, to model uncertainty (e.g., [1]), and to stabilize parameter estimates (e.g., [2]).
A popular software platform for Bayesian estimation is the BUGS family, including BUGS [3,4] (see [5] for an overview of the history of BUGS), WinBUGS [6], OpenBUGS [7,8], JAGS [9], and NIMBLE [10]. Monnahan et al. (p. 339, [11]) even call BUGS the "workhorse for Bayesian analyses in ecology and other fields for the last 20 years." More recently, the software Stan, whose development was inspired by the "pathbreaking programs" BUGS and JAGS (p. 538, [12]) and "motivated by the desire to solve problems that could not be solved in a reasonable time [. . .] using other packages" (p. 537, [12]), entered the market with promises of higher computational and algorithmic efficiency. Often, the superiority of Stan over JAGS, which is a more modern member of the BUGS family, is claimed to be due to more advanced MCMC algorithms. Whereas JAGS uses conjugate and slice sampling, Stan uses the No-U-Turn Sampler (NUTS; [13]), which is an adaptive variant of Hamiltonian Monte Carlo (HMC; [14]). A comprehensive illustration of NUTS and HMC can be found in the work of Monnahan et al. [11], and more detailed technical descriptions in the works of Nishio and Arakawa [15] and Betancourt [16].
There has been much debate on efficiency differences between Stan and JAGS, and some authors have explored this research question by conducting comparison studies. For instance, Carpenter et al. (p. 10, [17]) found that "Compared to BUGS and JAGS, Stan is often relatively slow per iteration but relatively fast to generate a target effective sample size." Monnahan et al. (p. 339, [11]) conclude that "[f]or small, simple models there is little practical difference between the two platforms, but Stan outperforms BUGS as model size and complexity grows," which is in line with Gelman et al.'s (p. 538, [12]) statement that "Stan is faster for complex models and scales better than Bugs or Jags for large data sets". However, Grant et al. [18] compared the performance (total time per effective sample size) of various software (including StataStan and JAGS) depending on the number of parameters in the Rasch and hierarchical Rasch model and found that "no one software option dominated" (p. 350, [18]). Additionally, Merkle et al. (p. 2, [19]) report that the original Stan implementation in the package blavaan "was not much faster or more efficient than the JAGS approach," and Wingfeet [20] concluded that "[n]either JAGS nor Stan came out clearly on top". Further, Bølstad [21] reports mixed results depending on the conjugacy of the priors, with JAGS beating Stan for a fully conjugate hierarchical model. For a completely non-conjugate model with t-distributions instead of normal distributions, the effect reversed, with Stan being much faster.
In summary, the competition between JAGS and Stan has not been finally decided and performance might depend on several factors, for instance, on the model itself, its complexity, number of parameters, priors, and the parametrization.

Purpose and Scope
The purpose of the present work is to contribute to the discussion about the efficiency of JAGS and Stan. To this end, we conducted three simulation studies in which we varied the number of warmup iterations, the informativeness of the prior distributions, the sample sizes, and the model parametrization, and compared the MCMC efficiency operationalized as the effective sample size per second (ESS/s) between JAGS and Stan. The targeted model was the multi-level intercept-only model, which is a popular model in, for example, psychological research and the building block for many more complex multi-level models.
The article is organized into the following sections. First, we describe our Simulation Study 1 including the data generation, the simulation design, the analysis approaches and procedures, and the results of this simulation study. As suggested by anonymous reviewers, we extended the scope of our work by adding Simulation Study 2 (in which we explored a small sample scenario) and Simulation Study 3 (in which we used another model parametrization). Second, we conclude with a discussion of our work. Annotated JAGS/rjags and Stan/rstan code and an example generated data file are provided in the Supplementary Material and Appendices A-G.

Simulation Study 1
In this simulation study, the MCMC efficiency of estimating the covariance- and mean-based parametrization of the multi-level intercept-only model with JAGS and Stan is explored.

Data Generation
The data generating model was the multi-level intercept-only model with overall mean µ = 0, level 2 variance σ²_θ = 1, residual variance σ² = 1, J = 1000 level 2 units, and P = 20 level 1 units:

y_jp = θ_j + ε_jp,  ε_jp ~ N(0, σ²)  (1)

θ_j ~ N(µ, σ²_θ)  (2)

where y_jp is the pth value of level 2 unit j, and θ_j is unit j's mean parameter. The number of generated data sets (replications) was N_repl = 1000. Each of these data sets was analyzed within all 16 design cells of the simulation design.
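This data-generating process can be sketched as follows. The sketch is in Python/NumPy purely for illustration (the study itself used R; the actual generating code is provided in Appendix G), with the study's design values as default arguments:

```python
import numpy as np

def generate_data(J=1000, P=20, mu=0.0, var_theta=1.0, var_resid=1.0, seed=0):
    """Generate one data set from the multi-level intercept-only model:
    theta_j ~ N(mu, var_theta), y_jp = theta_j + eps_jp, eps_jp ~ N(0, var_resid)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, np.sqrt(var_theta), size=J)  # level 2 unit means
    y = theta[:, None] + rng.normal(0.0, np.sqrt(var_resid), size=(J, P))
    return y  # J x P matrix: row j holds the P level 1 values of unit j
```

With these defaults, the total variance of the generated values is approximately σ²_θ + σ² = 2, and the variance of the unit means is approximately σ²_θ + σ²/P = 1.05.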

Analysis
The simulation study was conducted with the statistical software R [22]. For JAGS [23], the R package rjags [24] was used, and for Stan [25], the R package rstan [26]. The analysis model was similar to the data generating model, but we used the covariance- and mean-based implementation of the multi-level intercept-only model [27]:

S ~ W(Σ, J − 1)  (3)

ȳ ~ MVN(µ, Σ/J)  (4)

where S is the sample scatter matrix, ȳ is the sample mean vector, W is the Wishart distribution, and where Σ and µ are the model-implied covariance matrix and mean vector:

Σ = σ²_θ 11′ + σ² I  (5)

µ = µ1  (6)

where 1 is a P-vector of ones and I is the P × P identity matrix. The inverse of Σ is:

Σ^(−1) = C = (1/σ²) I − (σ²_θ/ξ) 11′  (7)

with ξ = Pσ²_θσ² + σ⁴.
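The closed-form inverse in Equation (7) is what makes this parametrization cheap: it avoids numerically inverting the P × P matrix Σ. The identity can be checked numerically as follows (an illustrative sketch in Python/NumPy; the study itself used R):

```python
import numpy as np

P, var_theta, var_resid = 20, 1.0, 1.0            # P, sigma^2_theta, sigma^2
ones = np.ones((P, P))                            # the matrix 11'
Sigma = var_theta * ones + var_resid * np.eye(P)  # model-implied covariance (Eq. 5)

xi = P * var_theta * var_resid + var_resid**2     # xi = P*sigma^2_theta*sigma^2 + sigma^4
C = np.eye(P) / var_resid - (var_theta / xi) * ones  # closed-form inverse (Eq. 7)

# C matches the numerically computed inverse of Sigma
assert np.allclose(C, np.linalg.inv(Sigma))
```

The same check passes for arbitrary positive variance values, since (σ²_θ 11′ + σ² I)C = I holds algebraically.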
This "covariance- and mean-based approach" [27] is conceptually similar to the "marginal Stan method" [19], as both groups of authors capitalize on the idea of integrating latent variables out of the model likelihood; the approach has, for instance, been shown to be beneficial for MCMC estimation of continuous-time models [28]. Throughout the paper, we consistently use the terms "covariance- and mean-based" and "classic" as defined in the work of Hecht et al. [27], although other labels exist. The former approach is also called the "marginal" approach, as parameters are integrated out of the likelihood. Formulating models for continuous variables in terms of multivariate normal and Wishart distributions has also previously been described in, for instance, the work of Goldstein [29]. Hecht et al. [27] created the term "classic" to distinguish approaches that include certain parameters from the ones in which they are integrated out.
The parametrization of the Wishart distribution differs in JAGS and Stan. Whereas JAGS's dwish(Σ^(−1), df) function uses the inverse covariance matrix, Stan's wishart(df, Σ) function uses the covariance matrix. To make the comparison between JAGS and Stan as fair as possible, we avoided time-consuming in-software matrix inversion by passing the parametrization-consistent matrix to the functions, that is, matrix C from Equation (7) for setting up the model in JAGS and matrix Σ from Equation (5) for setting up the model in Stan. Avoiding the inverse of large covariance matrices has also been recommended by Goldstein [29]. Each replication ran on one Intel Xeon E5-2687W (3.0 GHz) CPU of a 64-bit Windows Server 2008 system with a total of 48 CPUs, on which we ran 36 replications (each on one core) in parallel. Run times for the following process steps were obtained. For JAGS, the warmup run time [jags.model()] and the sampling run time [jags.samples()] were determined by the before-after difference of time stamps obtained from the function Sys.time(). For Stan, the warmup and sampling run times can be obtained from the console output of the function sampling() and are also retrievable from the returned results object (attributes(fitted@sim$samples[[1]])[["elapsed_time"]]). Additionally, we recorded the time to set up and compile the Stan model [stan_model(), stanc()] and also the run time of the function sampling() via Sys.time() differences. We report all run times in seconds. The number of warmup iterations [set via argument n.adapt in jags.model() and argument warmup in Stan's sampling()] was varied in the simulation design (see above), whereas the number of sampling iterations [set via argument n.iter in jags.samples() and argument iter in Stan's sampling()] was 100,000. Only these 100,000 sampling iterations served as the basis for computing further statistics (see next paragraph).
Hence, the warmup iterations were excluded from further processing and can be considered as omitted "burn-in". Whether omission of additional burn-in was needed or not was determined by visual inspections of trace plots and based on convergence statistics (potential scale reduction (PSR)). One chain per parameter was used without thinning. Starting values were the true parameter values (see Data Generation above).
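The before-after timestamp pattern used for the JAGS run times can be sketched generically as follows (Python here for illustration; the study wrapped jags.model() and jags.samples() with Sys.time() in R, and run_sampler below is a hypothetical stand-in for such a sampling call):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds), measured via
    before-after timestamps around the call."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# hypothetical usage:
# samples, sampling_seconds = timed(run_sampler, model, n_iter=100_000)
```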
The effective sample size (ESS) and the PSR were computed with the R package shinystan [30] for both JAGS and Stan samples. The mode of the converged chain served as the parameter estimate. As parameter recovery and precision statistics, bias, root mean squared error (RMSE), and the 95% coverage rate were computed. Our target outcome variable is the MCMC efficiency, which we calculated as the ratio of the ESS and the run time (in seconds) of the sampling on one CPU (a similar definition of MCMC efficiency can be found, for example, in the work of Turek et al. [31]). Carpenter et al. (p. 10, [17]) termed this ESS/s (or its inverse) the "most relevant statistic for comparing the efficiency of sampler implementations" because the estimation accuracy is governed by the square root of the ESS, a fact that has also been shown by Zitzmann and Hecht [32] (see also [33]).
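The two quantities used here can be written out directly. Below is a minimal sketch (Python; the numbers are illustrative, not the study's results) of the MCMC efficiency ratio and of the Monte Carlo standard error that links estimation accuracy to the square root of the ESS:

```python
import math

def mcmc_efficiency(ess, seconds):
    """MCMC efficiency: effective sample size generated per second (ESS/s)."""
    return ess / seconds

def monte_carlo_se(posterior_sd, ess):
    """Monte Carlo standard error of a posterior mean estimate; halving the
    error requires quadrupling the ESS (square-root relationship)."""
    return posterior_sd / math.sqrt(ess)

# illustrative values: 5000 effective draws in 4 s of sampling
print(mcmc_efficiency(5000, 4))   # 1250.0 ESS/s
print(monte_carlo_se(2.0, 400))   # 0.1
```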

Results
Descriptive statistics (based on the 100,000 sampling iterations) are presented in Table 1. The maximum PSR value is 1.0002 (JAGS) and 1.0003 (Stan), respectively, indicating that all chains had converged. Visual inspection of randomly selected trace plots confirmed trouble-free convergence. Mean ESS values were 74,334 for JAGS and 86,486 for Stan. Hence, Stan produced a higher ESS than JAGS on average. Considering that this higher ESS was achieved in much shorter time, Stan clearly samples more efficiently than JAGS on average (in our simulation setup). However, in addition to warmup and sampling, Stan's sampling() function took another 215 seconds on average before returning the samples. Thus, users need to wait much longer for the results of the sampling process when using rstan instead of rjags. Figure 1 shows the MCMC efficiency (i.e., the effective sample size produced in one second) split by the levels of the simulation factors and the model parameters. In addition to these marginal mean MCMC efficiencies, we present mean MCMC efficiencies for the simulation factors and model parameters split by software (JAGS and Stan) in Figure 2 to investigate interaction effects. The overall mean MCMC efficiency was 1180 ESS/s. Considering software, the average MCMC efficiency is 649 for JAGS and 1712 for Stan. Thus, Stan outperforms JAGS by roughly a factor of two and a half on average. Mean MCMC efficiency is lower for 150 warmup iterations (M_warmup=150 = 1034) than for 1000 warmup iterations (M_warmup=1000 = 1327). Effect sizes η² for the simulation factors are presented in Table 2. Software exhibits by far the highest variance explanation (68.4%). Warmup iterations and parameter explain 5.2% and 2.0% of the variance in MCMC efficiency, respectively, whereas variance explanation by prior informativeness is essentially zero.
Interactions with above-zero variance explanation are Software × Warmup Iterations (5.4%), Software × Parameter (4.5%), Warmup Iterations × Parameter (1.0%), and Software × Warmup Iterations × Parameter (1.0%). In summary, both software packages estimate the multi-level intercept-only model equally well, but Stan outperforms JAGS in the production of effective sample size per time unit. Further, Stan profits from more warmup iterations, the prior informativeness is practically unrelated to the MCMC efficiency, and the MCMC efficiency differs between model parameters.
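The reported η² values are proportions of variance in ESS/s explained by each simulation factor; for a single factor, this is the between-level sum of squares divided by the total sum of squares. A minimal sketch of that computation (Python/NumPy; the toy numbers are hypothetical, not the study's results):

```python
import numpy as np

def eta_squared(y, labels):
    """eta^2 for one factor: SS_between / SS_total of the outcome y."""
    y, labels = np.asarray(y, dtype=float), np.asarray(labels)
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_between = sum(
        (labels == lv).sum() * (y[labels == lv].mean() - grand) ** 2
        for lv in np.unique(labels)
    )
    return ss_between / ss_total

# toy data: efficiency outcomes fully separated by software -> eta^2 = 1
print(eta_squared([10, 10, 20, 20], ["JAGS", "JAGS", "Stan", "Stan"]))  # 1.0
```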

Simulation Study 2: Small Sample Size
In this simulation study, the MCMC efficiency of estimating the covariance- and mean-based parametrization of the multi-level intercept-only model with JAGS and Stan is explored for a small sample size scenario. The simulation design and the analysis strategy were similar to Simulation Study 1. The data generation was similar as well, except that the number of level 2 units was reduced to J = 100 and the number of level 1 units to P = 5.
The overall mean efficiency in this small sample scenario is 8343 ESS/s and thus roughly seven times higher than in Simulation Study 1, where sample sizes were larger (J = 1000, P = 20). Investigating the mean MCMC efficiencies split by the simulation factors, model parameters, and software (Figure 3) yields basically the same pattern as in Simulation Study 1. Stan outperforms JAGS; however, JAGS's underperformance is not as pronounced as in the large sample scenario. Concerning warmup, the figure again shows that JAGS does not profit from more warmup iterations, whereas Stan does. Prior informativeness exhibits no clear effect. An interaction of parameter and software can again be identified. Whereas JAGS performs better in efficiently estimating µ than σ²_θ and σ², Stan is approximately equally efficient in estimating all three model parameters.
In summary, reducing the sample size resulted in higher overall MCMC efficiency, but software differences in MCMC efficiency remained, although JAGS caught up somewhat to Stan.

Simulation Study 3: Classic Parametrization
In this simulation study, the MCMC efficiency of estimating the classic parametrization of the multi-level intercept-only model with JAGS and Stan is explored. The data generation and the simulation design were similar to Simulation Study 1. For the analysis model, the classic parametrization was now chosen, in which the parameters of the level 2 units (θ_j in Equations (1) and (2)) were part of the model formulation and thus needed to be sampled (see [27] for further details on the differences between the classic and the covariance- and mean-based parametrization of the multi-level intercept-only model). With J = 1000 level 2 units, this meant that 1000 additional parameters needed to be sampled (an increase by a factor of 334 compared to Simulation Study 1). Hence, to keep the simulation within manageable boundaries, the number of sampling iterations was reduced to 10,000.
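To make the contrast concrete, the classic parametrization with its conjugate full conditionals (the structure JAGS can exploit) can be sketched as a minimal Gibbs sampler. This is an illustrative Python/NumPy sketch with flat and inverse-gamma priors chosen for simplicity, not the study's JAGS code or priors:

```python
import numpy as np

def gibbs_classic(y, n_iter=500, seed=0):
    """Minimal Gibbs sampler for the classic parametrization:
    y_jp = theta_j + eps_jp, theta_j ~ N(mu, s2t), eps_jp ~ N(0, s2).
    Illustrative priors: flat on mu, InvGamma(1, 1) on both variances."""
    rng = np.random.default_rng(seed)
    J, P = y.shape
    unit_means = y.mean(axis=1)
    mu, s2t, s2 = 0.0, 1.0, 1.0
    draws = np.empty((n_iter, 3))
    for i in range(n_iter):
        # theta_j | rest: precision-weighted compromise of unit mean and mu
        prec = P / s2 + 1.0 / s2t
        theta = rng.normal((P * unit_means / s2 + mu / s2t) / prec,
                           np.sqrt(1.0 / prec))
        # mu | rest (flat prior): normal around the mean of the theta_j
        mu = rng.normal(theta.mean(), np.sqrt(s2t / J))
        # s2t | rest: inverse gamma (draw a gamma on the precision scale)
        s2t = 1.0 / rng.gamma(1.0 + J / 2,
                              1.0 / (1.0 + ((theta - mu) ** 2).sum() / 2))
        # s2 | rest: inverse gamma
        s2 = 1.0 / rng.gamma(1.0 + J * P / 2,
                             1.0 / (1.0 + ((y - theta[:, None]) ** 2).sum() / 2))
        draws[i] = mu, s2t, s2
    return draws
```

The sketch illustrates why the classic parametrization is expensive: every iteration must update all J unit means θ_j in addition to the three structural parameters.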
Compared to Simulation Studies 1 and 2 (with the covariance- and mean-based parametrization), two major differences emerged: (1) the overall mean efficiency (137 ESS/s) is far lower, and (2) the software rank order reversed: when employing the classic parametrization, JAGS is much more efficient than Stan (see Figure 4). Again, Stan profits from more warmup iterations, whereas JAGS does not, and prior informativeness has no effect on MCMC efficiency. Concerning the parameters, there is no clear differential picture (i.e., within software, the parameters are estimated with roughly the same efficiency), although Stan seems to have problems with estimating the residual variance σ² (with just 6 ESS/s).
In summary, choosing the classic instead of the covariance- and mean-based implementation resulted in lower overall MCMC efficiency, and JAGS clearly outperformed Stan.

Discussion
With this study, we want to contribute empirical results to the discussion about the sampling efficiency of JAGS and Stan. In our limited simulations, Stan outperformed JAGS (i.e., exhibited a higher ESS/s) for the covariance- and mean-based parametrization of the multi-level intercept-only model. However, when the classic parametrization was chosen, JAGS outperformed Stan. Additionally, we found that Stan profited from more warmup iterations and that prior informativeness had no effect on the MCMC efficiency.
The results from our simulation study most certainly will not generalize to other models (or model parametrizations), conditions, or other values/levels of our simulation factors.
Our model was a very simple one. Stan is often said to gain in efficiency with increasing model complexity (e.g., [11,12]). We found higher efficiency for Stan (roughly by a factor of 2.5) already for this simple model. It would be interesting to see whether Stan outperforms JAGS even more for more complex multi-level models and other models relevant for psychological research.
Besides the model itself, the parametrization of the model is crucial as well. We showed that with the classic parametrization of the multi-level intercept-only model, JAGS clearly outperformed Stan. As the classic parametrization is presumably the most intuitive and easiest to implement for psychologists, the software recommendation clearly leans towards JAGS here. However, as shown, users can speed up their analyses (i.e., more efficiently estimate the model parameters) by switching to the covariance- and mean-based parametrization (a comprehensive tutorial on how to set up this parametrization is given by Hecht et al. [27]). For this parametrization, Stan clearly outperforms JAGS and should be the software of choice (if MCMC efficiency is the target criterion). Our results are in line with those of Merkle et al. [19], who reported superiority of their "new Stan approach", which is conceptually similar to Hecht et al.'s [27] covariance- and mean-based approach. The efficiency of this new approach was even so convincing that the authors of the R package blavaan made it their new default method (p. 12, [19]). However, in contrast to Hecht et al. ([27], Equations (13) and (14)), Merkle et al. ([19], Equations (5) and (6)) use multivariate normal marginal distributions of the sample value vectors (presumably because this is a more flexible and convenient approach, for instance, for handling missing data). Future research could investigate performance differences between the Wishart and the multivariate normal parametrization. From our (unpublished) experience with JAGS, multivariate normal modeling of the sample value vectors was much slower than Wishart modeling of the sample scatter matrix, but this might be different for Stan. Future research could compare more parametrizations of multi-level models.
An anonymous reviewer pointed us to an interesting advantage of marginalized model parametrizations which is detailed in the work of Nielsen et al. [34]. In the classic parametrization that includes random effects, positive within-cluster correlations are assumed. The covariance-based parametrization, in which random effects are integrated out, is more general because negative correlations are allowed as well.
Another approach to improve the efficiency of MCMC sampling is to formulate the model in a way that autocorrelation in the chains is reduced. Various strategies for various models have been proposed (e.g., [35][36][37]). According to Monnahan et al. (p. 344, [11]), "MCMC efficiency for hierarchical models depends on the random effects parameterization, with the centered and non-centered complementary forms being useful for a broad class of models". In our simulations, we solely used the centered form (Equation (2)) because it is presumably the "natural" form researchers obtain when they translate their model equations into code. In the alternative non-centered form, the random effects (θ_j) would not be modeled directly but instead as θ_j = µ + σ_θ·Z with Z ~ N(0, 1). Results from Monnahan et al.'s (p. 346, [11]) comparisons of JAGS and Stan suggest that Stan is "more sensitive to the parameterization of the random effects, suggesting analysts use non-centered parameterizations to improve performance". Future research could generate further evidence on how to improve MCMC efficiency by model reformulations and explore which software profits most.
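The two forms can be illustrated as follows (a Python/NumPy sketch for illustration; sd_theta denotes the standard deviation of the random effects). Drawing the random effects directly (centered) and drawing a standardized Z and rescaling (non-centered) imply the same distribution for θ_j; the reparametrization only changes the geometry the sampler works on:

```python
import numpy as np

rng = np.random.default_rng(1)
J, mu, sd_theta = 100_000, 0.5, 2.0

# centered form: theta_j ~ N(mu, sd_theta^2)
theta_centered = rng.normal(mu, sd_theta, size=J)

# non-centered form: Z ~ N(0, 1), theta_j = mu + sd_theta * Z
z = rng.normal(0.0, 1.0, size=J)
theta_noncentered = mu + sd_theta * z

# both parameterizations imply the same distribution for theta_j
assert abs(theta_centered.mean() - theta_noncentered.mean()) < 0.05
assert abs(theta_centered.std() - theta_noncentered.std()) < 0.05
```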
Concerning sample sizes, other studies (e.g., [11]) report that Stan's relative efficiency over JAGS increases with increasing sample size. Our results are in line with this finding (for the covariance- and mean-based parametrization); for our small sample size scenario, Stan did not outperform JAGS as much as in the large sample size scenario. As we only had two sample size scenarios (and varied sample size only for the covariance- and mean-based parametrization), generalizations are limited. Future research could investigate the dependency of efficiency on sample size and the moderating effect of sample size on the difference in efficiency between both software in more detail, especially also with respect to the model parametrization. Additionally, the relative performance of different methods might depend on the ratio of the variance components in the simulation model.
We used a high number of iterations to achieve sensibly sized absolute runtimes and to obtain reliable ESS estimates [38]. We assume that the MCMC efficiency (ESS/s) is constant over the course of sampling. Hence, given the chain has converged, the length of the sampling should have no effect on the MCMC efficiency.
We limited our study to two Bayesian software packages popular in psychological research, namely JAGS and Stan. Other Bayesian software packages, for instance, NIMBLE [10], PyMC3 [39], and LaplacesDemon [40], exist and have been the focus of research (e.g., [38], [41][42][43]). Future software comparisons should take these packages even more strongly into account.
The effect of the warmup iterations on Stan's MCMC efficiency is in line with the functioning of the NUTS algorithm, as this algorithm needs to tune the step size to achieve a target acceptance rate and to tune the mass matrix, whose function is to transform the posterior to have a simpler geometry for sampling (e.g., [11]). As the optimization of the step size and the mass matrix are mutually dependent, a sufficiently long warmup is needed [11]. We had only two warmup iteration sizes (150 and 1000), with 150 being the lowest default size for one warmup cycle (p. 150, [44]). An anonymous reviewer pointed out that the marginal variances of the parameters in our model are rather small; therefore, it is not surprising that one warmup cycle was not sufficient for optimization. In future research, it would be interesting to explore the shape of the functional dependency of the MCMC efficiency on the number of warmup iterations and determine areas of diminishing marginal utility to derive rule-of-thumb thresholds for sufficient warmup of the NUTS algorithm. Additionally, one could facilitate warmup optimization by programming all parameters "so that they have unit scale and so that posterior correlation is reduced; [. . .] For Hamiltonian Monte Carlo, this implies a unit mass matrix, which requires no adaptation as it is where the algorithm initializes." (p. 266, [45]). Further, the shown warmup dependency should be taken into consideration when comparing results from studies that differed in this aspect. Merkle et al. (p. 8, [19]), who used 300 warmup iterations for Stan, already pointed to this problem and concluded that therefore "the ESS/s metric is somewhat crude".
We found no effect of the prior informativeness on the MCMC efficiency. However, this result is only valid for the specific four variations of prior informativeness (and all other specifications) in our simulation setup and might not generalize. In fact, in simulation runs with much lower informativeness (not reported), Stan's MCMC efficiency was lower. Future research could investigate the effect of prior informativeness on efficiency with a wider range of informativeness. Additionally, in our simulation, the amount of data was rather high, marginalizing the effect of the priors. In scenarios with less data, prior effects might arise.
The conjugacy of priors may play a role in MCMC efficiency as well. Using conjugate priors is usually considered computationally more efficient than using non-conjugate priors. Some software/algorithms might even profit from conjugate priors more than others. For instance, results from Bølstad [21] suggest that JAGS performs better than Stan when priors are conjugate and worse when priors are non-conjugate. In our simulations, we used conjugate priors for all parameters in all conditions, leaving prior conjugacy a constant.
Thus, we cannot contribute to the discussion of the effect of prior conjugacy on MCMC efficiency. Researchers could pick up this interesting topic in the future.
Although Stan was clearly more efficient than JAGS in the sampling phase, the total time users encounter until samples are returned depends on additional steps. Stan models need to be compiled into C++ code prior to sampling, which took considerable time in our study (on average, model compilation took longer than the sampling of 100,000 iterations). Of course, once compiled, models may be reused to avoid the extensive compilation time. Still, users who run a model for the first time (or are not aware that previously compiled models may be reused) must incur the compilation time, and rstan's primary user-level function stan(), which includes all processes, may mask the opportunity for reusing compiled models and contribute to a lengthy user experience. Further, we encountered a relatively large amount of additional time, besides warmup and sampling, consumed by rstan's sampling() function. In fact, the additional time was about twice the time for all other steps combined. We are not sure what this function needs this additional time for. According to an anonymous reviewer, the additional time is partly due to calculating ESS and PSR. As this needs to be done for JAGS as well, fair software comparisons would also need to take this into account. Future versions of rstan's sampling() function could include an option to return the samples directly after sampling or explicitly report the time needed for additional calculations of convergence and precision statistics.
To conclude, in our specific study setup, the picture concerning MCMC efficiency was mixed. Stan clearly outperformed JAGS when the covariance- and mean-based parametrization of the multi-level intercept-only model was used, and JAGS clearly outperformed Stan when the classic parametrization was used. In both software, MCMC efficiency is much higher for the covariance- and mean-based parametrization than for the classic parametrization.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/psych3040048/s1. The supplementary materials include an exemplary generated data set in Rdata format, JAGS and Stan code for the multi-level intercept-only model in the covariance- and mean-based and in the classic parametrization, R code to run the models with rjags and rstan, and R code to run simulations.

Author Contributions:
The authors declare the following contributions (as defined by http://credit.niso.org (accessed on 30 November 2021)) to this article: M.H.: conceptualization, data curation, formal analysis, investigation, methodology, project administration, software, supervision, validation, visualization, writing (original draft), writing (review and editing); S.W.: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, writing (review and editing); S.Z.: conceptualization, methodology, supervision, validation, writing (review and editing). All authors have read and agreed to the published version of the article.

Funding:
The authors received no financial support for the research, authorship, and/or publication of this article.

Informed Consent Statement: Not applicable.
Data Availability Statement: In this study, only generated data was used. The data generating code is provided in Appendix G.