ScholarWorks @ UTRGV ScholarWorks @ UTRGV

: Statistical analysis of multinomial data in complex datasets often requires estimation of the multivariate normal ( MVN ) distribution for models in which the dimensionality can easily reach 10–1000 and higher. Few algorithms for estimating the MVN distribution can offer robust and efﬁcient performance over such a range of dimensions. We report a simulation-based comparison of two algorithms for the MVN that are widely used in statistical genetic applications. The venerable Mendell-Elston approximation is fast but execution time increases rapidly with the number of dimensions, estimates are generally biased, and an error bound is lacking. The correlation between variables signiﬁcantly affects absolute error but not overall execution time. The Monte Carlo-based approach described by Genz returns unbiased and error-bounded estimates, but execution time is more sensitive to the correlation between variables. For ultra-high-dimensional problems, however, the Genz algorithm exhibits better scale characteristics and greater time-weighted efﬁciency of estimation.


Introduction
In applied multivariate statistical analysis one is frequently faced with the problem of estimating the multivariate normal (MVN) distribution (or, equivalently, integrating the MVN density) not only for a range of correlation or covariance structures, but also for a number of dimensions (i.e., variables) that can span several orders of magnitude. In applications for which only one or a few instances of the distribution, and of low dimensionality (n < ∼ 10), must be estimated, conventional numerical methods based on, e.g., Newton-Cotes formulae, Gaussian quadrature and orthogonal polynomials, or tetrachoric series, may offer satisfactory combinations of computational speed and estimation precision.
Increasingly, however, statistical analysis of large datasets requires many evaluations of very high-dimensional MVN distributions-often as an incidental part of some larger analysis-and places severe demands on the requisite speed and accuracy of numerical methods. We confront the need to estimate the high-dimensional MVN integral in statistical genetics, and particularly in genetic analyses of extended pedigrees (i.e., large, multigenerational collections of related individuals). A typical exercise is variance component analysis of a discrete trait (e.g., a qualitative or categorical measurement of some disease or other condition of interest) under a liability threshold model [1][2][3]. Maximum-likelihood estimation of the model parameters in such an application can easily require tens or hundreds of evaluations of the MVN distribution for which n ≈ 100-1000 or greater [4][5][6][7], and situations in which n ≈ 10,000 are not unrealistic.
In such problems the dimensionality of the model distribution is determined by the product of the total number of individuals in the pedigree(s) to be analyzed and the number of discrete phenotypes jointly analyzed [1,8]. For univariate traits studied in small pedigrees, such as sibships (sets of individuals born to the same parents) and nuclear families (sibships and their parents), the dimensionality is typically small (n ≈ 2-10), but analysis of multivariate phenotypes in large extended pedigrees routinely necessitates estimation of MVN distributions for which n can easily reach several thousand [2,3,7]. A single variance component-based linkage analysis of a univariate discrete phenotype in a set of extended pedigrees involves estimating these high-dimensional MVN distributions at hundreds of locations in the genome [3,9,10]. In these numerically-intensive applications, estimation of the MVN distribution represents the main computational bottleneck, and the performance of algorithms for estimation of the MVN distribution is of paramount importance.
Here we report the results of a simulation-based comparison of the performance of two algorithms for estimation of the high-dimensional MVN distribution, the widely-used Mendell-Elston (ME) approximation [1,8,11,12] and the Genz Monte Carlo (MC) procedure [13,14]. Each of these methods is well known, but previous studies have not investigated their properties for very large numbers of dimensions. Using conventional numerical desiderata of estimation accuracy, execution time, and computational efficiency, we examine the performance of these algorithms and identify aspects of the overall MVN context to which each method is particularly well suited.
In Section 2 we give a brief overview of techniques for estimating the MVN distribution. The ME and Genz MC algorithms are reviewed in Section 3. Procedures for exercising the algorithms and comparing their performance are described in Section 4, and results of the comparisons are presented in Section 5. In Section 6 we consider the interpretation and broader implications of our results for future applications.

Background
Numerical methods for estimation of the MVN distribution have a long and fascinating history of development and many interesting accounts from varied perspectives have been presented ( [15][16][17][18], and references therein). Classical approaches to the problem have generally been variations on a few standard methods [19,20]. Often, some form of numerical quadrature is involved, in which an estimate of the integral is formed by accumulating a weighted sum of integrand values at a sequence of abcissae covering the region of integration [21][22][23][24]. Tetrachoric series expansions [25,26] offer another approach to the problem, although these series may converge slowly, and in fact do not converge at all at some points in the correlation space for a given number of dimensions [27]. Other approaches have involved quadrature applied to an integral transformation of the tetrachoric series [19,28,29], and decomposition of the multidimensional probability into a product of conditional univariate probabilities [1,8,11,12,[30][31][32][33].
In practice, the utility and applicability of any algorithm for estimating the MVN distribution is overwhelmingly constrained by the dimensionality of the problem. Fast and accurate algorithms have been described for evaluation of the univariate and multivariate normal distributions for 'small' numbers of dimensions; for the frequently encountered cases of n = 1 [34] and n = 2 [35][36][37][38], several algorithms are available that can in principle provide any desired accuracy. For the case n > 2, several algorithms (error-bounded and not) based on quadrature have been developed and their relative performance compared [17,[21][22][23][24]. Monte Carlo approaches to the problem have also been developed that have desirable statistical properties and exhibit good scale properties with the number of dimensions [13,14,39,40].
As the number of dimensions reaches n ≈ 10, many approaches to estimating the MVN distribution become impractical. Conventional series approximations and quadrature methods grow unwieldy, and the computational burden for algorithms using these methods rapidly becomes prohibitive [13,14,19,20,41]. However, methods of estimation based on reduction or transformation of the joint n-variate distribution to a series of (typically univariate) integrals continue to scale favorably with the number of dimensions.

Algorithms
We examined the performance of two algorithms that appear particularly well suited to estimation of the high-dimensional MVN distribution. The first of these is the Mendell-Elston (ME) procedure (Algorithm 1), a deterministic, non-error-bounded procedure that approximates the MVN distribution as a sequence of conditional univariate normal integrals [1,8,11,12]. The second algorithm is the elegant transformation and estimation procedure described by Genz [13,14] (Algorithm 2). In this approach the original n-variate distribution is transformed into an easily sampled (n − 1)-dimensional hypercube and estimated by Monte Carlo methods (e.g., [42,43]). [12]. Estimate the standardized n-variate MVN distribution, having zero mean and correlation matrix R, between vector-valued limits s and t. The function φ(z) is the univariate normal density at z, and Φ(z) is the corresponding univariate normal distribution. See Hasstedt [12] for discussion of the approximation, extensions, and applications.

1.
input n, R, s, t [condition the remaining variables] for j = i + 1, . . . , n, k = j + 1, . . . , n The ME approximation is extremely fast, and broadly accurate over much of the parameter space [1,8,17,41]. The chief source of error in the approximation derives from the assumption that, at each stage of conditioning, the selected and unselected variables continue to distribute in approximately normal fashion [1]. This assumption is analytically true only for the initial stage(s) of selection and conditioning [17]; in subsequent stages the assumption is violated to greater or lesser degree and introduces error into the approximation [31,33,44,45]. Consequently, the ME approximation is most accurate for small correlations and for selection in the tails of the distribution, thereby minimizing departures from normality following selection and conditioning. Conversely, the error in the ME approximation is greatest for larger correlations and selection closer to the mean [1]. [13]. Estimate the m-variate MVN distribution having covariance matrix Σ, between vectorvalued limits a and b, to an accuracy with probability 1 − α, or until the maximum number of integrand evaluations N max is reached. The procedure returns the estimated probability F, the estimation error ∆, and the number of iterations N. The function Φ(x) is the univariate normal distribution at x, Φ −1 (x) is the corresponding inverse function; u( ) is a source of uniform random deviates on (0, 1); and Z α/2 is the two-tailed Gaussian confidence factor corresponding to α. See Genz [13,14] for discussion, a worked example, and suggestions for optimizing algorithm performance.

Algorithm 2 Genz Monte Carlo Estimation of the MVN Distribution
Despite taking somewhat different approaches to the problem of estimating the MVN distribution, these algorithms have some features in common. Most significantly, both algorithms reformulate the initial n-dimensional integral as a series of univariate integrals. This feature facilitates imposing an initial ordering of variables to minimize the potential loss of precision as the integral estimate is accumulated. In similar fashion, prioritizing variables appropriately can also help minimize error in the ME method introduced by violations of the assumptions underlying the method [17].

Program Implementation
Programs implementing the ME and MC approximations were written in ANSI C following published algorithms [12,13]. Implementation of the ME approximation follows the procedure described by Hasstedt [12] for likelihood evaluation of arbitrary mixtures of MVN densities and distributions. Although the algorithm in [12] is presented in the context of statistical genetics, it is a completely general formulation of the ME method and suitable for any application requiring estimation of the MVN distribution. Implementation of the MC approximation directly follows the algorithm presented by Genz [13].
To facilitate testing a simple driver program was written for each algorithm. The driver program accepts arguments defining the estimation problem (e.g., number of dimensions, correlations, limits of integration), and any algorithm-specific parameters (e.g., convergence criteria). The driver program then initializes the problem (i.e., generates the correlation matrix and limits of integration), calls the algorithm, records its execution time, and reports results. For the deterministic ME algorithm there are no essential user options; the only input quantities are those defining the MVN distribution and region of integration. The driver program for the Genz MC algorithm provides options for setting parameters unique to Monte Carlo estimation such as the (maximum) error in the estimate and the (maximum) allowed number of iterations (integrand evaluations) [13].
The actual software implementation of the estimation procedures and their respective driver programs is not critical; experiments with multiple independent implementations of these algorithms have shown consistent and reliable performance irrespective of programming language or style [2,3,7,10,46]. Attention to programming esoterica-e.g., selective use of alternative numerical techniques according to the region of integration, supplementing iterative estimation with functional approximations or table lookup methods, devolving the original integral as a sequence of conditional oligovariate (rather than univariate) problems-could conceivably yield modest improvements in execution times in some applications.

Test Problems
For validating and comparing the MC and ME algorithms it is important to have a source of independently determined values of the MVN distribution against which to compare the approximations returned by each algorithm. For many purposes it may be sufficient to refer to tables of the MVN distribution that have been generated for special cases of the correlation matrix [15,18,[47][48][49][50][51]. Here, however, as in similar numerical studies [1,8,14,41], values of the MVN distribution were computed independently for correlation matrices defined by where n is the number of dimensions, I is the identity matrix, J = 11 is a matrix of ones, and ρ is a correlation coefficient. For R n of this form, the n-variate MVN distribution at b = (b 1 , . . . , b n ) can be reduced to the single integral where φ(t) is the univariate normal density at t and Φ(t) is the corresponding univariate normal distribution [18,47,49,50]. This result involves only univariate normal functions and can be computed to desired accuracy using standard numerical methods (e.g., [43]).
In the second series of comparisons, correlation matrices R n were generated with values of ρ drawn randomly from the uniform distribution U(0, 1) [52,53]; lower limits of integration remained fixed at a i = −∞, but upper limits b i were chosen randomly from the uniform distribution U(0, √ n ). For the Genz MC algorithm an initial estimate was generated using N 0 = 100 iterations (the actual value of N 0 was not critical); then, if necessary, iterations were continued (using N k+1 = 3 2 N k ) until the requested estimation accuracy was achieved [13,14]. Under the usual assumption that independent Monte Carlo estimates distribute normally about the true integral value I, the 1 − α confidence interval for I is I ± Z α/2 σ I / √ n, where I is the estimated value, σ I / √ n is the standard error of I, Z α/2 is the Monte Carlo confidence factor for the standard error, and α is the Type I error probability. Therefore, to achieve an error less than with probability 1 − α, the algorithm samples the integral until Z α/2 σ I / √ n < . For all results reported here we took α = 0.01, corresponding to Z α/2 ≈ 2.5758.

Test Comparisons
Three aspects of algorithm performance were compared: the error in the estimate, the computation time required to generate the estimate, and the relative efficiency of estimation. One can invent many additional interesting and contextually relevant comparisons examining various aspects of estimation quality and algorithm performance, but the criteria used here have been applied in other studies (e.g., [39]), are simple to quantify, broadly relevant, and effective for delineating areas of the MVN problem space in which each method performs more or less optimally.
The estimation error is the difference between the estimate returned by the algorithm and the independently computed expectation. The computation time is the execution time required for the algorithm to return an estimate; for the MC procedure this quantity includes the (comparatively trivial) time required to obtain the Cholesky decomposition of the correlation matrix. The relative efficiency is the time-weighted ratio of the variance in each estimate (see, e.g., [39]). Thus, if t MC and t ME , respectively, denote the execution times of the MC and ME algorithms, and σ 2 MC and σ 2 ME the corresponding mean squared errors in the MC and ME estimates, then the relative efficiency is defined as θ = (t ME σ 2 ME )/(t MC σ 2 MC ), i.e., the product of the relative mean-squared error σ 2 ME /σ 2 MC and the relative execution time t ME /t MC . The measure is somewhat ad hoc, and in practical applications the choice of algorithm should ultimately be informed by pragmatic considerations but-ceteris paribusvalues θ 1 tend to favor the Genz MC algorithm, and values θ 1 tend to favor the ME algorithm.

Computing Platforms
Numerical methods are of little use if they are ill-suited to the hardware available to the user. Both the ME and Genz MC algorithms involve the manipulation of large, nonsparse matrices, and the MC method also makes heavy use of random number generation, so there seemed no compelling reason a priori to expect these algorithms to exhibit similar scale characteristics with respect to computing resources. Algorithm comparisons were therefore conducted on a variety of computers having wildly different configurations of CPU, clock frequency, installed RAM, and hard drive capacity, including an intrepid Intel 386/387 system (25 MHz, 5 MB RAM), a Sun SPARCstation-5 workstation (160 MHz, 1 GB RAM), a Sun SPARCstation-10 server (50 MHz, 10 GB RAM), a Mac G4 PowerPC (1.5 GHz, 2 GB RAM), and a MacBook Pro with Intel Core i7 (2.5 GHz, 16 GB RAM). As expected, clock frequency was found to be the primary factor determining overall execution speed, but both algorithms performed robustly and proved entirely practical for use even with modest hardware. We did not, however, further investigate the effect of computer resources on algorithm performance, and all results reported below are independent of any specific test platform.

Error
The errors in the estimates returned by each method are shown in Figure 1 for a single 'replication', i.e., an application of each algorithm to return a single (convergent) estimate. The figure illustrates the qualitatively different behavior of the two estimation proceduresthe deterministic approximation returned by the ME algorithm, and the stochastic estimate returned by the Genz MC algorithm. In contrast, the error in the estimate returned by the ME method, though not generally excessive, is strongly systematic. For small correlations, or for moderate correlations and small numbers of dimensions, the error is comparable in magnitude to that from MC estimation but is consistently biased. For ρ > ∼ 0.3, the error begins to exceed that of the corresponding MC estimate, and the desired distribution can be significantly under-or overestimated even for a small number of dimensions. This pattern of error in the ME approximation reflects the underlying assumption of multivariate normality of both the marginal and conditional distributions following variable selection [1,8,17]. The assumption is viable for small correlations, and for integrals of low dimensionality (requiring fewer iterations of selection and conditioning); errors are quickly compounded and the approximation deteriorates as the assumption becomes increasingly implausible.
Although bias in the estimates returned by the ME method is strongly dependent on the correlation among the variables, this feature should not discourage use of the algorithm. For example, estimation bias would not be expected to prejudice likelihoodbased model optimization and estimation of model parameters, which are determined by the location of likelihood extrema. However, estimation bias could conceivably vitiate likelihood-ratio tests involving functions of the actual likelihood values. The latter may become of particular concern in applications that accumulate and compare likelihoods over a collection of independent data under varying model parameterizations.

Mean Execution Time
Relative mean execution time, t ME and t MC for the ME and MC algorithms respectively, is summarized in Figure 2 for 100 replications of each algorithm. As absolute execution times for a given application can vary by several orders of magnitude depending on com-puting resources, the figure presents the ratio t ME /t MC which was found to be effectively independent of computing platform. For estimation of the MVN in moderately few dimensions (n < ∼ 30) the ME approximation is exceptionally fast. The mean execution time of the MC method can be markedly greater-e.g., at n ≈ 10 about 10-fold slower for ρ = 0.1 and 1000-fold slower for ρ = 0.9. For small correlations the execution time of the MC method becomes comparable with that of the ME method for n ≈ 100. For the largest numbers of dimensions considered, the Monte Carlo method can be substantially faster-nearly 10-fold when ρ = 0.3 and nearly 20-fold when ρ = 0.1.
The scale properties of mean execution time for the ME and MC algorithms with respect to correlation and number of dimensions may be important considerations for specific applications. The ME method exhibits virtually no variation in execution time with the strength of the correlation, which may be an attractive feature in applications for which correlations are highly variable and the dimensionality of the problem does not vary greatly. For the MC method, execution time increases approximately 10-fold as the correlation increases from ρ = 0.1 to ρ = 0.9, but is approximately constant with respect to the number of dimensions. This behavior would be desirable in applications for which correlations tend to be small but the number of dimensions varies considerably.

Relative Performance
In view of the statistical virtues of the MC estimate but the favorable execution times for the ME approximation, it is instructive to compare the algorithms in terms of a metric incorporating both of these aspects of performance. For this purpose we use the time-and error-weighted ratio used described by Deák [39], and compare the performance of the algorithms for randomly chosen correlations and regions of integration (see Section 4.3). As applied here, values of this ratio greater than one tend to favor the Genz MC method, and values less than one tend to favor the ME method.
The relative mean execution times, mean squared errors, and mean time-weighted efficiencies of the MC and ME methods are summarized in Figure 3. Although ME estimates can be markedly faster to compute-e.g., ∼100-fold faster for n ≈ 100 and ∼10-fold faster for n ≈ 1000, in these replications)-the mean squared error of the MC estimates is consistently 10-100-fold smaller, and on this basis alone is the statistically preferable procedure. Measured by their time-weighted relative efficiency, however, the disparity in performance is less extreme; the ME algorithm is comparatively efficient for n < ∼ 100 dimensions, beyond which the MC algorithm becomes the more efficient approach.

Discussion
Statistical methodology for the analysis of large datasets is demanding increasingly efficient estimation of the MVN distribution for ever larger numbers of dimensions. In statistical genetics, for example, variance component models for the analysis of continuous and discrete multivariate data in large, extended pedigrees routinely require estimation of the MVN distribution for numbers of dimensions ranging from a few tens to a few tens of thousands. Such applications reflexively (and understandably) place a premium on the sheer speed of execution of numerical methods, and statistical niceties such as estimation bias and error boundedness-critical to hypothesis testing and robust inference-often become secondary considerations.
We investigated two algorithms for estimating the high-dimensional MVN distribution. The ME algorithm is a fast, deterministic, non-error-bounded procedure, and the Genz MC algorithm is a Monte Carlo approximation specifically tailored to estimation of the MVN. These algorithms are of comparable complexity, but they also exhibit important differences in their performance with respect to the number of dimensions and the correlations between variables. We find that the ME algorithm, although extremely fast, may ultimately prove unsatisfactory if an error-bounded estimate is required, or (at least) some estimate of the error in the approximation is desired. The Genz MC algorithm, despite taking a Monte Carlo approach, proved to be sufficiently fast to be a practical alternative to the ME algorithm. Under certain conditions the MC method is competitive with, and can even outperform, the ME method. The MC procedure also returns unbiased estimates of desired precision, and is clearly preferable on purely statistical grounds. The MC method has excellent scale characteristics with respect to the number of dimensions, and greater overall estimation efficiency for high-dimensional problems; the procedure is somewhat more sensitive to the correlation between variables, but this is not expected to be a significant concern unless the variables are known to be (consistently) strongly correlated.
For our purposes it has been sufficient to implement the Genz MC algorithm without incorporating specialized sampling techniques to accelerate convergence. In fact, as was pointed out by Genz [13], transformation of the MVN probability into the unit hypercube makes it possible for simple Monte Carlo integration to be surprisingly efficient. We expect, however, that our results are mildly conservative, i.e., underestimate the efficiency of the Genz MC method relative to the ME approximation. In intensive applications it may be advantageous to implement the Genz MC algorithm using a more sophisticated sampling strategy, e.g., non-uniform 'random' sampling [54], importance sampling [55,56], or subregion (stratified) adaptive sampling [13,57]. These sampling designs vary in their approach, applicability to a given problem, and computational overhead, but their common objective is to estimate the integral as efficiently as possible for a given amount of sampling effort. (For discussion of these and other variance reduction techniques in Monte Carlo integration, see [42,43].) Finally, in choosing between these or other procedures for estimating the MVN distribution, it is helpful to observe a pragmatic distinction between applications that are deterministic and those that are genuinely stochastic in nature. The computational merits of fast execution time, accuracy, and precision may be advantageous for the analysis of well-behaved problems of a deterministic nature, yet be comparatively inessential for inherently statistical investigations. In many applications, some sacrifice in the speed of the algorithm (but not, as Figure 1 reveals, in the accuracy of estimation) could surely be tolerated in exchange for desirable statistical properties that promote robust inference [58]. These properties include unbiased estimation of the likelihood, an estimate of error instead of fixed error bounds (or no error bound at all), the ability to combine independent estimates into a variance-weighted mean, favorable scale properties with respect to the number of dimensions and the correlation between variables, and potentially increased robusticity to poorly-conditioned covariance matrices [20,42]. For many practical problems requiring the high-dimensional MVN distribution, the Genz MC algorithm clearly has much to recommend it.