Locally Scaled and Stochastic Volatility Metropolis–Hastings Algorithms

Abstract: Markov chain Monte Carlo (MCMC) techniques are usually used to infer model parameters when closed-form inference is not feasible, with one of the simplest MCMC methods being the random walk Metropolis–Hastings (MH) algorithm. The MH algorithm suffers from random walk behaviour, which results in inefficient exploration of the target posterior distribution. This method has been improved upon, with algorithms such as the Metropolis Adjusted Langevin Algorithm (MALA) and Hamiltonian Monte Carlo being popular modifications of MH. In this work, we revisit the MH algorithm to reduce the autocorrelations in the generated samples without adding significant computational time. We present the (1) Stochastic Volatility Metropolis–Hastings (SVMH) algorithm, which uses a random scaling matrix in the MH algorithm, and (2) Locally Scaled Metropolis–Hastings (LSMH) algorithm, in which the scaling matrix depends on the local geometry of the target distribution. For both algorithms, the proposal distribution is still a Gaussian centred at the current state. The empirical results show that these minor additions to the MH algorithm significantly improve the effective sample rates and predictive performance over the vanilla MH method. The SVMH algorithm produces effective sample sizes similar to those of the LSMH method, with SVMH outperforming LSMH on an execution-time-normalised effective sample size basis. The performance of the proposed methods is also compared to MALA and the current state-of-the-art method, the No-U-Turn sampler (NUTS). The analysis is performed using a simulation study based on Neal's funnel and multivariate Gaussian distributions, and using real-world data modelled with jump diffusion processes and Bayesian logistic regression.
Although both MALA and NUTS outperform the proposed algorithms on an effective sample size basis, the SVMH algorithm has similar or better predictive performance than MALA and NUTS across the various targets. In addition, the SVMH algorithm outperforms the other MCMC algorithms on a normalised effective sample size basis on the jump diffusion datasets. These results indicate the overall usefulness of the proposed algorithms.


Introduction
Markov chain Monte Carlo (MCMC) algorithms have been successfully utilised in fields such as cosmology, finance, and health [1][2][3][4][5][6][7][8][9] and are preferable to other approximate techniques such as variational inference because they guarantee asymptotic convergence to the target distribution [10,11]. Examples of MCMC methods include, inter alia, Metropolis-Hastings [12], the Metropolis Adjusted Langevin Algorithm [4,13], Hamiltonian Monte Carlo [14], Shadow Hamiltonian Monte Carlo [15][16][17] and Magnetic Hamiltonian Monte Carlo [9,[18][19][20]. The execution times of MCMC algorithms are a significant issue in practice [21], and algorithms with low running times are preferable to methods with longer run times. The LSMH method proposed in this work sets the scaling matrix of the MH proposal using the local curvature of the target, which reduces the correlations in the samples generated by MH. However, this comes at the cost of a substantial increase in computation due to the calculation of the Hessian matrix of the posterior distribution.
The SVMH method sets the scaling matrix in MH to be random with a user-specified distribution, similarly to setting the mass matrix in Quantum-Inspired Magnetic Hamiltonian Monte Carlo [19,25] to be random to mimic the behaviour of quantum particles. This approach has the benefit of improving the exploration of MH without any noticeable increase in computational cost. This strategy has been shown in Quantum-Inspired Hamiltonian Monte Carlo [25] to improve sampling of multi-modal distributions. The weakness of this approach is that it requires the user to specify the distribution of the scale matrix and possibly tune the parameters of the chosen distribution.
The investigation in this manuscript is performed using a simulation study based on Neal's funnel of varying dimensions, multivariate Gaussian distributions of different dimensions, and real world data modeled using jump diffusion processes and Bayesian logistic regression. The jump diffusion model examined in this paper is the one-dimensional jump diffusion model of Merton [26]. The calibration of various stochastic processes using MCMC is an exciting research avenue. In future work, we plan on calibrating Levy processes [27] with stochastic volatility [28] features using various MCMC methods.
We compare the novel LSMH and SVMH methods to MH, the state-of-the-art No-U-Turn sampler (NUTS), and MALA. The performance is measured using the effective sample size, the effective sample size normalised by execution time, and predictive performance on unseen data, measured by the negative log-likelihood (NLL) on test data and the area under the receiver operating characteristic curve (AUC). Techniques that produce higher effective sample sizes and AUC are preferable, as are methods that produce lower NLL on test data.
The experimental results show that LSMH and SVMH significantly enhance the MH algorithm's effective sample sizes and predictive performance in both the simulation study and the real-world applications, that is, across all the target posteriors considered. In addition, SVMH does this with essentially no increase in execution time compared to MH. Given that only minor changes are required to create SVMH from MH, there seems to be relatively little impediment to employing it in practice instead of MH.
The results show that although both MALA and NUTS exceed the proposed algorithms on an effective sample size basis, the SVMH algorithm has similar or better predictive performance (AUC and NLL) than MALA and NUTS on most of the target densities considered. Furthermore, the SVMH algorithm outperforms MALA and NUTS on the majority of the jump diffusion datasets on a time-normalised effective sample size basis. These results illustrate the overall benefits that can be derived by using a well-chosen random scaling matrix in MH.

Methods
This section outlines the Markov chain Monte Carlo (MCMC) methods used in this work. We first present the random walk Metropolis-Hastings algorithm, followed by the Metropolis adjusted Langevin algorithm. We proceed to look at Hamiltonian Monte Carlo and the No-U-Turn sampler. We then present the two proposed methods: the Locally Scaled Metropolis-Hastings and Stochastic Volatility Metropolis-Hastings algorithms.

Random Walk Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is a basic and simple MCMC method and forms the essence of more complex MCMC algorithms, which use more advanced proposal distributions than MH. Suppose we are interested in determining the posterior distribution of the parameters w of some model M. The MH method produces proposed samples using a proposal distribution Q(w*|w). A new parameter state w* is accepted or rejected probabilistically given the current state w based on the posterior likelihood ratio [11,29]:

α(w*|w) = min( 1, π(w*)Q(w|w*) / ( π(w)Q(w*|w) ) )    (1)

where π(w) is the stationary or target distribution evaluated at w, for some arbitrary w.
Random walk Metropolis is a version of MH that utilises a Gaussian distribution centred at the current state as the proposal distribution [11]. The transition density in random walk Metropolis is N(w, εΣ), where Σ is the covariance of the proposal distribution and ε is a scale (step size) parameter [11]. Note that Σ is typically set to the identity matrix in practice. In this work, we tune ε to target an acceptance rate of 25% using the primal-dual averaging methodology outlined in Section 3.2.
When the transition density is symmetric, Q(w|w*) = Q(w*|w), and Equation (1) reduces to a ratio of posterior likelihoods [11]:

α(w*|w) = min( 1, π(w*) / π(w) )    (2)

This proposal typically results in random walk behaviour, which leads to high autocorrelations between the generated samples as well as slow convergence [11]. Algorithm 1 presents the pseudo-code for the MH algorithm.

Algorithm 1: The Metropolis-Hastings Algorithm
Data: w_init, ε, N and unnormalised target π(w)
for n = 1, ..., N do
    sample w* ~ N(w_{n-1}, εΣ)
    set w_n ← w* with probability:
        α(w*|w) = min( 1, π(w*)Q(w|w*) / ( π(w)Q(w*|w) ) )
    otherwise set w_n ← w_{n-1}
end

A significant drawback of the seminal MH method is the high autocorrelation between the generated samples. The high autocorrelations lead to slow convergence of the Monte Carlo estimates and, consequently, the need to generate large sample sizes. Approaches that enhance the classical MH algorithm reduce the random walk behaviour by utilising more information about the target posterior distribution [4]. The most popular extension is to incorporate first-order gradient information to guide the search of the target, as well as second-order gradient information to account for the local geometry of the posterior [4]. Examples of methods that leverage first-order gradient information are the Metropolis Adjusted Langevin Algorithm and Hamiltonian Monte Carlo [1,4,14,30]. We consider these two methods in more detail in the following sections.
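The accept/reject loop in Algorithm 1 can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code (which was written in PyTorch); the `log_target` callable and the Cholesky-based proposal are assumptions of this sketch, and the acceptance test is done in log space for numerical stability.

```python
import numpy as np

def metropolis_hastings(log_target, w_init, Sigma, n_samples, rng=None):
    """Random walk Metropolis-Hastings with Gaussian proposal N(w, Sigma).

    log_target returns the unnormalised log posterior log pi(w). Because
    the Gaussian proposal is symmetric, the acceptance ratio reduces to
    pi(w*) / pi(w), as in Equation (2).
    """
    rng = np.random.default_rng() if rng is None else rng
    chol = np.linalg.cholesky(Sigma)            # square root of the proposal covariance
    w = np.asarray(w_init, dtype=float)
    log_p = log_target(w)
    samples = np.empty((n_samples, w.size))
    for n in range(n_samples):
        w_star = w + chol @ rng.standard_normal(w.size)  # propose from N(w, Sigma)
        log_p_star = log_target(w_star)
        if np.log(rng.uniform()) < log_p_star - log_p:   # MH accept/reject step
            w, log_p = w_star, log_p_star
        samples[n] = w                                   # duplicate on rejection
    return samples
```

For example, running the sampler on a standard Gaussian target recovers its first two moments, up to Monte Carlo error.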

Metropolis Adjusted Langevin Algorithm
The MALA is an MCMC method that includes first-order gradient information of the target posterior to enhance the sampling behaviour of the MH algorithm [4,31]. MALA reduces the random walk behaviour of MH via the use of Langevin dynamics, which are as follows [4,31]:

dw_t = (1/2) ∇_w log π(w_t) dt + dZ_t

where π(w) represents the unnormalised target distribution, w is the position vector and Z_t is a Brownian motion process at time t. As the Langevin dynamics are in the form of a stochastic differential equation, we typically cannot solve them analytically and need to use a numerical integration scheme. The first-order Euler-Maruyama scheme is the most commonly used integration scheme, with the update equation given as [4,31]:

w_{t+1} = w_t + (ε²/2) ∇_w log π(w_t) + ε z_t

where ε is the integration step size and z_t ∼ N(0, I).
The Euler-Maruyama integration scheme does not provide an exact solution to the Langevin dynamics and produces numerical integration errors. These errors break detailed balance, and one ends up not sampling from the correct distribution. In order to ensure detailed balance, an MH acceptance step is required. The transition probability of the MALA method can be written as [31]:

P(w'|w) = N( w' ; w + (ε²/2) ∇_w log π(w), ε² I )

where P(w'|w) and P(w|w') are the forward and reverse transition probability densities, w is the current state and w' is the new proposed state. The acceptance rate of the MALA takes the form:

α(w'|w) = min( 1, π(w') P(w|w') / ( π(w) P(w'|w) ) )

Unlike the MH algorithm, the MALA takes advantage of the first-order gradient information of the target distribution, which makes the sampler converge to the target distribution more rapidly [4,31]. However, the generated samples are still highly correlated. Girolami and Calderhead [4] extended MALA from a Euclidean manifold to a Riemannian manifold, thereby incorporating second-order gradient information of the target. This approach showed considerable improvements over MALA, with an associated increase in compute time.
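A single MALA transition, combining the Euler-Maruyama proposal with the MH correction above, can be sketched as follows. This is a minimal sketch under the assumption of an isotropic proposal; `log_target` and `grad_log_target` are user-supplied callables, not part of the paper's implementation.

```python
import numpy as np

def mala_step(log_target, grad_log_target, w, eps, rng):
    """One MALA transition: Euler-Maruyama proposal plus MH correction.

    Proposal: w* = w + (eps^2/2) * grad log pi(w) + eps * z, z ~ N(0, I).
    The forward/reverse Gaussian transition densities enter the acceptance
    ratio because the gradient drift makes the proposal asymmetric.
    """
    def log_q(to, frm):
        # log density (up to a constant) of N(frm + (eps^2/2) grad, eps^2 I)
        mean = frm + 0.5 * eps**2 * grad_log_target(frm)
        return -0.5 * np.sum((to - mean) ** 2) / eps**2

    w_star = w + 0.5 * eps**2 * grad_log_target(w) + eps * rng.standard_normal(w.size)
    log_alpha = (log_target(w_star) + log_q(w, w_star)
                 - log_target(w) - log_q(w_star, w))
    if np.log(rng.uniform()) < log_alpha:
        return w_star
    return w
```

Iterating this step on a standard Gaussian target again recovers the target's moments.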
In the following section, we present Hamiltonian Monte Carlo, which can explore the posterior distribution more efficiently than MALA.

Hamiltonian Monte Carlo and the No-U-Turn Sampler
The Hamiltonian Monte Carlo (HMC) algorithm employs first-order gradient information of the posterior distribution, similarly to MALA, to navigate the parameter space [2,14]. However, unlike MALA, HMC adds an auxiliary momentum variable p to the parameter space w and uses Hamiltonian dynamics to evolve the system over time. Note that p and w have the same number of dimensions. This dynamical system results in a Hamiltonian H(w, p), which is given as [1]:

H(w, p) = U(w) + K(p)

where U(w) is the potential energy (given by the negative log-likelihood of the target) and K(p) is the kinetic energy, which is a Gaussian kernel with covariance matrix M [3]:

K(p) = (1/2) pᵀ M⁻¹ p

Hamilton's equations are used to generate the path of the Markov chain at time t as follows [1]:

dw/dt = ∂H/∂p = M⁻¹ p,    dp/dt = −∂H/∂w = −∇_w U(w)

These Hamiltonian dynamics ensure that the system conserves energy, is reversible, and preserves phase space volume. The dynamics typically have to be solved numerically, with the leapfrog integrator being the most common numerical scheme. The leapfrog integration scheme has the following update equations [3,14]:

p_{t+ε/2} = p_t − (ε/2) ∇_w U(w_t)
w_{t+ε} = w_t + ε M⁻¹ p_{t+ε/2}
p_{t+ε} = p_{t+ε/2} − (ε/2) ∇_w U(w_{t+ε})

The leapfrog integration scheme does not give an exact solution to the Hamiltonian dynamics and produces numerical integration errors. This results in the total energy not being conserved, so the detailed balance condition no longer holds. To ensure that detailed balance holds, a Metropolis-Hastings acceptance step is utilised. The overall process for generating a single sample from HMC is a Gibbs sampling scheme: we first generate the momentum variable from its Gaussian distribution, after which we sample new parameters based on the newly drawn momentum using the leapfrog integration scheme. This procedure is repeated until the desired number of samples has been generated.
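The leapfrog updates above can be sketched as a standalone integrator. This is an illustrative sketch rather than the paper's implementation; `grad_U` is the gradient of the potential energy, and the approximate energy conservation of the scheme can be checked directly on a Gaussian target.

```python
import numpy as np

def leapfrog(grad_U, w, p, eps, L, M_inv):
    """Leapfrog integration of Hamiltonian dynamics for L steps of size eps.

    U is the potential (negative log target); K(p) = 0.5 * p^T M^{-1} p.
    The half-step / full-step / half-step structure makes the map
    reversible and volume preserving.
    """
    w, p = w.copy(), p.copy()
    p -= 0.5 * eps * grad_U(w)        # initial half step for the momentum
    for _ in range(L - 1):
        w += eps * (M_inv @ p)        # full position step
        p -= eps * grad_U(w)          # full momentum step
    w += eps * (M_inv @ p)            # final position step
    p -= 0.5 * eps * grad_U(w)        # final half step for the momentum
    return w, -p                      # momentum flip for reversibility
```

On a standard Gaussian target (U(w) = wᵀw/2), the Hamiltonian after a trajectory differs from its starting value only by the O(ε²) integration error, which is exactly what the MH acceptance step corrects for.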
We have yet to address how one selects the step size ε and trajectory length L parameters of HMC. These parameters significantly influence the sampling performance of the algorithm [32]. The NUTS algorithm of Hoffman and Gelman [32] automates the tuning of the HMC step size and trajectory length parameters. A large step size typically results in most of the generated samples being rejected, while a small step size results in slow mixing [32]. When L is too small, HMC exhibits random walk behaviour, and when L is too large, the method wastes computational resources [32]. In the NUTS methodology, the step size parameter is tuned through primal-dual averaging during an initial burn-in phase. The trajectory length, on the other hand, is determined automatically by iteratively doubling the trajectory until the Hamiltonian becomes infinite or the chain starts to trace back [8,33], that is, when the last proposed position state w* starts moving closer to the initial position w. More information about the NUTS algorithm can be found in [32]. In this manuscript, we tune ε in NUTS to target an acceptance rate of 70% using primal-dual averaging, as discussed in Section 3.2.

Locally Scaled and Stochastic Volatility Metropolis-Hastings Algorithms
As highlighted previously, the MH algorithm experiences random walk behaviour, which causes the generated samples to have high autocorrelations. We attempt to address this by designing the scaling matrix in two distinct ways.
Firstly, we set Σ so that it depends on the local curvature of the target distribution, similarly to manifold MALA and Riemannian manifold Monte Carlo methods [4,7]. We call this algorithm the Locally Scaled Metropolis-Hastings (LSMH) algorithm. In this algorithm, we set Σ equal to the Hessian of the target posterior evaluated at the current state. In this work, for instances where the negative log-density is highly non-convex, such as Neal's funnel, we employ the SoftAbs metric to approximate the Hessian [34]. The SoftAbs metric needs an eigen-decomposition and all second-order derivatives of the target distribution, which leads to higher execution times [34,35]. However, the SoftAbs metric is more generally applicable, as it eliminates the restriction of the Riemannian metric to models for which the metric is analytically available [35].
Secondly, we consider Σ to be random and assume that it has distribution Σ ∼ P_Σ(Σ), with P_Σ(Σ) being a user-specified distribution, similarly to Quantum-Inspired Hamiltonian Monte Carlo [7,25]. We call this proposed algorithm the Stochastic Volatility Metropolis-Hastings (SVMH) algorithm. In this paper, we set the covariance matrix to be diagonal, with the diagonal components sampled from a log-normal distribution with mean zero and unit variance.
The pseudo-code for the LSMH and SVMH algorithms is the same as that of the MH algorithm in Algorithm 1, with the only exception being how Σ is treated. In particular, LSMH sets Σ to the (SoftAbs-approximated) Hessian of the target evaluated at the current state, while SVMH draws Σ afresh from P_Σ(Σ) at each iteration. We tune ε via primal-dual averaging [32], targeting an acceptance rate of 70%. Our pilot runs indicated that targeting other acceptance rates (e.g., 25% as with MH) did not materially change the overall results.
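The SVMH variant can be sketched by modifying the MH loop so that the diagonal scaling is redrawn at every iteration from the log-normal distribution described above. This is a minimal sketch, assuming a fixed step size `eps` rather than the tuned value; because the scale matrix is drawn independently of the current state, the marginal proposal remains symmetric and the simple likelihood-ratio acceptance still applies.

```python
import numpy as np

def svmh(log_target, w_init, n_samples, eps=0.5, rng=None):
    """Stochastic Volatility MH: the proposal scale is redrawn each iteration.

    The diagonal of the scaling matrix is sampled from a log-normal with
    mean zero and unit variance, as described in the text; eps plays the
    role of the tuned step size.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w_init, dtype=float)
    log_p = log_target(w)
    out = np.empty((n_samples, w.size))
    for n in range(n_samples):
        scale = rng.lognormal(mean=0.0, sigma=1.0, size=w.size)  # random diagonal scaling
        w_star = w + eps * scale * rng.standard_normal(w.size)   # Gaussian centred at w
        log_p_star = log_target(w_star)
        if np.log(rng.uniform()) < log_p_star - log_p:  # symmetric marginal proposal
            w, log_p = w_star, log_p_star
        out[n] = w
    return out
```

The only differences from plain MH are the single extra line drawing the random scale, which is why the execution times of the two methods are nearly indistinguishable.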
It becomes apparent that LSMH will require the most computational resources out of all the algorithms due to the requirement to calculate the Hessian of the target at each state. SVMH should have a similar execution time to MH because only one extra step is added to the MH algorithm. If the assumed distribution for the scale matrix is not too difficult to sample from, MH and SVMH should have almost indistinguishable computational requirements.

Experiments
This section describes the performance metrics, the primal-dual averaging methodology used to tune the parameters of the MCMC methods, the simulation study undertaken, and the tests performed on real world datasets. The investigation in this work is performed using a simulation study based on Neal's funnel of varying dimensions, multivariate Gaussian distributions of different dimensions, and real world data modeled using jump diffusion processes and Bayesian logistic regression.

Performance Metrics
The performance metrics used in this article are the multivariate Effective Sample Size (mESS) and the mESS normalised by the execution time [6,7,9,16,36]. For the real world datasets, we also assess the predictive performance of each algorithm. The execution time is defined as the time needed to generate the samples after the burn-in period. The predictive performance metrics used for the real world datasets are the negative log-likelihood (NLL) on test data and the Area Under the Curve (AUC) for classification datasets. Methods that produce higher effective sample sizes and AUC are preferable, as are algorithms with lower execution times and NLL.
We use the multivariate ESS calculation of Vats et al. [6,36] as a measure of the number of uncorrelated samples generated. This ESS calculation is superior to the minimum univariate ESS, commonly used in the literature, as it takes into account the correlations between all the parameter dimensions. The multivariate ESS used in their work is calculated as:

mESS = N ( |Λ| / |M| )^(1/D)

where N is the number of generated samples, D is the number of parameters, Λ is the sample covariance matrix and M is the estimate of the Markov chain asymptotic covariance (the squared Monte Carlo standard error). When D = 1, mESS is equivalent to the univariate ESS [36]. Note that when there are no correlations in the chain, we have |Λ| = |M| and mESS = N.
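The mESS formula above can be sketched with a simple batch-means estimate of M. This is a simplified illustration, not the estimator of Vats et al. in full: the batch size is fixed at ⌊√N⌋ and no correction terms are applied, whereas the reference treatment chooses the batch size more carefully.

```python
import numpy as np

def multivariate_ess(chain):
    """Multivariate ESS: mESS = N * (|Lambda| / |M|)^(1/D).

    Lambda is the sample covariance of the chain; M is a multivariate
    batch-means estimate of the asymptotic (Markov chain) covariance.
    chain has shape (N, D).
    """
    N, D = chain.shape
    b = int(np.floor(np.sqrt(N)))                  # batch size
    a = N // b                                     # number of batches
    Lambda = np.cov(chain, rowvar=False)
    batch_means = chain[: a * b].reshape(a, b, D).mean(axis=1)
    centred = batch_means - chain.mean(axis=0)
    M = b * (centred.T @ centred) / (a - 1)        # batch-means covariance estimate
    # work with log-determinants for numerical stability
    _, logdet_L = np.linalg.slogdet(Lambda)
    _, logdet_M = np.linalg.slogdet(M)
    return N * np.exp((logdet_L - logdet_M) / D)
```

For an independent (uncorrelated) chain, |Λ| ≈ |M| and the estimate returns approximately N, matching the remark above.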

Scale Matrix and Step Size Tuning
This section sketches how we tune the scale ε in the MH algorithm and the step size in the other MCMC algorithms. We utilise the primal-dual averaging methodology outlined in [37,38] during the burn-in phase. In primal-dual averaging, a user-specified MH acceptance rate δ is targeted via the following updates:

H̄_t = (1 − 1/(t + t₀)) H̄_{t−1} + (1/(t + t₀)) (δ − α_t)
log ε_t = µ − (√t / γ) H̄_t
log ε̄_t = η_t log ε_t + (1 − η_t) log ε̄_{t−1},   η_t = t^{−κ}

where µ is a free parameter that log ε_t shrinks towards, the rate of convergence towards µ is controlled by γ, the rate of adaptation decays according to η_t [38], and H̄_t tracks the difference between the target acceptance rate δ and the actual acceptance rate α_t. The updates drive the expected difference between the target and actual acceptance rates to zero, which moves the step size towards the target acceptance rate. Hoffman and Gelman [32] found that setting µ = log(10 ε₀), ε̄₀ = 1, H̄₀ = 0, γ = 0.05, t₀ = 10 and κ = 0.75, with ε₀ being the initial step size, results in good performance across various target distributions. These are the settings that we utilise in this paper, with ε₀ = 0.0001.
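The updates above can be sketched as a small adapter class. This is an illustrative sketch of the Hoffman and Gelman scheme with the default settings quoted in the text; the class name and interface are our own, not from the paper.

```python
import numpy as np

class DualAveraging:
    """Primal-dual averaging step-size adaptation (Hoffman & Gelman style).

    Targets an acceptance rate delta. Defaults follow the text:
    gamma = 0.05, t0 = 10, kappa = 0.75, mu = log(10 * eps0).
    """
    def __init__(self, eps0, delta=0.7, gamma=0.05, t0=10.0, kappa=0.75):
        self.mu = np.log(10.0 * eps0)
        self.delta, self.gamma, self.t0, self.kappa = delta, gamma, t0, kappa
        self.h_bar = 0.0          # running average of (delta - accept)
        self.log_eps_bar = 0.0    # log of the averaged step size, eps_bar_0 = 1
        self.t = 0

    def update(self, accept_prob):
        self.t += 1
        eta_h = 1.0 / (self.t + self.t0)
        self.h_bar = (1 - eta_h) * self.h_bar + eta_h * (self.delta - accept_prob)
        log_eps = self.mu - np.sqrt(self.t) / self.gamma * self.h_bar
        eta = self.t ** (-self.kappa)
        self.log_eps_bar = eta * log_eps + (1 - eta) * self.log_eps_bar
        return np.exp(log_eps)            # step size for the next iteration

    def final(self):
        return np.exp(self.log_eps_bar)   # averaged step size used after burn-in
```

Feeding the adapter an acceptance probability that decreases in the step size drives the averaged step size towards the value at which the acceptance rate equals δ.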

Simulation Study
In the simulation study, we aim to retrieve the posterior samples from multivariate Gaussian distributions and Neal's [39] funnel. That is, we sample from Gaussian distributions N(0, Σ) with mean zero and covariance matrix Σ. The covariance matrix Σ is diagonal, with the standard deviations drawn from a log-normal distribution with mean zero and unit standard deviation.
Neal [39] introduced the funnel density as an illustration of a density that reveals the problems that arise in Bayesian hierarchical and hidden variable models [40,41]. The model treats the variance of the parameters as a latent log-normal stochastic variable v [40,41]. The density of Neal's [39] funnel is given as:

π(v, x) = N(v | µ, σ²) ∏_{d=1}^{D−1} N(x_d | 0, eᵛ)

with µ = 0 and σ = 3. For the sampling from each of these two targets, we examined D ∈ {10, 20}. A total of 10,000 samples were generated after discarding 5000 samples as burn-in. Thirty independent chains were run for each of these targets and each value of D.
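The funnel density above is straightforward to evaluate and to sample exactly, which makes it a convenient ground truth for the MCMC runs. The following sketch (function names and the exact-sampling helper are our own additions) writes the unnormalised log density in the form used by gradient-free samplers.

```python
import numpy as np

def funnel_log_density(theta, sigma=3.0):
    """Unnormalised log density of Neal's funnel.

    theta[0] = v ~ N(0, sigma^2); theta[1:] | v ~ N(0, e^v) per coordinate.
    The funnel shape arises because the scale of the remaining
    coordinates depends exponentially on v.
    """
    v, x = theta[0], theta[1:]
    log_p_v = -0.5 * v**2 / sigma**2
    log_p_x = -0.5 * np.sum(x**2) * np.exp(-v) - 0.5 * x.size * v
    return log_p_v + log_p_x

def sample_funnel(n, dim, sigma=3.0, rng=None):
    """Exact samples from the funnel, useful as ground truth for MCMC."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(0.0, sigma, size=n)
    x = rng.standard_normal((n, dim - 1)) * np.exp(v / 2.0)[:, None]
    return np.column_stack([v, x])
```

Samplers that fail to visit the narrow neck of the funnel (small v) under-represent the left tail of the latent variable, which is precisely the pathology the simulation study probes.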

Real World Application
We used jump diffusion processes to model asset returns and Bayesian logistic regression to model binary classification tasks for the real world application.

Merton Jump Diffusion Model
It is well documented that financial asset returns do not follow a normal distribution but instead have fat tails [42,43]. Stochastic volatility models, Levy models, and combinations of these models have been utilised to capture the leptokurtic nature of asset return distributions [26,[44][45][46]. Jump diffusion processes were introduced into the financial markets literature by Merton and Press in the 1970s [26,43,47]. In this paper, we study the jump diffusion model of Merton [26], which is a one-dimensional Markov process {S_t, t ≥ 0} with the following dynamics [43]:

dS_t = µ S_{t⁻} dt + σ S_{t⁻} dB_t + S_{t⁻} d( ∑_{i=1}^{N_t} (e^{Y_i} − 1) )    (13)

where µ is the drift coefficient, σ is the diffusion coefficient, B_t and N_t are a standard Brownian motion process and a Poisson process with intensity λ, respectively, and Y_i ∼ N(µ_jump, σ²_jump) is the size of the ith jump. As shown in Mongwe [43], the transition density of the returns of the jump diffusion in Equation (13) is given as:

p(w | x) = ∑_{k=0}^{∞} [ e^{−λτ} (λτ)^k / k! ] (1/√(σ²τ + kσ²_jump)) φ( (w − x − µτ − kµ_jump) / √(σ²τ + kσ²_jump) )    (14)

where w and x are realisations of the (log) process at times t + τ and t, respectively, and φ is the probability density function of a standard normal random variable. The likelihood function is then given as:

L(θ) = ∏_{i=1}^{N} p(r_i | θ)    (15)

where N is the sample size and r_i is the ith observed return. The likelihood is multi-modal, as it is an infinite mixture of normally distributed stochastic variables with the mixing weights being probabilities from a Poisson distribution [43]. In this article, we truncate the infinite summation in Equation (14) to the first ten terms, as done in [43]. Furthermore, the jump diffusion model is calibrated to the historical financial market returns data outlined in Table 1.
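The truncated Poisson mixture of Gaussians in Equations (14) and (15) can be sketched as a vectorised log-likelihood. This is a minimal sketch under simplifying assumptions (in particular, the drift convention for the log-returns is our assumption, and no drift correction term is applied); the evaluation uses the log-sum-exp trick over the ten mixture components for numerical stability.

```python
import numpy as np
from math import factorial

def merton_log_likelihood(returns, mu, sigma, lam, mu_j, sigma_j, tau=1.0, k_max=10):
    """Log-likelihood of log-returns under Merton's jump diffusion.

    Conditional on k jumps over an interval of length tau, a log-return is
    N(mu*tau + k*mu_j, sigma^2*tau + k*sigma_j^2); the mixing weights are
    Poisson(lam*tau) probabilities, truncated at k_max terms as in the text.
    """
    r = np.asarray(returns, dtype=float)[:, None]          # shape (N, 1)
    k = np.arange(k_max)[None, :]                          # shape (1, k_max)
    log_fact = np.array([np.log(float(factorial(i))) for i in range(k_max)])
    log_pois = -lam * tau + k * np.log(lam * tau + 1e-300) - log_fact
    var = sigma**2 * tau + k * sigma_j**2
    mean = mu * tau + k * mu_j
    log_norm = -0.5 * np.log(2 * np.pi * var) - 0.5 * (r - mean) ** 2 / var
    comp = log_pois + log_norm                             # (N, k_max) mixture terms
    m = comp.max(axis=1, keepdims=True)                    # log-sum-exp over components
    return float(np.sum(m.squeeze(1) + np.log(np.sum(np.exp(comp - m), axis=1))))
```

As a sanity check, when the jump intensity λ is negligible the likelihood collapses to the plain Gaussian diffusion likelihood.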
In this work, we use MCMC methods to calibrate the jump diffusion process with the likelihood function in Equation (15) to data across different financial markets. A total of 1000 samples were generated after discarding 500 samples as burn-in. Thirty independent chains were run for each dataset.

Bayesian Logistic Regression
The real world binary classification datasets in Table 1 are modelled using Bayesian logistic regression. The negative log-likelihood l(D|w) for logistic regression is given as:

l(D|w) = − ∑_{i=1}^{N} [ y_i ln σ(wᵀx_i) + (1 − y_i) ln(1 − σ(wᵀx_i)) ]

where σ(·) is the sigmoid function, and N and D are the number of observations and dimensions, respectively. The unnormalised log posterior distribution is given as:

ln p(w|D) = −l(D|w) + ln p(w|α)

where ln p(w|α) is the log of the prior distribution on the parameters given the hyperparameters α. The parameters w are assigned a Gaussian prior distribution with a mean of zero and a variance of one hundred. A total of 10,000 samples were generated after discarding 5000 samples as burn-in. Thirty independent chains were run for each dataset.
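The log posterior above can be sketched in a numerically stable form. This is an illustrative sketch, not the paper's PyTorch implementation; it uses the identity y·z − log(1 + eᶻ) for the Bernoulli log-likelihood with sigmoid link, and the zero-mean Gaussian prior with variance 100 stated in the text.

```python
import numpy as np

def blr_log_posterior(w, X, y, prior_var=100.0):
    """Unnormalised log posterior for Bayesian logistic regression.

    X has shape (N, D), y contains 0/1 labels. np.logaddexp(0, z)
    computes log(1 + e^z) without overflow, so the log-likelihood
    y*z - log(1 + e^z) is stable for large |z|.
    """
    z = X @ w
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))    # Bernoulli log-likelihood
    log_prior = -0.5 * np.sum(w**2) / prior_var        # N(0, 100) prior per weight
    return log_lik + log_prior
```

This function (or its gradient, for MALA and HMC) is the only model-specific ingredient the samplers in this work require.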

Datasets
The datasets that we use in this paper are outlined in Table 1. The dataset consists of financial market data that we use to calibrate the jump diffusion process model and real-world classification datasets that we model using Bayesian logistic regression.
We calibrate the jump diffusion process to four datasets. These are single stock, cryptocurrency, stock index, and currency datasets. These datasets are daily closing prices from 1 January 2017 to 31 December 2020 retrieved from Google Finance [48], which we converted into log returns. The datasets were: MTN dataset which is a JSE listed stock, Bitcoin dataset which is a crypto-currency (in USD), S&P 500 dataset which is a stock index and USDZAR dataset which is a currency. Note that the formula used to calculate the log-returns is given as: where r i is the log return on day i and S i is the stock or currency level on day i. The descriptive statistics of the dataset are shown in Table 2. This table shows that the USDZAR dataset has a very low kurtosis, suggesting that it has very few (if any) jumps when compared to the other three datasets. There are four datasets that we modeled using Bayesian logistic regression. All the datasets have two classes and thus present a binary classification problem. The specifics of the datasets are: • Heart dataset-This dataset has 13 features and 270 data points. The purpose of the dataset is to predict the presence of heart disease based on medical tests performed on a patient [49]. • Australian credit dataset-This dataset has 14 features and 690 data points. This dataset aims to assess applications for credit cards [49]. • South African fraud dataset-This dataset is of audit findings of South African municipalities [50]. The dataset has 14 features, which are financial ratios, and 1560 data points. For this dataset, the aim is to classify the local government entities with fraudulent financial statements [51]. • German credit dataset-This dataset has 25 features and 1000 data points. This dataset aimed to classify a customer as either good or bad credit [49].
The jump diffusion process datasets used a time series of log returns as the input. The features for the Bayesian logistic regression datasets were normalised. We used 90% of the data to train the models and 10% to test them. The train-test split on the time series datasets is based on a cutoff date that places 90% of the data in the training set. The prior distribution over the parameters for all the jump diffusion process datasets was a standard normal, while for the logistic regression datasets the prior was a Gaussian with standard deviation equal to 10, as in Girolami and Calderhead [4].

Results and Discussion
We implemented all the models and algorithms in PyTorch. All the experiments were run on a machine with a 64-bit CPU. The sampling performance of the algorithms for the simulation study and the real world benchmark datasets is shown in Figure 1. The detailed results for the financial market data and the real world classification datasets across the different metrics are shown in Tables 3 and 4, respectively. Note that the execution time t is reported in Tables 3 and 4. Figure 2 shows the diagnostic negative log-likelihood trace plots for each sampler across the various targets. Figure 3 shows the autocorrelations produced by each MCMC method for the first dimension across the various targets. Figure 4 shows the predictive performance of each of the sampling methods on the real world classification datasets modelled using Bayesian logistic regression.
The first two rows in Figure 1 show the mESS and the mESS normalised by execution time for the multivariate Gaussian distributions and Neal's funnel used in the simulation study with increasing dimensionality D. The third and fourth rows in Figure 1 show the mESS and the time-normalised mESS for the financial market datasets modelled using jump diffusion processes, while rows five and six show the results for the real world classification datasets modelled using Bayesian logistic regression. Tables 3 and 4 provide the corresponding detailed results. The first two rows of Figure 1 show that, for all values of D on both targets, SVMH and LSMH outperform MH in terms of effective sample sizes, with SVMH and LSMH producing similar effective sample sizes. When execution time is taken into consideration, SVMH outperforms LSMH and MH. The results also show that the NUTS algorithm outperforms all the other MCMC methods on an effective sample size basis, while MALA outperforms all the other MCMC methods on a normalised effective sample size basis.
Rows three to six in Figure 1 show that, across all the real world datasets, SVMH and LSMH have similar effective sample sizes, with MH producing the smallest effective sample sizes. NUTS outperforms all the algorithms on an effective sample size basis. On a normalised effective sample size basis, SVMH outperforms all the algorithms on the jump diffusion datasets except for the USDZAR dataset. NUTS outperforms on the USDZAR dataset due to the lower kurtosis of this dataset, as seen in Table 2; NUTS appears to perform poorly on the datasets that have a high kurtosis.

Figure 2 shows that the SVMH algorithm converges to the same level as NUTS on the majority of the datasets. On the other hand, the MH and LSMH algorithms converge to higher (i.e., less optimal) negative log-likelihoods. Figure 3 shows that NUTS produces the lowest autocorrelations on all the targets except the jump diffusion datasets. MH produces the highest autocorrelations on the majority of the targets. The autocorrelations of the SVMH algorithm are higher than those of NUTS and MALA but approach zero more quickly than those of MH across the targets considered. Note that these are autocorrelations after the burn-in period.

Tables 3 and 4 show that, as expected, MH produces the lowest execution time, with SVMH a close second. LSMH is slower than SVMH due to the computation of the Hessian matrix at each state. The NUTS algorithm has the highest execution times on the jump diffusion datasets.

Tables 3 and 4, as well as Figures 4 and 5, show that, in terms of predictive performance using the negative log-likelihood on test data and the AUC, the SVMH algorithm outperforms all the algorithms on the bulk of the real world datasets. Note that the SVMH method outperforms all the methods, or has similar performance, on a test negative log-likelihood basis on the logistic regression datasets. The results also show that the MH algorithm progressively becomes worse as the dimensionality increases.
The predictive performance of the LSMH algorithm is not stable as dimensions increase, with its performance at times being close to the MH algorithm.
It is worth noting that although LSMH and SVMH outperform MH on several metrics, the LSMH and SVMH methods still produce low effective sample sizes compared with the required number of samples. This can potentially be improved by centering the proposal distribution at the current point plus a first-order gradient term, which should help guide the exploration (and decrease the autocorrelations in the generated samples), as in the Metropolis Adjusted Langevin Algorithm [4] and Hamiltonian Monte Carlo [3,14].

Conclusions
We introduce the Stochastic Volatility Metropolis-Hastings and Locally Scaled Metropolis-Hastings Markov chain Monte Carlo (MCMC) algorithms. We compare these methods with the vanilla Metropolis-Hastings algorithm, the Metropolis adjusted Langevin algorithm (MALA), and the No-U-Turn sampler, using a simulation study on multivariate Gaussian distributions and Neal's funnel, as well as real-world datasets modelled using jump diffusion processes and Bayesian logistic regression.
Overall, the No-U-Turn sampler outperforms all the other MCMC methods on an effective sample size basis. Stochastic Volatility Metropolis-Hastings (SVMH) and Locally Scaled Metropolis-Hastings produce higher effective sample sizes than Metropolis-Hastings, even after accounting for the execution time. In addition, SVMH outperforms all the methods in terms of predictive performance on the majority of the real world datasets. Furthermore, the SVMH method outperforms NUTS and MALA on the majority of the jump diffusion datasets on a time-normalised effective sample size basis. Given that SVMH adds no notable computational complexity to the Metropolis-Hastings algorithm, there seems to be little impediment to its use in place of Metropolis-Hastings in practice.
This work can be improved by examining the case where the scaling matrix in Stochastic Volatility Metropolis-Hastings is sampled from a Wishart distribution, so that correlations between the different parameter dimensions are taken into account. An analysis to determine the optimal distribution for the scaling matrix could also improve this work. Constructing a stochastic volatility version of the Metropolis adjusted Langevin algorithm could also be of interest, especially since MALA can use first-order gradient information to improve exploration of the target. Furthermore, we plan to include the calibration of jump diffusion models with stochastic volatility characteristics using MCMC techniques in future work.