Inference Using Simulated Neural Moments

This paper studies method of simulated moments (MSM) estimators that are implemented using Bayesian methods, specifically Markov chain Monte Carlo (MCMC). Motivation and theory for the methods is provided by Chernozhukov and Hong (2003). The paper shows, experimentally, that confidence intervals using these methods may have coverage which is far from the nominal level, a result which has parallels in the literature that studies overidentified GMM estimators. A neural network may be used to reduce the dimension of an initial set of moments to the minimum number that maintains identification, as in Creel (2017). When MSM-MCMC estimation and inference is based on such moments, and uses a continuously updating criterion function, confidence intervals have statistically correct coverage in all cases studied. The methods are illustrated by application to several test models, including a small DSGE model, and to a jump-diffusion model for returns of the S&P 500 index.


Introduction
It has long been known that classical inference methods based on first-order asymptotic theory, when applied to the generalized method of moments estimator, may lead to unreliable results, in the form of substantial finite sample biases and variances, and incorrect coverage of confidence intervals, especially when the model is overidentified (Donald et al. 2009; Hall and Horowitz 1996; Hansen et al. 1996; Tauchen 1986). In another strand of the literature, Chernozhukov and Hong (2003) introduced Laplace-type estimators, which allow for estimation and inference with classical statistical methods (those which are defined by optimization of an objective function) to be done by working with the elements of a tuned Markov chain, so that potentially difficult or unreliable steps such as optimization or computation of asymptotic standard errors, etc., may be avoided. A third important strand of literature is simulation-based estimation. The strands of moment-based estimation, simulation, and Laplace-type methods meet in certain applications. The code by Gallant and Tauchen (Gallant and Tauchen 2010) for efficient method of moments estimation (Gallant and Tauchen 1996), which has been used in numerous papers, is an example. Another is Christiano et al. (2010) (see also Christiano et al. 2016), which proposes a Laplace-type estimation methodology that uses simulated moments defined in terms of impulse response functions for estimation of macroeconomic models. Very similar methodologies may be found in the broad Approximate Bayesian Computation literature, some of which uses MCMC methods and criterion functions that involve simulated moments (e.g., Marjoram et al. 2003).
Given the uneven performance of inference in classical GMM applications, one may wonder how reliable inferences are when made using the combination of Laplace-type methods and simulated moments. Henceforth, this combination is referred to as MSM-MCMC, because the specific Laplace-type method considered here is to use the criterion function of the MSM estimator to define the likelihood that determines acceptance/rejection in Metropolis-Hastings MCMC, as was the focus of Chernozhukov and Hong (2003). This paper provides experimental evidence that confidence intervals derived from such estimators may have poor coverage when the moments over-identify the parameters, a result that parallels the above cited results for classical GMM estimators. It goes on to provide evidence that the just-identifying simulated neural moments introduced in Creel (2017), when used with MSM-MCMC techniques, make inferences much more reliable, especially when the continuously updating version of the GMM criterion is used. This paper is a continuation of the line of research in Creel (2017), its main new contribution being the experimental confirmation that inferences based upon simulated neural moments are reliable. The paper concludes with an example that uses the methods to estimate a jump-diffusion model for returns of the S&P 500 index.
Section 2 reviews how Laplace-type methods may be used with simulated moments, giving the MSM-MCMC combination, and Section 3 then discusses how neural networks may be used to reduce the dimension of the moment conditions. Section 4 presents four test models, and Section 5 gives results for these models. Section 6 illustrates the methods in the context of an empirical analysis of a model of more complexity, concretely, a jump-diffusion model for financial returns, and Section 7 summarizes the conclusions. The SNM archive (release version 1.2) contains all the code and results reported in this paper. These results were obtained using the Julia package SimulatedNeuralMoments.jl (release version 0.1.0), which provides a convenient way to use the methods for other research projects.

Simulated Moments, Indirect Likelihood, and MSM-MCMC Inference
This section summarizes results from the part of the simulation-based estimation literature that bases estimation on a statistic, including (Gallant and Tauchen 1996; Gouriéroux et al. 1993; McFadden 1989; Smith 1993), among others, which is reviewed in (Jiang and Turnbull 2004). Suppose there is a model M(θ) which generates data from a probability distribution P(θ) which depends on the unknown parameter vector θ. M(θ) is fully known up to θ, so that we can make draws of the data from the model, given θ. Let Y = Y(θ) be a sample drawn at the parameter vector θ, where θ ∈ Θ ⊂ R^k and Θ is a known parameter space. Suppose we have selected a finite-dimensional statistic Z = Z(θ) = Z(Y(θ)) upon which to base estimation, and assume that the statistic satisfies a central limit theorem, uniformly, for all values of θ of interest:

√n (Z(θ) − Z∞(θ)) →_d N(0, Σ(θ))    (1)

where Z∞(θ) denotes the probability limit of Z(θ). Let Z_s(θ) = Z(Y_s(θ)) be the statistic evaluated using an artificial sample drawn from the model at the parameter value θ. This statistic has the same asymptotic distribution as does Z(θ), and furthermore, the two statistics are independent of one another. With S such simulated statistics, define

Z̄_S(θ) = S^{−1} ∑_{s=1}^{S} Z_s(θ)    (2)

Now, suppose we have a real sample which was generated at the unknown true parameter value θ_0, and let Ẑ be the associated value of the statistic. Define m̂(θ) = Ẑ − Z̄_S(θ). With this, and Equation (2), √n m̂(θ_0) →_d N(0, V(θ_0)), where V(θ) = (1 + S^{−1}) Σ(θ), and we can define the indirect likelihood function 1

L(θ) ∝ |V̂(θ)|^{−1/2} exp{−H(θ)/2}    (3)

where

H(θ) = n m̂(θ)′ V̂(θ)^{−1} m̂(θ)    (4)

and V̂(θ) is a consistent estimate of V(θ).
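As a concrete illustration, the quadratic criterion H(θ) can be sketched in a few lines. The following Python sketch (the paper's own code is in Julia) assumes a user-supplied function `simulate_stat(theta, rng)` that returns the statistic Z_s(θ) computed from one artificial sample; all names here are illustrative, not from the paper's archive.

```python
import numpy as np

def msm_criterion(theta, z_hat, simulate_stat, S, V_hat, n, rng):
    """H(theta) = n * m(theta)' V_hat^{-1} m(theta), where
    m(theta) = z_hat - (1/S) * sum_s Z_s(theta).
    Up to a constant, the log indirect likelihood is
    -0.5 * log det(V_hat) - 0.5 * H(theta)."""
    z_bar = np.mean([simulate_stat(theta, rng) for _ in range(S)], axis=0)
    m = z_hat - z_bar
    return float(n * m @ np.linalg.solve(V_hat, m))
```

With a fixed V̂, maximizing the indirect likelihood and minimizing H coincide, as noted below.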
To estimate V(θ), one possibility is to use a fixed sample-based estimate that does not rely on an estimate of θ_0 (see, for example, Christiano et al. 2010, 2016). Another possibility is to (1) compute the estimate Σ̂(θ) of the covariance matrix in expression (1) as the sample covariance of R draws of √n Z_s(θ):

Σ̂(θ) = R^{−1} ∑_{r=1}^{R} (√n Z_r(θ) − M)(√n Z_r(θ) − M)′    (5)

where M = R^{−1} ∑_r √n Z_r(θ) is the sample mean of the draws, and then (2) multiply the result by 1 + S^{−1} to obtain the estimate

V̂(θ) = (1 + S^{−1}) Σ̂(θ)    (6)

This estimator may be used in a continuously updating fashion, by updating V̂(θ) in Equations (3) or (4) every time the respective function is evaluated. Alternatively, if we obtain an initial consistent estimator of θ_0, then V̂(θ) can be computed at this estimate, and kept fixed in subsequent computations, in the usual two-step manner. Please note that if a fixed covariance estimator is used, then the maximizer of L is the same as the minimizer of H.
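A sketch of this simulation-based covariance estimator (Σ̂ scaled by 1 + S⁻¹, i.e., Equation (6)), again in illustrative Python with a hypothetical `simulate_stat`:

```python
import numpy as np

def vhat(theta, simulate_stat, R, S, n, rng):
    """Sigma_hat(theta): sample covariance of R draws of sqrt(n)*Z_r(theta);
    V_hat(theta) = (1 + 1/S) * Sigma_hat(theta)."""
    draws = np.array([np.sqrt(n) * simulate_stat(theta, rng) for _ in range(R)])
    M = draws.mean(axis=0)                      # sample mean of the draws
    sigma_hat = (draws - M).T @ (draws - M) / R
    return (1.0 + 1.0 / S) * sigma_hat
```

Calling this inside every criterion evaluation gives the CUE version; computing it once at a consistent preliminary estimate and holding it fixed gives the two-step version.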
Extremum estimators may be obtained by maximizing log L, or minimizing H. Laplace-type estimators, as defined by Chernozhukov and Hong (2003), may be defined by setting their general criterion function, L_n(θ), as defined in their Section 3.1, to either log L, or −(1/2)H. Once this is done, the practical methodology is to use Markov chain Monte Carlo (MCMC) methods to draw a chain C = {θ_r}, r = 1, 2, ..., R, given the sample statistic Ẑ, where acceptance/rejection is determined using the chosen L_n(θ), along with a prior, and standard proposal methods 2 . This specific version of Laplace-type methods is referred to as MSM-MCMC in this paper. This paper relies directly on the theory and methods of Chernozhukov and Hong (2003), as MSM-MCMC falls within the class of methods they study. In what follows, a primary use of the Chernozhukov and Hong (2003) methodology is to obtain confidence intervals. For a function f(θ), Theorem 3 of Chernozhukov and Hong (2003) proves that a valid confidence interval can be obtained using the quantiles of {f(θ_r)}, r = 1, 2, ..., R, based on the final chain C. For example, a 95% confidence interval for a parameter θ_j is given by the interval (Q_{θ_j}(0.025), Q_{θ_j}(0.975)), where Q_{θ_j}(τ) is the τth quantile of the R values of the parameter θ_j in the chain C.
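A minimal random-walk Metropolis implementation of this procedure can be sketched as follows, in illustrative Python (hypothetical function names; the paper's implementation is in MCMC.jl):

```python
import numpy as np

def msm_mcmc(log_posterior, theta0, prop_cov, R, rng):
    """Draw a chain of length R, where acceptance/rejection uses
    L_n(theta) = -H(theta)/2 plus the log prior, with a random-walk
    multivariate normal proposal."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    L = np.linalg.cholesky(prop_cov)            # proposal scale
    lp = log_posterior(theta)
    chain = np.empty((R, theta.size))
    for r in range(R):
        trial = theta + L @ rng.standard_normal(theta.size)
        lp_trial = log_posterior(trial)
        if np.log(rng.uniform()) < lp_trial - lp:   # Metropolis-Hastings step
            theta, lp = trial, lp_trial
        chain[r] = theta
    return chain

def quantile_ci(chain, j, alpha=0.05):
    """Interval from chain quantiles, per Theorem 3 of
    Chernozhukov and Hong (2003)."""
    return np.quantile(chain[:, j], [alpha / 2, 1 - alpha / 2])
```

In practice one would discard an initial burn-in portion of the chain before taking quantiles.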

Neural Moments
The dimension of the statistics used for estimation, Z, can be made minimal (equal to the dimension of the parameter to estimate, θ) by filtering an initial set of statistics, say, W, through a trained neural net. Details of this process are explained in Creel (2017) and references cited therein, and the process is made explicit in the code which accompanies this paper 3 . A summary of this process is: Suppose that W is a p-vector of statistics W = W(Y), with p ≥ k, where k = dim θ. We may generate a large sample of (W, θ) pairs, as follows:

1. Draw θ_s from the parameter space Θ, using some prior distribution.

2. Draw a sample Y_s from the model M(θ) at θ_s.

3. Compute the vector of raw statistics W(Y_s).

We can repeat this process to generate a large data set {θ_s, W_s}, s = 1, 2, ..., S, which can be used to train a neural network which predicts θ, given W. This process can be done without knowledge of the real sample data, and can in fact be done before the real sample data are gathered. The prediction from the net will be of the same dimension as θ, and, according to results collectively known as the universal approximation theorem, will be a very accurate approximation to the posterior mean of θ conditional on W (Hornik et al. 1989; Lu et al. 2017). The output of the net may be represented as θ̂ = f(W, φ̂), where f(W, φ): R^p → R^k is the neural net, with parameters φ, that takes as inputs the p statistics W, and has k = dim θ outputs. The parameters of the net, φ, are adjusted using standard training methods from the neural net literature to obtain the trained parameters, φ̂. Then we can think of θ̂ = f(W, φ̂) as a k-dimensional statistic which can be computed essentially instantaneously once provided with W. We will use this statistic θ̂ as the Z of the previous section. Because the statistic is an accurate approximation to the posterior mean conditional on W (supposing the net was well trained), it has two virtues: it is informative for θ (supposing that the initial statistics W contain information on θ) and it has the minimal dimension needed to identify θ. From the related GMM literature, GMM methods are known to lead to inaccurate inference when the dimension of the moments is large relative to the dimension of the parameter vector (Donald et al. 2009). Use of a neural net as described here reduces the dimension of the statistic to the minimum required for identification.
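The three steps, followed by training, can be sketched as follows. The sketch uses Python and a tiny hand-rolled one-hidden-layer net so that it is self-contained; the toy model (Y ~ N(µ, σ²), so k = 2), the statistics, and all names are illustrative stand-ins for the user's model M(θ) and raw statistics W, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
n, S, p, k, hidden = 200, 5000, 4, 2, 16

def draw_theta():                         # step 1: draw from a uniform prior
    return rng.uniform([-1.0, 0.1], [1.0, 2.0])

def raw_stats(y):                         # step 3: p = 4 raw statistics W(Y)
    return np.array([y.mean(), y.std(), np.median(y), np.abs(y).mean()])

thetas = np.array([draw_theta() for _ in range(S)])
W = np.array([raw_stats(rng.normal(t[0], t[1], n)) for t in thetas])  # step 2

# train f(W, phi): R^p -> R^k by full-batch gradient descent on squared error
A, b = rng.normal(0.0, 0.5, (p, hidden)), np.zeros(hidden)
B, c = rng.normal(0.0, 0.5, (hidden, k)), np.zeros(k)
losses = []
for epoch in range(300):
    H = np.tanh(W @ A + b)                # hidden layer
    err = H @ B + c - thetas              # prediction error
    losses.append(float((err ** 2).mean()))
    gH = (err @ B.T) * (1.0 - H ** 2)     # backpropagate through tanh
    A -= 0.02 * (W.T @ gH) / S;  b -= 0.02 * gH.mean(axis=0)
    B -= 0.02 * (H.T @ err) / S; c -= 0.02 * err.mean(axis=0)

def neural_moment(w):                     # theta_hat = f(W, phi_hat), used as Z
    return np.tanh(w @ A + b) @ B + c
```

In real applications a deep-learning library with regularization and early stopping would be used, but the structure, simulate, train, then treat the net's output as the statistic, is the same.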
When the statistic Z is the output of a neural net f(W, φ), where the parameter vector of the net, φ, can have a very high dimension (hundreds or thousands of parameters are not uncommon), the simulated likelihood of Equation (3) will be a wavy function, with many local maxima. This will occur even if the net is trained using regularization methods. Because of this waviness, gradient-based methods will not be effective when attempting to maximize log L or to minimize H (Equations (3) and (4)), and attempts to compute the covariance matrix of the estimator that rely on derivatives of the log likelihood function will also be unlikely to succeed. However, derivative-free methods can be used to compute extremum estimators, to obtain point estimates or to initialize a MCMC chain, and the simulation-based estimator of the covariance matrix Σ(θ) of Equation (1), discussed in the previous section, does not depend on derivatives. A major motivation for using Laplace-type estimators in the first place is to overcome problems of local extrema, as Chernozhukov and Hong (2003) emphasize. It is worth noting that the output of the net evaluated at the real sample statistic, θ̂, will also provide an excellent starting value for computing extremum estimators, or for initializing a MCMC chain. Likewise, the covariance estimator of Equation (6) can be used to define a random walk multivariate normal proposal density for MCMC, by drawing the trial value θ_{s+1} from N(θ_s, V̂), where θ_s is the current value of the chain. Experience with this proposal density, as reported below, is that it is easy to tune, by scaling the covariance by a scalar, to achieve an acceptance rate within the desired limits 4 . Creel (2017) used neural moments to compute a Laplace-type estimator, similarly to what is done here.
That paper used nonparametric regression quantiles applied to the set of draws from the Laplace-type posterior to compute confidence intervals, and the posterior draws were generated by a procedure similar to sequential Monte Carlo, rather than MCMC. Additionally, the metric used for selection of particles was different from the GMM criterion used here. The use of nonparametric regression quantiles is very costly to study by Monte Carlo. Thus, this paper focuses on straightforward use of the methods that Chernozhukov and Hong (2003) focus on: traditional MCMC using the GMM criterion function, with confidence intervals computed using the direct quantiles from the posterior sample. These simplifications give a simpler and more tractable procedure that can reasonably be studied and verified by Monte Carlo. For theoretical support, we can note that the methods fall within the class studied by Chernozhukov and Hong, the only innovation being the use of statistics filtered through a previously trained neural net. The neural nets used here consist of a finite series of nonstochastic nonlinear mappings to the (−1, 1) interval, followed by a final linear transformation. As such, the conjecture that the final statistics output by the net obey a uniform law of large numbers and a uniform central limit theorem seems reasonable, but this is not formally verified in this paper.

Examples
This section presents example models that are used to investigate the performance of the proposed methods. For all models, the code used (for the Julia language) is available in an archive 5 , release version 1.2, where the details of each example may be consulted. The example models also serve as templates for applying the proposed methods to models of the reader's interest: one simply needs to provide functions similar to those found in the directory for each example, for the model of interest. These are, fundamentally, (1) a prior from which to draw the parameters; (2) code to simulate the model given the parameter value; and, finally, (3) code to compute the initial statistics, W, given the data generated from the model. For the examples, uniform and fairly uninformative priors were used in all cases. The details regarding priors and statistics, W, may be consulted in the links provided below.

Stochastic Volatility
The simple stochastic volatility (SV) model is

y_t = φ exp(h_t/2) ε_t
h_t = ρ h_{t−1} + σ u_t

where ε_t and u_t are independent standard normal random variables. We use a sample size of 500 observations, and the true parameter values are θ_0 = (φ_0, ρ_0, σ_0) = (0.692, 0.9, 0.363). These parameter values have been chosen to facilitate comparison with results of several previous studies that have used the same SV model to check properties of estimators. For estimation, 11 statistics are used to form the initial set, W, which include moments of y and of |y|, as well as the estimated parameters of a heterogeneous autoregressive (HAR) auxiliary model (Corsi 2009) fit to |y|. 6
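The SV model is straightforward to simulate. The Python sketch below uses a standard parameterization consistent with these parameters (y_t = φ exp(h_t/2) ε_t, h_t = ρ h_{t−1} + σ u_t; the paper's exact specification is in its code archive), with a burn-in so that h_t starts near its stationary distribution:

```python
import numpy as np

def simulate_sv(theta, n, rng, burnin=100):
    """Simulate h_t = rho*h_{t-1} + sigma*u_t and
    y_t = phi*exp(h_t/2)*eps_t, with eps_t, u_t iid standard normal."""
    phi, rho, sigma = theta
    h, y = 0.0, np.empty(n)
    for t in range(-burnin, n):
        h = rho * h + sigma * rng.standard_normal()
        if t >= 0:
            y[t] = phi * np.exp(h / 2.0) * rng.standard_normal()
    return y
```

With ρ = 0.9 the log volatility is persistent, producing the volatility clustering that the HAR auxiliary statistics are designed to capture.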

ARMA
The next example is a simple ARMA(1, 1) model with true values θ_0 = (α_0, β_0, σ²_0) = (0.95, 0.5, 1.0). The sample size is n = 300. The 13 statistics used to define the initial set, W, include sample moments and correlations, OLS estimates of an AR(1) auxiliary model fit to x_t, as well as another AR(1) model fit to the residuals of the first model, plus partial autocorrelations of x_t. 7

Mixture of Normals
For the mixture of normals example, the variable y is drawn from the distribution N(µ_1, σ_1²) with probability p and from N(µ_1 − µ_2, σ_1² + σ_2²) with probability 1 − p. Samples of 1000 observations are drawn. The true parameter values are θ_0 = (µ_1, σ_1, µ_2, σ_2, p) = (1.0, 1.0, 0.2, 1.8, 0.4), and the prior restricts all parameters to be positive. Thus, the parameterization and the prior together impose that the first component has a larger mean and a lower variance than does the second component, in order to ensure identification. Additionally, the probability that either component is sampled is restricted to be at least 0.05. The 15 auxiliary statistics are the sample mean, standard deviation, skewness, kurtosis, and 11 quantiles of y. 8
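A sketch of this data-generating process in Python (illustrative only):

```python
import numpy as np

def draw_mixture(theta, n, rng):
    """y ~ N(mu1, s1^2) with probability p, and
    y ~ N(mu1 - mu2, s1^2 + s2^2) with probability 1 - p."""
    mu1, s1, mu2, s2, p = theta
    first = rng.uniform(size=n) < p
    return np.where(first,
                    rng.normal(mu1, s1, n),
                    rng.normal(mu1 - mu2, np.sqrt(s1 ** 2 + s2 ** 2), n))
```

At the true values, the second component has mean 1.0 − 0.2 = 0.8 and standard deviation √(1 + 1.8²) ≈ 2.06, so the ordering restrictions described above hold.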

DSGE Model
The previous models are all simple, quickly simulated, and with relatively few parameters. This section presents a model which is more representative of an actual research problem. The model is a simple dynamic stochastic general equilibrium model with two shocks. At the beginning of period t, the representative household owns a given amount of capital (k_t), and chooses consumption (c_t), investment (i_t) and hours of labor (n_t) to maximize expected discounted utility, subject to the budget constraint c_t + i_t = r_t k_t + w_t n_t, available time 0 ≤ n_t ≤ 1, and the accumulation of capital k_{t+1} = i_t + (1 − δ)k_t, each of which must hold for all t. The shock, η_t, that affects the desirability of leisure relative to consumption, evolves according to ln η_t = ρ_η ln η_{t−1} + σ_η ε_t. The single competitive firm maximizes profits y_t − w_t n_t − r_t k_t from production of the good (y_t), taking wages (w_t) and the interest rate (r_t) as given, using a constant returns to scale technology. The technology shock, z_t, also follows an AR(1) process in logarithms: ln z_t = ρ_z ln z_{t−1} + σ_z u_t. The innovations to the preference and technology shocks, ε_t and u_t, are independent standard normal random variables. Production (y_t) can be allocated by the consumer to consumption or investment: y_t = c_t + i_t. The consumer provides capital and labor to the firm, and is paid at the competitive rates r_t and w_t, respectively. From this model, samples of size 160, which simulate 40 years of quarterly data, are drawn, given the 9 parameters α, β, γ, δ, ρ_z, σ_z, ρ_η, σ_η and ψ. The variables available for estimation are y, c, n, w, and r. It is possible to recover the parameters α and δ exactly, given the observable variables, so these two parameters are set to fixed values, and the remaining 7 parameters are estimated. To facilitate setting priors, the steady state value of hours (n) is estimated instead of ψ, which may then be recovered.
For estimation, 45 statistics are used, including means and standard deviations of the observable variables, and estimates from auxiliary regressions 9 .

Monte Carlo Results
This section reports results for MSM-MCMC estimation of each of the test models, using the GMM-like criterion function H (Equation (4)) as the L_n of Chernozhukov and Hong (2003). Results using the criterion L (Equation (3)) were qualitatively very similar in all cases where the two versions were computed, and are thus not reported 10 . In all cases, 500 Monte Carlo replications were done. For all the test models, the number of artificial samples used to train the neural net was 20,000 times the number of parameters of the model. This is actually a fairly small number, given that generating the samples and training the nets is an operation that takes only 10 minutes or less for the test models, other than the DSGE model. The reason that a larger number of samples was not used is that it was desired to obtain results that may be more relevant for cases where it is more costly to simulate from the model, as is the case for the jump-diffusion model studied below.
First, we report results for the SV and ARMA models, where MSM-MCMC estimators were computed using both the overidentifying statistic, W, and the exactly identifying neural moments, Z. For the plain overidentifying statistics, the results are computed using the CUE GMM criterion. For the neural moments, both the two-step and the CUE criteria were used. Table 1 reports RMSE. The three versions of the MSM-MCMC estimators lead to similar RMSEs, generally speaking. The version that uses the raw statistics has somewhat higher RMSE than do the versions based on the neural statistics, Z, in most cases, but the differences are not important.
Tables 2-4 address the main point of the paper, inference, reporting confidence interval coverage, which is the proportion of times that the true parameter lies inside the computed confidence interval. Critical coverage proportions that would lead one to reject correct coverage may be computed from the binomial(500, p) distribution, where p is the nominal coverage probability of the respective confidence interval. These critical coverage proportions are 0.864 and 0.932 for 90% intervals, 0.924 and 0.974 for 95% intervals, and 0.976 and 1.0 for 99% intervals. Looking at the column labeled W (CUE) in these tables, we see that the results on the unreliability of inferences for overidentified GMM estimators, which were reviewed in the Introduction, carry over to Bayesian MCMC methods, at least for the models considered. In all entries but one, the coverage is significantly different from correct coverage, erring on the side of being too low, and, in many cases, considerably so. This implies that the probability of Type-I error is higher than the associated nominal significance level. For the neural net statistics, Z, coverage is improved. For the two-step version, coverage is in all cases closer to the correct proportion than when the raw statistics, W, are used. In several cases, correct coverage is also statistically rejected, but now, the error is on the side of conservative confidence intervals, which contain the true parameters more often than the nominal coverage. In this case, the probability of Type-I error will be less than the nominal significance level associated with the confidence intervals. For the CUE version that uses the neural statistics, Z, coverage is very good, and is close to the nominal proportion in all cases. Statistically correct coverage is never rejected when neural moments and the CUE criterion are used.
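Critical coverage proportions of this kind can be reproduced from binomial quantiles. The sketch below (Python) assumes a two-sided test of roughly 1% size, which gives bounds close to, though not necessarily identical to, the reported values; the paper does not state the exact test size used.

```python
import math

def binom_ppf(q, n, p):
    """Smallest k such that P(X <= k) >= q, for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        logpmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1)
                  + k * math.log(p) + (n - k) * math.log(1 - p))
        cdf += math.exp(logpmf)
        if cdf >= q:
            return k
    return n

# e.g., bounds for rejecting correct coverage of a 90% interval,
# with 500 Monte Carlo replications and a two-sided 1% test
lo = binom_ppf(0.005, 500, 0.90) / 500
hi = binom_ppf(0.995, 500, 0.90) / 500
```

Observed coverage proportions outside (lo, hi) would lead to rejection of correct coverage for the 90% intervals.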
For the other two test models, MN and DSGE, results were computed only for the neural moments, as the results for the SV and ARMA models, as well as the results from the GMM literature, already indicated that inferences based on the raw statistics, W, were very likely to be unreliable. Table 5 has the RMSE results for the MN and DSGE models. We can see that the use of the two-step or CUE criterion makes little difference for RMSE, in common with the above results for the SV and ARMA models. Tables 6-8 hold the confidence interval coverage results for these two models. Again, the intervals based on the two-step criterion often contain the true parameters more often than they should. The coverage of intervals based on the CUE criterion is very good in all cases, and is never statistically significantly different from correct.
In summary, this section has shown that confidence intervals based on raw overidentifying statistics may be unreliable, rejecting the true parameter values more often than they should. Intervals based on exactly identifying neural moments are more reliable, in general. When the two-step version is used, the intervals are often too broad, so the probability of Type-I error is less than what it should be, and power to reject false hypotheses is lower than it could be. When the CUE version is used, coverage is very accurate: correct coverage was never rejected in any of the cases.
It is to be noted that the CUE version is computationally more demanding than is the two-step version, as the weight matrix must be estimated at each MCMC trial vector. Each of these estimations requires a reasonably large number of simulations to be drawn, to estimate the covariance matrix accurately. If a researcher is primarily concerned with limiting the probability of Type-I error, and is willing to accept a loss of power to accelerate computations, then the two-step version might be preferred. If one is willing to accept more costly computations, all the examples considered here indicate that the CUE version will lead to accurate confidence intervals.

Application: A Jump-Diffusion Model of S&P 500 Returns
The previous examples are mostly small models that are not costly to simulate, except for the DSGE example. As an example of a more computationally challenging model that may be more representative of actual research problems, this section presents results for estimation of a jump-diffusion model of S&P 500 returns. Solving and simulating the model 11 for each MCMC trial parameter acceptance/rejection decision takes about 15 s when the CUE criterion is used, so training a net and estimation by MCMC are somewhat costly, requiring approximately 2.5 days to complete using a moderate-power workstation 12 and threads-based parallelization, where possible. This example is intended to show that the methods are feasible for moderately complex models.
The jump-diffusion model is

dp_t = µ dt + exp(h_t) dW_1t + J_t dN_t
dh_t = κ(α − h_t) dt + σ dW_2t

where p_t is 100 times log price, h_t is log volatility, J_t is jump size, and N_t is a Poisson process with jump intensity λ_0. W_1t and W_2t are two standard Brownian motions with correlation ρ. When a jump occurs, its size is J_t = a λ_1 exp(h_t), where a is 1 with probability 0.5 and −1 with probability 0.5. Therefore, jump size depends on the current standard deviation, and jumps are positive or negative with equal probability. Log price, p_t, is simulated using 10-minute tics, and the observed log price adds a N(0, τ²) measurement error to p_t, when τ is greater than zero. From this model, 1000 daily observations on returns, realized volatility (RV), and bipower variation (BV) are generated. Because log price has been scaled by 100 in the parameterization of the model, returns, computed as the first difference of p_t at the close of trading days, are directly in percentage terms. Both RV and BV are informative about volatility, and, because BV is somewhat robust to jumps, while RV is not, the difference between the two can help to identify the frequency and size of jumps (Barndorff-Nielsen and Shephard 2002). The model is simulated on a continuous 24-hour basis, and returns are computed using the change in daily log closing price, for trading days only. Overnight periods and weekends are simulated, but returns, RV and BV are recorded only at the close of trading days. In summary, the eight parameters are θ = (µ, κ, α, σ, ρ, λ_0, λ_1, τ), and the simulated data consist of 1000 daily observations on returns, RV and BV. The model studied here is quite similar to that studied in (Creel 2017; Creel and Kristensen 2015), except that the drift process is simplified to be constant, and the jump process is modeled somewhat differently, with constant intensity, and with the magnitude of a jump depending on the current instantaneous volatility.
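An Euler discretization of such a jump-diffusion can be sketched in Python at 10-minute steps (144 per 24-hour day). The state equations follow the description above (instantaneous standard deviation exp(h_t), mean-reverting log volatility, correlated Brownian motions), but the discretization details are illustrative, and the RV/BV construction, measurement error, and trading-day calendar are omitted:

```python
import numpy as np

def simulate_jd(theta, days, rng, steps_per_day=144):
    """Euler scheme: dp = mu*dt + exp(h)*dW1 + J*dN,
    dh = kappa*(alpha - h)*dt + sigma*dW2, corr(dW1, dW2) = rho,
    jump size J = a*lam1*exp(h), a = +1 or -1 each with probability 0.5."""
    mu, kappa, alpha, sigma, rho, lam0, lam1, tau = theta
    dt = 1.0 / steps_per_day
    p, h = 0.0, alpha
    path = np.empty(days * steps_per_day)
    for i in range(path.size):
        z1 = rng.standard_normal()
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
        jump = 0.0
        if rng.uniform() < lam0 * dt:                  # Poisson jump arrival
            jump = rng.choice([-1.0, 1.0]) * lam1 * np.exp(h)
        p += mu * dt + np.exp(h) * np.sqrt(dt) * z1 + jump
        h += kappa * (alpha - h) * dt + sigma * np.sqrt(dt) * z2
        path[i] = p
    return path
```

Daily returns, RV and BV would then be built from the within-day increments of the path, with measurement error added when τ > 0.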
These changes were motivated by the results of the previous papers, and by the better tractability of the present specification. The raw statistics, W, which are used to train the net and to do estimation, are a combination of coefficients from auxiliary regressions between the three observed variables, summary statistics, and functions of quantiles of the variables. The details of the 25 statistics are found in the file JDlib.jl (this same file also gives details of the priors, which are uniform over fairly broad supports, for all parameters). The neural net was fit using 160,000 draws from the prior to generate the training and testing data. The importance of each of the statistics can be assessed by examining the maximal absolute weights on each of the raw statistics in the first layer of the neural net, as is discussed in Creel (2017). These may be seen in Figure 1. We see that most of the 25 statistics have a non-negligible importance, which means that these statistics are contributing information to the fit of the neural net. The output of the net is of dimension 8, the same as the dimension of the parameter vector of the model. The net combines the information in the overidentifying statistics to construct a just-identifying vector of statistics, which is then used to define the moments for MSM-MCMC estimation. The model was fit, using MSM-MCMC and the CUE criterion, to S&P 500 data 13 from 16 December 2013 to 4 December 2017, which is an interval of 1000 trading days, the same sample size as was used to train the neural net. The data may be seen in Figure 2, where we observe typical volatility clusters and some jumps. For example, the Brexit drop of June 2016 is clearly seen, and the more extreme spike in RV versus BV at this point illustrates the fact that jumps can be identified by comparing the two. Over the sample period, this price index climbed from about 1800 to 2600, which is approximately a 44% increase, or approximately 0.04% per trading day.
Ten MCMC chains of length 1000 were drawn independently using threads-based parallelization, for a final chain of length 10,000. The estimation results are in Figure 3, which shows nonparametric plots of the marginal posterior density for each parameter, along with posterior means and medians, and 90% confidence intervals defined by the limits of the green areas. All posteriors are considerably more concentrated than are the priors. Drift (µ) is concentrated around a value slightly below 0.04, which is consistent with the average daily returns over the sample period. There is quite a bit of persistence in volatility, as the mean reversion parameter, κ, is estimated to be quite low, concentrated around 0.125. Leverage (ρ) is quite strong, concentrated around −0.85. The jump probability per day (λ_0) is concentrated around 0.014, and is significantly different from zero. Therefore, jumps are a statistically important feature of the model. When a jump does occur, its magnitude (λ_1) is approximately 4 times the current instantaneous standard deviation, but this parameter is not very well identified, as the posterior is quite dispersed. An interesting result is that τ, the standard deviation of the measurement error of log price, is concentrated around 0.0055. It is not significantly different from zero at the 0.10 significance level, but it is close to being so. From the figure, one can see that H_0: τ ≤ 0 would be very close to being rejected at the 10% significance level. The parameterization of the model is such that there is no measurement error when τ ≤ 0. Thus, it appears that it is a safer option to allow for measurement error in the model, as the evidence suggests that it is very likely present, and its omission could bias the estimates of the other parameters.

Conclusions
This paper has shown, through Monte Carlo experimentation, that confidence intervals based upon quantiles of a tuned MCMC chain may have coverage which is far from the nominal level, even for simple models with few parameters. The results on poor reliability of inferences when using overidentified GMM estimators, which were referenced in the Introduction, carry over to a Bayesian version of overidentified MSM, implemented using the Chernozhukov and Hong (2003) methodology, for the models considered in this paper. The paper proposes to use neural networks to reduce the dimension of an initial set of moments to the minimum number of moments needed to maintain identification, following Creel (2017). When estimation and inference using the MSM-MCMC methodology is based on neural moments, which are exactly identifying, confidence intervals have statistically correct coverage in all cases studied by Monte Carlo, when the CUE version of MSM is used. Thus, there seems to be no generic problem with the MSM-MCMC methodology for the purpose of inference. A potential problem has to do with the choice of moments upon which MSM is based. Too much over-identification results in poor inferences, for the models studied. The use of neural moments solves this problem, by reducing the number of moments, without losing the information that they contain. The fact that RMSE does not rise when one moves from raw to neural moments illustrates that the neural moments do not lose the information that is contained in the larger set. The methods have been illustrated empirically by the estimation of a jump-diffusion model for S&P 500 data. An interesting result of the empirical work is that measurement error in log prices is likely to be present.
It is to be noted that the step of filtering moments through a neural net is very easy and quick to perform using modern deep learning software environments. The software archive that accompanies this paper provides a function for automatic training, requiring no human intervention; it requires only functions that provide simulated moments, computed using data drawn from the model at parameter values drawn from the prior. Filtering moments through a neural net gives an informative, minimal-dimension statistic as the output. This provides a convenient and automatic alternative to moment selection procedures: uninformative moments are essentially removed, and correlated moments are combined. This paper has examined how inference using the MSM-MCMC estimator may be improved when neural moments are used instead of a vector of overidentifying moments. It seems likely that other inference methods used with simulation-based estimators, such as Hamiltonian Monte Carlo and sequential Monte Carlo, among others, may be made more reliable if neural moments are used, as dimension reduction that maintains the relevant information is likely to be generally beneficial.

Funding: Support from the Spanish Ministry of Science, Innovation and Universities and FEDER, through a Severo Ochoa Center of Excellence award to the Barcelona GSE and through grant PGC2018-094364-B-I00, is gratefully acknowledged.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: https://realized.oxford-man.ox.ac.uk/data/download

Conflicts of Interest:
The author declares no conflict of interest.

Notes
1. These definitions and notation are loosely based on Jiang and Turnbull (2004).
2. It may be noted that methods other than MCMC may be used to generate the set of draws from the posterior, C. For example, one might use sequential Monte Carlo. Point estimation and inference using C remain the same regardless of how C is generated.
3. The function which specifies and trains the neural net is MakeNeuralMoments.jl.
4. See the file MCMC.jl for the details of how this proposal density is implemented.
The details of the model and priors may be seen at CKlib.jl.
The model is solved using third order projection, making use of the SolveDSGE.jl package.
The model is discussed in more detail in Chapter 14 of the document econometrics.pdf.
10. These results are available for the SV and ARMA models, as well as an unreported additional model, in the WP branch of the GitHub archive.

11. The model is solved and simulated using the SRIW1 strong order 1.5 solver from the DifferentialEquations.jl package for the Julia language.
12. The workstation has 4 Opteron 6380 processors, each with 4 physical cores, running at 2500 MHz.
13. The data source is the Oxford-Man Institute's realized library, v. 0.3, https://realized.oxford-man.ox.ac.uk/images/oxfordmanrealizedvolatilityindices.zip