Galilean and Hamiltonian Monte Carlo

Abstract: Galilean Monte Carlo (GMC) allows exploration in a big space along systematic trajectories, thus evading the square-root inefficiency of independent steps. Galilean Monte Carlo has greater generality and power than its historical precursor Hamiltonian Monte Carlo because it discards second-order propagation under forces in favour of elementary force-free motion. Nested sampling (for which GMC was originally designed) has similar dominance over simulated annealing, which loses power by imposing an unnecessary thermal blurring over energy.


Introduction
Question: How does a mathematician find a needle in a haystack? Answer: Keep halving the haystack and discarding the "wrong" half.
This trick relies on having a test for whether the needle is or is not in the chosen half. With that test in hand, the mathematician expects to find the needle in log₂ N steps instead of the O(N/2) trials of direct point-by-point search.
The programmer is faced with a similar problem when trying to locate a small target in a large space. We do not generally have volume-wide global tests available to us, being instead restricted to point-wise evaluations of some quality function Q(x) at selected locations x. A successful algorithm should have two parts.
One part uses quality differences to drive successive trial locations towards better (larger) quality values. This iteration reduces the available possibilities by progressively eliminating bad (low quality) locations. Nested sampling [1] accomplishes this without needing to interpret Q as energy or anything else. By relying only on comparisons (>, =, <), it is invariant under any monotonic regrade, thereby preserving generality. Its historical precursor was simulated annealing [2], in which log Q was restrictively interpreted as energy in a thermal equilibrium.
The other part of a successful algorithm concerns how to move location without decreasing the quality attained so far. Here, it will often be more efficient to move systematically for several steps in a chosen direction, rather than diffuse slowly around with randomly directed individual steps. After n steps, the aim is to have moved Δx ∝ n, not just √n. Galilean Monte Carlo (GMC) accomplishes this with steady ("Galilean") motion controlled by quality value. Its historical precursor was Hamiltonian Monte Carlo (HMC) [3], in which motion was controlled by "Hamiltonian" forces restrictively defined by a quality gradient which may not even exist.
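The √n penalty of diffusive stepping is easy to demonstrate numerically. The following Python sketch (illustrative only; names are not from the text) measures the root-mean-square net displacement of a one-dimensional walk whose direction is re-randomised at every step:

```python
import random

def rms_distance(n_steps, n_walks=2000, seed=0):
    """Root-mean-square net displacement of a 1-D walk whose step
    direction is re-randomised at every step (diffusive motion)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        x = sum(rng.choice((-1, 1)) for _ in range(n_steps))
        total += x * x
    return (total / n_walks) ** 0.5
```

A diffusive walk of n unit steps covers only about √n on average, whereas steady motion in one direction would cover the full distance n.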
In both parts, nested sampling compression and GMC exploration, generality and power are retained by avoiding presentation in terms of physics. After all, elementary ideas underlie our understanding of physics, not the other way round, and discarding what isn't needed ought to be helpful.
There may be very many (N) possible "locations" x to choose from. For clarity, assume these have equal status a priori; there is no loss of generality because unequal status can always be modelled by retreating to an appropriate substratum of equivalent points. For the avoidance of doubt, we are investigating practical computation, so N is finite, though with no specific limit.
As the first step, the programmer with no initial guidance available can at best select a random location x₁ for the first evaluation Q₁ = Q(x₁). The task of locating larger values of Q carries no assumption of geometry or even topology. Locations could be shuffled arbitrarily and the task would remain just the same. Accordingly, we are allowed to shuffle the locations into decreasing Q-order without changing the task (Figure 1). If ordering is ambiguous because different locations have equal quality, the ambiguity can be resolved by assigning each location its own (random) key-value to break the degeneracy.

Being chosen at random, x₁'s shuffled rank Nu₁ marks an equally random fraction of the ordered N locations. Our knowledge of u₁ is uniform: u₁ ~ Uniform(0, 1). We can encode this knowledge as one or (better) more samples that simulate what the position might actually have been. If the programmer deems a single simulation too crude and many too bothersome, the mean and standard deviation

    log u₁ = −1 ± 1    (1)

often suffice.

The next step is to discard the "wrong" points with Q < Q₁ and select a second location x₂ randomly from the surviving Nu₁ possibilities. Being similarly random, x₂'s rank Nu₁u₂ marks a random fraction of those Nu₁, with u₂ ~ Uniform(0, 1) (Figure 2). And so on. After k steps of increasing quality Q₁ < Q₂ < ⋯ < Qₖ, the net compression ratio

    Xₖ = u₁ u₂ ⋯ uₖ ,    each uᵢ ~ Uniform(0, 1),    (2)

can be simulated as several samples from (2) to get results fully faithful to our knowledge, or simply abbreviated as mean and standard deviation

    log Xₖ = −k ± √k .    (3)

Compression proceeds exponentially until the user decides that Q has been adequately maximised. At that stage, the evaluated sequence Q₁ < Q₂ < ⋯ < Qₖ of qualities Q has been paired with the corresponding sequence X₁ > X₂ > ⋯ > Xₖ of compressions X (Figure 3), either severally simulated or abbreviated in mean. Q can then be integrated as

    Z = ∫₀¹ Q dX ≈ Σᵢ Qᵢ (Xᵢ₋₁ − Xᵢ)    (4)

so that quantification is available for Bayesian or other purposes.
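The claimed statistics of the compression ratio can be checked by direct simulation. In this illustrative Python sketch (names hypothetical), log Xₖ is a sum of k terms log uᵢ with uᵢ ~ Uniform(0, 1), whose mean and standard deviation should come out near −k and √k respectively:

```python
import math, random

def log_X_stats(k, n_sims=5000, seed=1):
    """Simulate log X_k = log u_1 + ... + log u_k, u_i ~ Uniform(0,1),
    returning the sample mean and standard deviation."""
    rng = random.Random(seed)
    sims = [sum(math.log(1.0 - rng.random()) for _ in range(k))
            for _ in range(n_sims)]
    mean = sum(sims) / n_sims
    sd = (sum((s - mean) ** 2 for s in sims) / (n_sims - 1)) ** 0.5
    return mean, sd
```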
Any function of Q can be integrated too from the same run, changing Qᵢ to some related Q′ᵢ while leaving the Xᵢ fixed. And the statistical uncertainty in any integral Z is trivially acquired from the repeated simulations (2) of what the compressions X might have been, according to their known distributions.

That's nested sampling. It requires two user procedures in addition to the Q(x) function. The first is to sample an initially random location to start the procedure. The second, which we next address, is to move to a new random location obeying a lower bound on Q. Note that there is no modulation within a constrained domain. Locations are either acceptable, Q(x) ≥ Q*, or not, Q(x) < Q*.
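To make the bookkeeping concrete, here is a minimal toy implementation of nested sampling in Python. It is a sketch, not the paper's code: the user-supplied prior sampler, log-quality function, and constrained sampler are hypothetical names, and each compression step is abbreviated by its mean, Xᵢ = e^{−i/n}, rather than simulated severally:

```python
import math, random

def nested_sampling(sample_prior, loglike, sample_constrained,
                    n_live=50, n_iter=500, seed=2):
    """Minimal nested-sampling sketch: accumulate the evidence
    Z ~ sum_i L_i (X_{i-1} - X_i) with mean compression X_i = exp(-i/n_live)."""
    rng = random.Random(seed)
    live = [sample_prior(rng) for _ in range(n_live)]
    logL = [loglike(p) for p in live]
    Z, X_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        worst = min(range(n_live), key=logL.__getitem__)
        X = math.exp(-i / n_live)
        Z += math.exp(logL[worst]) * (X_prev - X)
        X_prev = X
        # replace the worst point by a new prior sample obeying L > L_worst
        live[worst] = sample_constrained(rng, logL[worst])
        logL[worst] = loglike(live[worst])
    return Z

# Toy check: uniform prior on [0, 1], L(x) = e^{-x}, for which the
# true evidence is Z = 1 - e^{-1} (about 0.632).
Z = nested_sampling(
    sample_prior=lambda rng: rng.random(),
    loglike=lambda x: -x,
    sample_constrained=lambda rng, logL_worst: -logL_worst * rng.random(),
)
```

The constrained sampler exploits the toy problem's structure: L(x) > L_worst is simply x < x_worst = −log L_worst, so a fresh uniform draw on that interval is an exact constrained prior sample.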

Exploration by Galilean Monte Carlo
The obvious beginners' MCMC procedure for moving from one acceptable location to another, while obeying detailed balance but not moving so far that Q always disobeys the lower bound Q*, is:

    Start at x with acceptable quality Q(x) ≥ Q*
    Repeat for length of trajectory
        Set v = isotropically random step
        If Q(x + v) ≥ Q* then move to x + v, else stay at x

However, randomising v every step is diffusive and slow, with net distance travelled increasing only as the square root of the number of steps. All locations within the constrained domain are equally acceptable, so the program might better try to proceed in a straight line, changing velocity only when necessary in an attempt to reflect specularly off the boundary (Figure 4, left). The user is asked to ensure that the imposed geometry makes sense in the context of the particular application, otherwise there will be no advantage.

The practice

With finite step lengths, it is not generally possible to hit the boundary exactly whilst simultaneously being sure that it had not already been encountered earlier, so the ideal path is impractical. Instead, we take a current location x and suppose that a corresponding unit vector n can be defined there as a proxy for where the surface normal would be if that surface were close at hand (Figure 4, right). Again, it is the user's responsibility to ensure that n makes sense in the context of the particular application: exploration procedures cannot anticipate the quirks of individual applications.
Reflection from a plane orthogonal to n (drawn horizontally in Figure 5) modifies an incoming velocity v (in northward direction from the South) to v′ = v − 2n(nᵀv). Depending on the circumstances, the incoming particle may proceed straight on to the North (+v), or be reflected to the East (+v′), or back-reflected to the West (−v′), or reversed back to the South (−v). Mistakenly, the author's earlier introduction of GMC in 2011 [4] reduced the possibilities by eliminating West, but at the cost of allowing the particle to escape the constraint temporarily, which damaged the performance and cancelled its potential superiority.
If the potential destination North is acceptable (bottom left in Figure 5), the particle should move there and not change its direction (so n need not be computed). Otherwise, the particle needs to change its direction but not its position. For a North-South oriented velocity to divert into East-West, either East or West must be acceptable, but not both, because East-West particles would then pass straight through without interacting with North-South, so the proposed diversion would break detailed balance. Likewise, for an East-West velocity to divert North-South, either North or South must be acceptable but not both. These conditions yield the following procedure:

    Start at x with acceptable quality Q(x) ≥ Q*
    Set v = isotropically random velocity
    Repeat for length of trajectory
        If Q(x + v) ≥ Q* then move North: x ← x + v
        Otherwise compute n at x and the reflection v′ = Rv:
            If Q(x + v′) ≥ Q* but not Q(x − v′) ≥ Q*, divert East: v ← +v′
            Else if Q(x − v′) ≥ Q* but not Q(x + v′) ≥ Q*, divert West: v ← −v′
            Else reverse to South: v ← −v

Any self-inverse reflection operator R will do, though the reflection idea suggests R = I − 2nnᵀ. That's Galilean Monte Carlo. The trajectory is explored uniformly, with each step yielding an acceptable (though correlated) sample.
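The procedure translates directly into code. The following Python sketch keeps the four outcomes North, East, West, South; the unit-disc constraint and radial normal at the end are hypothetical stand-ins for the user's acceptable(x) test and normal proxy n(x):

```python
import math

def reflect(v, n):
    """Specular reflection v' = v - 2 n (n.v) about the plane orthogonal to n."""
    d = sum(vi * ni for vi, ni in zip(v, n))
    return tuple(vi - 2.0 * d * ni for vi, ni in zip(v, n))

def gmc_trajectory(x, v, acceptable, normal, n_steps):
    """Galilean Monte Carlo sketch: move ballistically while acceptable,
    otherwise divert East/West or reverse South, preserving detailed balance."""
    add = lambda a, b: tuple(ai + bi for ai, bi in zip(a, b))
    neg = lambda a: tuple(-ai for ai in a)
    samples = []
    for _ in range(n_steps):
        if acceptable(add(x, v)):           # North: keep going, keep v
            x = add(x, v)
        else:
            vr = reflect(v, normal(x))
            east = acceptable(add(x, vr))
            west = acceptable(add(x, neg(vr)))
            if east and not west:           # divert East
                v = vr
            elif west and not east:         # divert West
                v = neg(vr)
            else:                           # reverse back to South
                v = neg(v)
        samples.append(x)                   # every step is an acceptable sample
    return samples

# Toy constraint: the unit disc, with outward radial normal.
inside = lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0
radial = lambda p: (p[0] / math.hypot(*p), p[1] / math.hypot(*p))
path = gmc_trajectory((0.5, 0.0), (0.24, 0.13), inside, radial, 200)
```

On reflection the position is unchanged, only the velocity; the particle tries again from the same location on the next step, exactly as in the procedure above.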

Compression and Exploration
GMC was originally designed for nested-sampling compression, from which probability distributions can be built up after a run by identifying quality as likelihood L in the weighted sequence (4) of successively compressed locations. However, GMC can also be used when exploring a weighted distribution directly.
For compression (standard nested sampling, Figure 6, left), only the domain size X is iterated, albeit under likelihood control.

Compression:
    Enter with X and L
    Set constraint L* = L defining X*
    Sample within L* to get X′ = uX*
    Exit with X′ and L′

For exploration (standard reversible MCMC, Figure 6, right), the likelihood is relaxed as well through a preliminary random number u ~ Uniform(0, 1).

Exploration:
    Enter with X and L
    Set constraint L* = uL defining X*
    Sample within L* to get X′ = u′X*
    Exit with X′ and L′

This is equivalent to standard Metropolis balancing "Accept x′ if L(x′) ≥ uL(x)", the only difference being that the lower bound uL is set beforehand instead of checked afterwards.
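The equivalence with Metropolis balancing can be seen in code: drawing u first and demanding L(x′) ≥ uL(x) is the same test as accepting afterwards with probability min(1, L(x′)/L(x)) under a symmetric proposal. A Python sketch with hypothetical loglike and proposal arguments:

```python
import math, random

def explore(loglike, proposal, x0, n_steps, seed=3):
    """MCMC with the acceptance bound set beforehand: draw u ~ Uniform(0,1)
    and demand L(x') >= u L(x), i.e. log L(x') >= log L(x) + log u.
    For a symmetric proposal this is exactly the Metropolis test."""
    rng = random.Random(seed)
    x, lx = x0, loglike(x0)
    chain = []
    for _ in range(n_steps):
        bound = lx + math.log(1.0 - rng.random())   # constraint L* = u L
        xp = proposal(rng, x)
        lp = loglike(xp)
        if lp >= bound:                             # proposal obeys L >= L*
            x, lx = xp, lp
        chain.append(x)
    return chain
```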

Exploration by Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) [3] uses a physical analogy with the kinetic theory of gases, in which a thermal distribution of moving particles has a position/velocity probability distribution that factorises into space and velocity parts, with potential energy V defining the spatial target distribution Pr(x) ∝ e^{−V(x)} and kinetic energy T = ½|v|² distributed as the Boltzmann thermal equilibrium Pr(v) ∝ e^{−T(v)}. The usual dynamics (Figure 7)

    dx/dt = v ,    dv/dt = −∇V(x)

relaxes an initial setting towards the joint equilibrium Pr(x, v) ∝ e^{−V(x)−T(v)} under occasional collisions which reset v according to Pr(v), leaving x as a sample from the target Pr(x). Between collisions, the force field is necessarily digitised into impulses at discrete time intervals δt, so the computation obeys

    v ← v − δt ∇V(x) ,    x ← x + δt v .

To make the trajectory reversible and increase the accuracy order, the impulses are halved at the start x and end x′, but even this does not ensure full accuracy because the dynamics has been approximated (Figure 8). To correct this, the destination x′, whose total energy E′ = V′ + T′ ought to agree with the initial E = V + T, is subjected to the usual Metropolis balancing.
In practice, the correction is often ignored because the (reversible) algorithm explores "level sets" whose contours are often an adequately good approximation to the true Hamiltonian, provided the fixed timestep is not too large. That's Hamiltonian Monte Carlo. The trajectory is explored non-uniformly, with successive steps lying closer together at altitude where the particles move more slowly, so that sampling is denser where the target density is smaller, a mismatch which needs to be overcome by the equilibrating collisions. And, of course, HMC requires the potential V(x) (the negative log-likelihood) to be differentiable and generally smooth.
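In outline, the leapfrog computation with its half-impulses and final Metropolis correction looks as follows. This is an illustrative Python sketch on a one-dimensional toy target, not tuned for any particular application:

```python
import math, random

def hmc_step(x, V, grad_V, dt, n_leap, rng):
    """One HMC trajectory: leapfrog integration with half-impulses at the
    two ends, followed by Metropolis balancing on the energy error."""
    v = rng.gauss(0.0, 1.0)                  # collision resets the velocity
    E0 = V(x) + 0.5 * v * v
    x1, v1 = x, v
    v1 -= 0.5 * dt * grad_V(x1)              # half impulse at the start
    for i in range(n_leap):
        x1 += dt * v1                         # force-free drift
        if i < n_leap - 1:
            v1 -= dt * grad_V(x1)             # full impulse in between
    v1 -= 0.5 * dt * grad_V(x1)               # half impulse at the end
    E1 = V(x1) + 0.5 * v1 * v1
    # Accept unless the digitised dynamics gained too much energy
    if math.log(1.0 - rng.random()) < E0 - E1:
        return x1
    return x

# Toy target: Pr(x) proportional to exp(-x^2/2), i.e. V(x) = x^2/2.
rng = random.Random(5)
samples, x = [], 0.0
for _ in range(4000):
    x = hmc_step(x, lambda t: 0.5 * t * t, lambda t: t, 0.2, 10, rng)
    samples.append(x)
```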

Compression versus Simulated Annealing
Simulated annealing uses a physical analogy with thermal equilibrium to compress from a prior probability distribution to a posterior. As in HMC, though without the complication of kinetic energy, the likelihood (or quality) is identified as the exponential L = e^{−E} of an energy. In annealing, the energy is scaled by a parameter β, so that the quality becomes Q = L^β = e^{−βE}, with β used to connect the posterior (where β = 1) with the prior (where β = 0).
This "simulates" thermal equilibrium at coolness (inverse temperature) β, and "annealing" refers to sufficiently gradual cooling from prior to posterior that equilibrium is locally preserved. A few lines of algebra, familiar in statistical mechanics, show that the evidence (or partition function) can be accumulated from successive thermal equilibria as

    log Z = ∫₀¹ ⟨log L⟩_β dβ    (13)

where ⟨log L⟩_β is the average log-likelihood as determined by sampling the equilibrium appropriate to coolness β. Equilibrium is defined by weights L^β and can be explored either by GMC or (traditionally) by HMC. There is seldom any attempt to evaluate the statistical uncertainty in log Z, the necessary fluctuations being poorly defined in the simulations.

At coolness β, the equilibrium distribution of locations x, initially uniform over the prior, is modulated by L^β so that the samples have probability distribution Pr(x) ∝ L(x)^β, which corresponds to

    Pr(log X) ∝ L(X)^β X

in terms of compression. Consequently, samples cluster around the maximum of β log L + log X, where the log L(log X) curve has slope −1/β (Figure 9, left). Clearly this only works properly if log L(log X) is concave. Any zone of convexity is unstable, with samples heading toward either larger L at little cost to X or toward larger X at little cost to L. A simulated-annealing program cannot enter a convex region, and the steady cooling assumed in (13) cannot occur. In the physics analogy, this behaviour is a phase change and can be exemplified by the transition from steam to water at 100 °C (Figure 9, right). Because of the different volumes (exponentially different in large problems), a simulated-annealing program will be unable to place the two phases in algorithmic contact, so will be unable to model the latent heat of the transition. Correspondingly, computation of the evidence Z will fail. Yet our world is full of interesting phase changes, and a method that cannot cope with them cannot be recommended for general use.
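The thermodynamic-integration identity (13) can be demonstrated on a toy problem where everything is known analytically. This Python sketch (all names hypothetical) accumulates log Z by the trapezoid rule over a cooling schedule, estimating each thermal average ⟨log L⟩_β by Monte Carlo:

```python
import math, random

def log_Z_annealing(log_L_avg, betas):
    """Thermodynamic integration: log Z = integral over beta in [0,1] of
    <log L>_beta, accumulated by the trapezoid rule over a cooling schedule."""
    logZ = 0.0
    for b0, b1 in zip(betas[:-1], betas[1:]):
        logZ += 0.5 * (log_L_avg(b0) + log_L_avg(b1)) * (b1 - b0)
    return logZ

# Toy: uniform prior on [0, 1], L(x) = e^{-x}.  At coolness beta the
# equilibrium is Pr(x) proportional to exp(-beta x) on [0, 1], sampled
# exactly by inverse CDF, and <log L>_beta = <-x>.
def mean_log_L(beta, n=4000, rng=random.Random(4)):
    total = 0.0
    for _ in range(n):
        u = rng.random()
        if beta == 0.0:
            x = u                              # prior sample
        else:
            x = -math.log(1.0 - u * (1.0 - math.exp(-beta))) / beta
        total += -x
    return total / n

betas = [i / 20 for i in range(21)]
logZ = log_Z_annealing(mean_log_L, betas)
# Exact answer for comparison: log(1 - e^{-1}), about -0.459.
```

This toy likelihood is concave in the log L(log X) sense, so annealing succeeds; the point of the discussion above is that a convex zone would break the steady cooling that this sketch assumes.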
Nested sampling, on the other hand, compresses steadily with respect to the abscissa log X regardless of the (monotonic) behaviour of log L, so is impervious to this sort of phase change. Contrariwise, simulated annealing cools through the slope, which need not change monotonically. By using thermal equilibria which average over e^{−βE}, simulated annealing views the system through the lens of a Laplace transform, which is notorious for its ill-conditioning. Far better to deal with the direct situation.

Conclusions
The author suggests that, just as nested sampling dominates simulated annealing, so Galilean Monte Carlo dominates Hamiltonian Monte Carlo. In each case, reverting to elementary principles by discarding physical analogies enhances generality and power.