Massively-Parallel Molecular Dynamics Simulation of Clathrate

Massively-parallel classical equilibrium molecular dynamics (MD) simulations have been performed to investigate the computational performance of the Simple Point Charge (SPC) model and single-particle model of Molinero et al. applied to simulation of methane hydrates, using systems consisting of several million particles, on a variety of Blue Gene/L, P and Q platforms. It was found that the newer Blue Gene/Q platform offers attractive performance for massively-parallel simulation.


Introduction
Clathrate hydrates are non-stoichiometric crystalline inclusion compounds in which a water host lattice encages small guest atoms or molecules in cages; the empty lattice is thermodynamically unstable, and owes its existence to hydrogen bond stabilisation resulting from the enclathration of the trapped solutes in its cavities [1,2].There are three known common hydrate structures: (s)I, II and H.In type I hydrates, the unit cell is formed from two small 5 12 pentagonal dodecahedral cavities and six slightly larger tetrakaidecahedral 5 12 6 2 cages [1,2].Methane hydrates are the most widespread type, and are thought to exist in nature primarily as type I in the permafrost and deep ocean regions; this has driven significant interest in their use as a resource, as a potential geohazard, their role in the global carbon cycle, and considerations of their potential influence on climate change [1][2][3][4].Progress in research in these areas depends fundamentally on the availability of high-quality property data and an increased understanding of underlying physics and chemistry governing clathrates in nature.Molecular dynamics (MD) simulation offers a detailed understanding of underlying molecular mechanisms of OPEN ACCESS hydrate behaviour, and the use of massively-parallel High-Performance Computing (HPC) platforms offers obvious potential for consideration of the larger system sizes needed for more realistic modelling, e.g., of kinetic properties, where artifacts of smaller system sizes may become manifest rather clearly under periodic boundary conditions (PBC) for smaller systems used historically in MD simulation [5].The kinetic mechanisms of clathrate hydrate crystallisation and dissociation, especially at a molecular level, are understood rather poorly.Theoretical predictions of these rates could differ from experimentally measured ones by at least an order of magnitude [1][2][3][4]; improving the understanding of such mechanisms is desirable in terms of enhancing the viability of large-scale methane production from hydrates [3,4].Equally importantly, the use of state-of-the-art massively-parallel HPC platforms for MD modelling of dynamical properties of clathrates has the implication that once-prohibitive, or "heroic", system sizes composed of many millions of particles can now be modelled more routinely.To this end, massively-parallel HPC systems offer the potential to model very large hydrate systems and are the ideal complement to specialised hardware platforms for acceleration of hydrate simulation timescales [6].One of the goals of this article is to highlight the performance possible for calculations on large-scale systems of hydrates, and demonstrate the utility of this for determination of hydrate properties in a routine and power-efficient manner.
In parallel to developments to our understanding of hydrates, the development of massively-parallel MD in recent years on Blue Gene (BG) architectures from the first-generation BG/L, second-generation BG/P to the most recent BG/Q [7], has seen a marked increase in feasible system sizes for simulation towards petaflop-scale operation on billions of particles, with relatively low power consumption.The earlier development of IBM's Blue Matter classical MD software suite [8] and its efficient deployment on BG/L [9] demonstrated that sustained MD performance at a fraction of a petaflop was possible with reasonable parallel scaling on the order of a thousand MPI ranks, and was an important development in massively-parallel MD; this allowed systems with millions of particles, or more, to be simulated on a relatively routine basis largely for the first time.Several popular community MD codes using spatial-decomposition parallelisation were deployed on BG/L's toroidal network arrangement, such as LAMMPS and NAMD [10], with reasonable parallel efficiencies.More recently, a variety of community MD codes have been employed relatively efficiently across one or multiple 1024-node cabs of BG/P, e.g., NAMD [11], with optimisation of code or more efficient mapping of algorithms thereon, especially mapping three-dimensional FFT routines on the BG network topology [12] for (Particle-Particle) Particle-Mesh-Ewald electrostatics.Various applications of massively-parallel MD making use of BG/P have been reported, inter alia the non-equilibrium simulation of electrophoretic deposition of light-conducting polymers onto arrays of nanorods in systems approaching a million atoms in size [13].In the past year or so, the introduction of the BG/Q architecture to the MD community has allowed for the use of even larger numbers of MPI ranks, with, inter alia, LAMMPS [14] and NAMD [15] being ported and mapped onto this architecture to run relatively efficiently.For LAMMPS, using spatial decomposition, OpenMP multi-threading allows up to 64 threads per 16-processor node [14], while pure-MPI deployment of LAMMPS can, in principle and practice, allow for (approaching) around a million MPI ranks with reasonable parallel scaling.NAMD, using hybrid force/spatial decomposition, can be scaled with reasonable efficiency depending on system size on BG/P and /Q, and there is an SMP and memory-optimised version, with up to several million atoms on the SMP version adequate for the construction/implementation of the PME calculation.
A goal of this study was to compare and contrast the performances of popular rigid water models, namely the Simple Point Charge (SPC) [16] and the single-molecule, tetrahedrally-biased model of Molinero et al.-often called the "mW" potential [17], for large hydrate systems consisting of up to millions of particles on Blue Gene platforms.This allows for the comparison of performance and scaling for two distinct representations of water: atomistic and more coarse-grained (with no explicit electrostatics in the latter case), where a different order of time steps may be used.

Simulation Methodology
Classical MD [18] was carried out on various BG/L, /P and /Q platforms using LAMMPS with the mW Stillinger-Weber-based single-particle potential for water and methane [19], while NAMD was used for an atomistic representation with the SPC water model and a single-site Lennard-Jones representation of methane [20].The cut-off radius for real-space interactions was set at 8.5 Å, and the time step used was 1 fs for an atomist representation, and 10 fs for the mW coarse-grained approach.The Smooth Particle Mesh Ewald (SPME) technique [21] was used to treat long-range electrostatic interactions for the atomistic representation, whilst no electrostatics were required for the mW approach, ipso facto.Pure-MPI versions of LAMMPS and NAMD were run on BG/L and /P (with both one and two 1024-node racks in the latter case), using all available processors per node assigned to one MPI task each (i.e., "virtual-node" operation, given the computation-dominated nature of the simulations vis-à-vis communication overhead).On BG/Q, both pure-MPI and hybrid MPI/OpenMP versions of LAMMPS were run, with the latter allowing for hyper-threading on up to four threads per processor (i.e., up to 64 threads per 16-processor node).For NAMD, both SMP and memory-optimised versions were run, to allow for larger systems to be modeled with the latter.
NAMD was chosen for the atomistic representation due to its better relative scaling where three-dimensional FFT approaches are required for long-range, reciprocal-space electrostatics.However, LAMMPS was found to offer very impressive performance and scaling for shorter-ranged potentials, and, as such, is a highly suitable code for use with a single-particle, (formally) electrostatics-free, Stillinger-Weber-based approach such as that used in the mW potential.This disparity was found to be especially the case on BG/P, for use of more than one rack.However, the faster interconnect and lower communication-latency times along the 5D network torus on BG/Q tend to render LAMMPS more competitive for improved scaling with longer-ranged potentials vis-à-vis BG/P, but still not to the same extent as NAMD for the systems used in this study, particularly considering the memory-optimised version of NAMD for systems larger than several million particles.
The starting configuration was based on the X-ray diffraction analysis of structure I ethylene oxide hydrate by McMullan and Jeffrey [22], which provides the positions of the oxygen atoms and the centres of mass of the methane molecules.For the atomistic representation of SPC water, the initial orientations of the water molecules were selected in a random manner to conform to the Bernal-Fowler rules [23] and to have a vanishingly small total dipole moment.The initial cubic unit-cell length was set at 12.03 Å, consistent with that experimentally at −25 °C [22], and the cell was fully occupied by methane molecules.Periodic replication of the unit cell was undertaken so as to create cubic systems containing up to several million particles.The systems studied consisted of: 10 3 , 14 3 , 18 3 unit cells for BG/L (~0.054-0.315 million molecules); 14 3 , 18 3 , 22 3 unit cells for BG/P (~0.148-0.747 million molecules); and 22 3 , 26 3 , 32 3 unit cells for BG/Q (~0.575-1.77 million molecules).These were simulated for around 0.5 ns in the NPT ensemble at 260 K and 50 bar.

Results and Discussion
The simulation speeds in time-steps per second (tps) are plotted in Figure 1 for both atomistic and coarse-grained representations on the three different platforms, in terms of "sensible" system sizes and number of processors used for each platform (or, in the case of LAMMPS on BG/Q, the number of threads, given that quad-level threading could be carried out per physical processor for the hybrid MPI-OpenMP implementation thereon).Ideal scaling curves are also presented for the smallest systems on each platform.In general, there is an overall trend towards higher absolute performance and better parallel scaling to higher processor-counts as one progresses from BG/L to /Q, as well as the increasing system sizes that can be handled successfully with reasonable relative performance considerations.As is the case for all platforms, systems and codes, there is a "lower-bound" sensible system size for deployment to reasonable parallel scaling on larger numbers of processors.Although it is somewhat subjective to consider a parallel efficiency in the region of 70% for the largest versus smallest processor-count, this is usually a reasonable metric in practice, and these system sizes are, respectively, 14 3 , 18 3 and 22 3 unit cells for the maximal processor/thread count on BG/L, /P and /Q, respectively.From inspection of Figure 1, it is readily evident that at lower processor-counts, LAMMPS is faster than NAMD, which should hardly be surprising given the lack of electrostatics in the short-ranged, coarse-grained potential with roughly three times less particles than an atomistic representation in NAMD.Although LAMMPS is still faster in absolute terms on BG/L and /P, there is poorer parallel scaling vis-à-vis NAMD, so that an atomistic representation in NAMD becomes competitive in absolute performance (time-steps per second, albeit needing to use a time step perhaps almost an order of magnitude smaller) at the maximal processor count (run over two racks for BG/P).There is slightly more of a fall-off in parallel efficiency for LAMMPS when extending to a second cab of BG/P, owing to communications in the code architecture and the underlying network topology of a second rack.NAMD performs better in this respect.On BG/Q, however, apart from the faster clock-speed performance of the processors, the 5D network torus leads to a tangible increase in relative scaling performance for LAMMPS with faster inter-thread communications, such that it retains its absolute-performance "edge" to a large extent vis-à-vis NAMD (to mention nothing of the underlying physical approximation inherent in the coarse-grained approach).Even so, the possibility of up-to quad-level hyper-threading for MPI-OpenMP LAMMPS on BG/Q is obviously highly attractive, as is the use of a substantially larger time-step for the coarse-grained approach.
It ought to be mentioned that the SMP version of NAMD on BG/Q was used to plot Figure 1 here.However, slightly better performance was obtained with the memory-optimised version.For much more than several million atoms, sizes not tested here, it is believed that the SPME implementation on NAMD benefits substantially from this code architecture.for the atomistic representation in NAMD (left) and coarse-grained approach in LAMMPS (right)."i.s." denotes ideal scaling, for reference."P" denotes number of processes, where this is understood to mean number of threads for LAMMPS on BG/Q (due to quad-threading with 64 threads per CPU thereon).
It was found that the configurational energies and densities of the SPC-and corase-grained-modelled hydrates were in reasonable accord with previous simulations with the same potentials.The cage radii were determined.The radius r i (t) for cage i was defined as the averaged distance of each constituent water molecules' centre-of-mass (or single particle for the coarse-grained approach) from the cavity's hypothetical "centre-point" r C (t), with the central point computed from the sum of the cage's water molecules' centres-of-mass, i.e.: where N T,i denotes the number of constituent water molecules in the cage i of type T (i.e., 20 for small cavities and 28 for large cages).The cages were found to be stable in structure throughout, and the identity of their constituent water molecules did not vary.The time-averaged 5 12 cage radii were found to be 3.96 ± 0.04 and 3.93 ± 0.02 Å for SPC-and coarse-grained-modelled hydrate, whilst the corresponding respective results for 5 12  Prior to making overall conclusions on the utility of Blue Gene architectures for large-system simulations, it is worthwhile considering from a broader perspective the outlook for their (potential) impact on gas hydrate simulation.At the present time, it is the case that usage of long-range (mesh-type) electrostatics in LAMMPS offers less parallel or absolute efficiency and performance than would be desired, particularly in comparison to NAMD.Naturally, this has implications for simulations of hydrates where (long-range) electrostatics are indeed required with the current generation of force-fields; obvious examples would include clathrates with "point-charged" guests that cannot be represented feasibly as single particles, e.g., CO 2 hydrates, or those which are dipolar, e.g., H 2 S hydrates [25]; a potential "work-around" may be to make liberal use of spherical cut-offs within the sub-domains allocated to each physical/virtual process, or to use truncated and "smoothed" electrostatics schemes [26,27] to "force" the spatial "decay" within the designated sub-domain; however, the limitations and system size-dependence of such a variety of electrostatics schemes applied to methane hydrate simulation has been discussed previously, especially for estimation of thermal conductivity [28].Massively-parallel MD using Blue Gene, or potentially other comparable HPC platforms, may be to used to simulate systems with a strong size-dependence or finite-size effects, e.g., dissociation or growth of hydrate nano-crystals in the absence or presence of silica (nano-scale) surfaces, thermal conductivity estimation of (defective) hydrate nano-crystals, or heterogeneous nucleation studies of sub-critical hydrate crystallites near the water-guest interface (possibly mediate by nano-scale silica).In terms of accessible time scales for, say, methane hydrate simulation on Blue Gene platforms, simulation of about one million molecules on BG/Q using LAMMPS with the coarse-grained potential leads to around 30 ns per 12-hour slot with a 12.5 fs time step with 8192 processes, whilst around 2.5 ns can be simulated using atomistic potentials with NAMD and full electrostatics and a 2 fs time step in the same time.Therefore, for up to, say, a dozen such time-slots, one could simulate feasibly via a coarse-grained potential on around one million particles the dissociation and growth of hydrate nano-crystals, as well as heterogeneous nucleation of mildly sub-critical-sized hydrate nuclei, possibly near nano-scale silica surfaces.Where electrostatics and system-size are important, or for more accurate assessment of shorter-time responses to pressure or thermal "shocks", e.g., for accurate estimation of thermal (conduction) properties of (defective) hydrate nano-crystals, possibly at silica surfaces, or aggressive pressure-/temperature-driven melting of such particles, then full atomistic simulation over tens of nanoseconds with several million atoms would be ideal, using NAMD on BG/Q.Certainly, the investigation of hydrate system response to electric or electromagnetic fields [1,29] requires explicit coupling of the Lorentz forces to point charges via non-equilibrium MD [30,31], and study of such a perturbation would require full, long-range electrostatics; in this case, a relatively aggressive external field intensity of the order of 0.025-0.04V/Å would be expected to lead to tangible effects in a hydrate-liquid system, possibly at a silica interface, within tens of nanoseconds, which is indeed feasible using NAMD on BG/Q.Of course, reasonable scaling to ~32k processes has also been found on BG/Q for around a million molecules in these cases, meaning that with around a dozen 12-hour slots, around 1 µs and 100 ns are easily accessible for the coarse-grained and atomistic representations, respectively.
Considering the outlook for Blue Gene or massively-parallel HPC access, as well as operational considerations, at the present time, the usage of Blue Gene architectures is not particularly widespread amongst the molecular simulation community, or those within this with an interest in (larger-scale) simulation of gas hydrates.However, given the large capital and running costs of such high-end architectures at present, they are often available to the community for production simulation (typically several to a few dozen millions of core-hours) on a shared-access basis via competitive submission of scientific proposals to national or supra-national peta-scale computing initiatives (e.g., U.S. DOE INCITE programme or PRACE Tier-0 production access schemes).However, the possibility of acquiring several hundred thousand to around a million core-hours via more discretionary access schemes for the purposes of benchmarking simulations, or perhaps very limited production runs, is increasingly available.Certainly, the prospect of greater access towards massively-parallel HPC architectures, offering tens of thousands of processors (whether physical or partially virtualised), via shared-access or cloud-computing arrangements, will increase dramatically in the coming decade, so as to become largely routine; in this sense, what may appear to be pioneering today on Blue Gene platforms for massively-parallel MD will be expected to become established practice relatively quickly, probably within a few years.Indeed, the impact of shared-access and cloud-computing arrangements has already had a larger effect on HPC than many in the high-end computational simulation community would have predicted even half a decade ago.Simulation set-up on such massively-parallel platforms today do require some specialist HPC knowledge on the part of the practitioner, in terms of RAM availability, parallelisation strategy, hyper-threading, compatibility of data-flows across network topology, and, often, algorithmic and coding re-design.However, the tuning and deployment of popular MD packages such as LAMMPS and NAMD onto such HPC architectures has led to simulation set-ups quite similar to smaller molecular systems of hundreds to perhaps a few tens of thousands of atoms used by the molecular simulation community today on serial architectures, or those with perhaps a few dozen separate processes; the key challenge for the "peta-scale" practitioner is to consider carefully the accommodation of molecular systems into available RAM for the given system parallelisation strategy, and then benchmark and optimise carefully performance with various different mappings of the computational load on to the physical HPC architecture, and any available hyper-threading arrangements.Naturally, hard-disk space for back-up and follow-up analysis on simulation trajectories pose additional challenges, and data-compression practices and parallelisation of analysis codes are particularly important here (usually on local computing facilities).

Conclusions
It has been found that Blue Gene computing may be used with some profit for a variety of massively-parallel clathrate hydrate systems consisting of the order of a million molecules or more, with an attractive speed-up for a coarse-grained representation, provided care is taken with minimal system size for effective parallelisation and optimal parallel scaling for appropriate system sizes.Given that the power draw of the individual processors is relatively low in comparison to ×86_64 architectures (which may or may not also feature GPUs), it can be seen that Blue Gene computing can be relatively attractive for systems containing millions of particles.Therefore, BG platforms are of much use for those wishing to study hydrate systems, e.g., two-or-more-phase systems for nucleation or growth where periodic size artefacts may be very important, particularly once the codes and levels of performance for different system sizes and processor/thread counts have been established for efficient operation, from both an absolute-performance and relative-scaling perspective.However, given that time cannot be parallelised, long-time MD simulation may also be of interest to the hydrate-simulation community on GPU and other specialised platforms, where it is hoped that millisecond timescales may become routinely accessible by the community within the next decade or so [6].Certainly, the availability of the coarse-grained potential [17,19] is also a significant development in this respect.

Figure 1 .
Figure 1.Time steps per second (t.p.s.) for BG/L (top), BG/P (middle) and BG/Q (bottom)for the atomistic representation in NAMD (left) and coarse-grained approach in LAMMPS (right)."i.s." denotes ideal scaling, for reference."P" denotes number of processes, where this is understood to mean number of threads for LAMMPS on BG/Q (due to quad-threading with 64 threads per CPU thereon).
6 2 cages were 4.40 ± 0.06 and 4.38 ± 0.05 Å.The cage radii are in good agreement with respective experimental data at 250 K for small and large cage radii of 3.90 and 4.33 Å [2] and 3.91 and 4.34 Å [24].