Accurate Sampling with Noisy Forces from Approximate Computing

In scientific computing, the acceleration of atomistic computer simulations by means of custom hardware is finding ever growing application. A major limitation, however, is that the high efficiency in terms of performance and low power consumption entails the massive usage of low-precision computing units. Here, based on the approximate computing paradigm, we present an algorithmic method to rigorously compensate for numerical inaccuracies due to low-accuracy arithmetic operations, yet still obtaining exact expectation values using a properly modified Langevin-type equation.


INTRODUCTION
Molecular dynamics (MD) is a very powerful and widely used technique to study thermodynamic equilibrium properties, as well as the real-time dynamics of complex systems made up of interacting atoms [Alder and Wainwright 1957].This is done by numerically solving Newton's equations of motion in a time-discretized fashion via computing the nuclear forces of all atoms at every time step [Rahman 1964].Computing these forces by analytically differentiating the interatomic potential with respect to the nuclear coordinates is computationally rather expensive, which is particularly true for electronic structure based ab-initio MD simulations [Car and Parrinello 1985;Kühne 2014;Kühne et al. 2007;Payne et al. 1992].
For a long time newly developed microchips became faster and more efficient over time due to new manufacturing processes and shrinking transistor sizes.However, this development slowly comes to an end as scaling down the structures of silicon based chips becomes more and more difficult.The focus therefore shifts towards making efficient use of the available technology.Hence, beside algorithmic developments [Gonnet 2007[Gonnet , 2012;;John et al. 2016;Kühne and Prodan 2018;Shan et al. 2005;Shaw 2005;Snir 2004;Tuckerman et al. 1992], there have been numerous custom computing efforts in this area to increase the efficiency of MD simulations by means of hardware acceleration, which we take up in this work.Examples of the latter are MD implementations on graphics processing units (GPUs) [Abraham et al. 2015;Anderson et al. 2008;Brown et al. 2012;Colberg and Höfling 2011;Eastman and Pande 2010;Le Grand et al. 2013;Stone et al. 2010], fieldprogrammable gate arrays (FPGAs) [Herbordt et al. 2008a,b], and application-specific integrated circuits (ASICs) [Shaw et al. 2007[Shaw et al. , 2014]].While the use of GPUs for scientific applications is relatively widespread [Owens et al. 2008;Preis et al. 2009;Weigel 2012], the use of ASICs [Boyle et al. 2005;Brown and Christ 1988;Fukushige et al. 1999;Hut and Makino 1999] and FPGAs is less common [Baity-Jesi et al. 2014;Belletti et al. 2009;Giefers et al. 2014;Kenter et al. 2017Kenter et al. , 2018;;Meyer et al. 2012], but gained attention over the last years.In general, to maximize the computational power for a given silicon area, or equivalently minimize the power-consumption per arithmetic operation, more and more computing units are replaced with lower-precision units.This trend is mostly driven by market considerations of the gaming and artificial intelligence industries, which are the target customers of hardware accelerators and naturally do not absolutely rely on full computing accuracy.
In the approach presented in this paper, we make effective use of low-precision special-purpose hardware for general-purpose scientific computing by leveraging the approximate computing (AC) paradigm [Klavík et al. 2014;Plessl et al. 2015].The general research goal of AC is to devise and explore ingenious techniques to relax the exactness of a calculation to facilitate the design of more powerful and/or more efficient computer systems.However, in scientific computing, where the exactness of all computed results is of paramount importance, attenuating accuracy requirements is not an option.Yet, we will demonstrate that it is nevertheless possible to rigorously compensate for numerical inaccuracies due to low-accuracy arithmetic operations and still obtain exact expectation values as obtained by ensemble averages of a properly modified Langevin equation.
The remainder of the paper is organized as follows.In Section 2 we revisit the basic principles of AC before introducing our modified Langevin equation in Section 3. Thereafter, in Section 4, we describe the computational details of our computational experiments.Our results are presented and discussed in Section 5 before concluding the paper in Section 6.

APPROXIMATE COMPUTING
A basic method of approximation and a key requirement for efficient use of processing hardware is the use of adequate data widths in computationally intensive kernels.While in many scientific applications the use of double-precision floating-point is most common, this precision is not always required.For example, iterative methods can exhibit resilience against low precision arithmetic as has been shown for the computation of inverse matrix roots [Lass et al. 2018a] and for solving systems of linear equations [Angerer et al. 2016;Haidar et al. 2018Haidar et al. , 2017;;Klavík et al. 2014].Mainly driven by the growing popularity of artificial neural networks [Gupta et al. 2015], we can observe growing support of low-precision data types in hardware accelerators.In fact, recent GPUs targeting the data center have started supporting half-precision as well, nearly doubling the peak performance compared to single-precision and quadrupling it compared to double-precision arithmetics [NVIDIA Corporation 2016].However, due to the low number of exponent bits, half-precision only provides a very limited dynamic range.In contrast, bfloat16 provides the same dynamic range as singleprecision, and just reduces precision.It is currently supported by Google's Tensor Processing Units (TPU) [The Next Platform 2018] and support is announced for future Intel Xeon processors [Top 500 2018] and Intel AgileX FPGAs.A list of commonly used data types, together with the corresponding number of bits used to store the exponent and the mantissa, are shown in Table 1.
Yet, programmable hardware such as FPGAs, as a platform for custom-built accelerator designs [Kenter et al. 2012[Kenter et al. , 2014;;Strzodka and Goddeke 2006], can make effective use of all of these, but also entirely custom number formats.Developers can specify the number of exponent and mantissa bits and trade off precision against the amount of memory blocks required to store values and the number of logic elements required to perform arithmetic operations on them.
In addition to floating-point formats, also fixed-point representations can be used.Here, all numbers are stored as integers of fixed size with a predefined scaling factor.Calculations are thereby performed using integer arithmetic.On CPUs and GPUs only certain models can perform 2019-07-22 00:29.Page 2 of 1-10.integer operations with a peak performance similar to that of floating-point arithmetic, depending on the capabilities of the vector units / stream processors.Nevertheless, FPGAs typically can perform integer operations with performance similar to or even higher than that of floating-point.Due to the high flexibility of FPGAs with respect to different data formats and the possible use of entirely custom data types, we see them as the main target technology for our work.For this reason, we consider both floating-point and fixed-point arithmetic in the following.

METHODOLOGY
To demonstrate the concept of approximate computing, we introduce white noise to the interatomic forces that are computed while running the MD simulation.In this section, we describe in detail on how we introduce the noise to mimic in software the behaviour that would be observed when running the MD on the actual FPGA or GPU hardware with reduced numerical precision.We classify the computational errors into two types: fixed-point errors, and floating-point errors.
Assuming that F I are the exact and F N I the noisy forces from a MD simulations with low precision on an FPGA for instance, fixed-point errors can by modelled by whereas floating-point errors are described by Therein, c 1 , c 2 and c 3 are random values chosen in the range [-0.5, 0.5], whereas F x I , F y I and F x I are the individual force components of F I , respectively.The floating-point scaling factor is denoted as α and the magnitude of the applied noise by β.
To rigorously correct the errors introduced by numerical noise we employ a modified Langevin equation.In particular, we model the force as obtained by a low-precision computation on a GPU or FPGA-based accelerator as where Ξ N I is an additive white noise for which where R I are the nuclear coordinates (the dot denotes time derivative), M I are the nuclear masses and γ N is a damping coefficient, which is chosen to compensate for Ξ N I .The latter, in order to guarantee an accurate canonical sampling, has to obey the fluctuation-dissipation theorem Substituting Eq. 3 into Eq. 5 results in the desired modified Langevin equation which will be used throughout the remaining of this paper.This is to say that the noise, as originating from a low-precision computation, can be thought of as the additive white noise of a thitherto unknown damping coefficient γ N , which satisfies the fluctuation-dissipation theorem of Eq. 6.The specific value of γ N is determined in such a way so as to generate the correct average temperature, as measured by the equipartition theorem

COMPUTATIONAL DETAILS
To demonstrate our approach we have implemented it in the CP2K suite of programs [Hutter et al. 2014].More precisely, we have conducted MD simulations of liquid Silicon (Si) at 3000 K using the environment-dependent interatomic potential of Bazant et al. [Bazant and Kaxiras 1996;Bazant et al. 1997].All simulations consisted of 1000 Si atoms in a 3D-periodic cubic box of length 27.155 Å.
Using the algorithm of Ricci and Ciccotti [Ricci and Ciccotti 2003], Eq. 5 was integrated with a discretized timestep of 1.0 fs with γ N = 0.001 fs −1 .Whereas the latter settings were used to compute our reference data, in total six different cases of fixed-point and floating-point errors were investigated by varying the exponent β between 0 (huge noise) and 3 (tiny noise) that is, ranging from 1/1000 of the physical force up to the same magnitude as the force.As already alluded to above, the additive white noise is compensated by means of Eq. 7 by varying γ N on-the-fly using a Berendsen-like algorithm until the equipartition theorem of Eq. 8 is satisfied [Berendsen et al. 1984;Khaliullin and Kühne 2013;Kühne et al. 2009].Alternatively, γ N can be computed by integrating the autocorrelation function of the additive white noise [Scheiber et al. 2018].In Table 2 the

RESULTS AND DISCUSSION
As can be directly deduced from Table 2, the smaller values of γ N for a given β immediately suggest the higher noise resilience when using floating-point as compared to fixed-point numbers.In Figs. 1 and 2, the Si-Si partial pair-correlation function д(r ), which describes how the particle-density varies as a function of distance from a reference particle (atoms, molecules, colloids, etc.), as computed using an optimal scheme for orthorombic systems [Röhrig and Kühne 2013], is shown for different values of β.As can be seen, for both fixed-point and floating-point errors, the agreement with our reference calculation is nearly perfect up to the highest noise we have investigated.As already anticipated earlier, the usage of floating-point errors is not only able to tolerate higher noise levels, but is also throughout more accurate.
To verify that the sampling is indeed canonical, in Fig. 3 the actual kinetic energy distribution as obtained by our simulations using noisy forces is depicted and compared to the analytic Maxwell distribution.It is evident that if sampled long enough, not only the mean value, but also the 2019-07-22 00:29.Page 5 of 1-10.distribution tails are in excellent agreement with the exact Maxwellian kinetic energy distribution, which demonstrates that we are indeed performing a canonical sampling.To further assess the accuracy of the present method, we expand the autocorrelation of the noisy forces, i.e.
Since the cross correlation terms between the exact force and the additive white noise is vanishing due to Eq. 4, comparing the autocorrelation of the noisy forces ⟨F N I (0)F N I (t)⟩ with the autocorrelation of the exact forces ⟨F I (0)F I (t)⟩ permits to assess the localization of the last term of Eq. 9c.The fact that ⟨F N I (0)F N I (t)⟩ is essentially identical to ⟨F I (0)F I (t)⟩, as can be seen in Fig. 4, implies that ⟨Ξ N I (0)Ξ N I (t)⟩ is very close to a δ -function as required by the fluctuation-dissipation theorem in order to ensure an accurate canonical sampling of the Boltzmann distribution.In other words, from this it follows that our initial assumption underlying Eq. 7, to model the noise due to a low-precision calculation as an additive white noise channel, is justified.

CONCLUSION
We conclude by noting that the present method has been recently implemented in the universal force engine i-PI [Kapil et al. 2019], which can be generally applied to all sorts of forces affected by stochastic noise such as those computed by GPUs or other hardware accelerators [Abraham et al. 2015;Anderson et al. 2008;Brown et al. 2012;Colberg and Höfling 2011;Eastman and Pande 2010;Le Grand et al. 2013;Stone et al. 2010], and potentially even quantum computing devices [Benhelm et al. 2008;Chow et al. 2014;Knill 2005;Steane 1999].The possibility to apply similar ideas to N-body simulations [Efstathiou et al. 1985;Hernquist et al. 1993] and to combine it with further algorithmic approximations [Lass et al. 2018b] is to be underlined and will be presented elsewhere.
) holds.Throughout, ⟨• • • ⟩ denotes Boltzmann-weighted ensemble averages as obtained by the partition function Z = Tr exp(−E/k B T ), where E is the potential energy, k B the so-called Boltzmann constant, and T the temperature.Given that Ξ N I is unbiased, which in our case is true by its very definition, it is nevertheless possible to accurately sample the Boltzmann distribution by means of 2019-07-22 00:29.Page 3 of 1-10.
resulting values of γ f ix N for fixed-point and γ f loat N for floating-point errors are reported as a function of β. 2019-07-22 00:29.Page 4 of 1-10.

Fig. 1 .Fig. 2 .
Fig.1.Partial pair correlation function for liquid Si at 3000 K with noisy forces introduced by fixed-point errors of magnitude 10 −3 (blue), 10 −2 (green) and 10 −1 (red).For comparison, the results, as obtained by our reference calculations without noise, are shown in black.

Fig. 3 .Fig. 4 .
Fig.3.Kinetic energy distribution of liquid Si at 3000 K, as obtained by our simulations using noisy forces (circles).For comparison the analytic Maxwell distribution is also shown (line).

Table 1 .
Common floating-point formats