Systematic Coarse-Grained Models for Molecular Systems Using Entropy †

: The development of systematic coarse-grained mesoscopic models for complex molecular systems is an intense research area. Here we ﬁrst give an overview of different methods for obtaining optimal parametrized coarse-grained models, starting from detailed atomistic representation for high dimensional molecular systems. We focus on methods based on information theory, such as relative entropy, showing that they provide parameterizations of coarse-grained models at equilibrium by minimizing a ﬁtting functional over a parameter space. We also connect them with structural-based (inverse Boltzmann) and force matching methods. All the methods mentioned in principle are employed to approximate a many-body potential, the (n-body) potential of mean force, describing the equilibrium distribution of coarse-grained sites observed in simulations of atomically detailed models. We also present in a mathematically consistent way the entropy and force matching methods and their equivalence, which we derive for general nonlinear coarse-graining maps. We apply, and compare, the above-described methodologies in several molecular systems: A simple ﬂuid (methane), water and a polymer (polyethylene) bulk system. Finally, for the latter we also provide reliable conﬁdence intervals using a statistical analysis resampling technique, the bootstrap method.


Introduction
The enormous range of length and time scales involved in complex materials presents a challenging computational task, mainly, due to a wide range of relaxation times. A standard methodology to overcome problems of long relaxation times is to abandon the chemical detail and describe the molecular system by fewer degrees of freedom. Thus, systematic coarse-grained (CG) models are developed by averaging out the details at the molecular level, and by representing groups of atoms by a single CG particle. The challenge is to derive reliable coarse models both for reproducing the structural and the dynamical properties of systems. That is, to identify and effective approximate force field, approximating the potential of mean force (PMF), and then approximations to kinetic coefficients such as the friction.
Methods to approximate the PMF are well studied in the literature. Examples include: (a) The Boltzmann inversion methods, also known as structural-based, which rely on matching the radial distribution function [1][2][3][4][5][6]. (b) The information theory based variational inference method relies on the minimization of the relative entropy (RE) between the configurational distributions of the system and the approximate one, [7][8][9][10]. (c) The Force Matching (FM) relies on minimizing the distance between the forces exerted on the CG particles and the approximate ones [11][12][13]. Recently, we have introduced a path-space variational inference methods were introduced, capable of inferring dynamical models of coarse-grained systems, [9,14]. There the Relative Entropy Rate (RER) is defined as the appropriate quantity to infer the coarse dynamics for stationary system, while the path space force matching is introduced.
The purpose of the current work is to present a short review of the information theoretic methodologies ( relative entropy, and relative entropy rate) and their relation to the force matching and path-space force matching methodologies, through the application to different molecular systems.

Methodology
Let a prototypical problem of N classical atoms in a box of volume V at temperature T. We denote q = (q 1 , ..., q N ) ∈ R 3N the position vector and p = (p 1 , ..., p N ) ∈ R 3N the momentum vector of the N atoms. The probability of an elementary configuration q is given by the Gibbs probability, where U(q) is potential energy of a state q, Zis the normalization constant (partition function), and β = 1 k B T with k B the Boltzmann constant and T the temperature. In the above relation the kinetic part of the Hamiltonian has been integrated out. Coarse-graining (CG) is a standard methodology to overcomes the large range of length and time scales by averaging out the details of the atomistic level at the molecular level through representing groups of atoms by a single particle. The CG map Π : R 3N → R 3M determines the position vectors of M CG particles (or beads)q = (q 1 , ...,q M ) ∈ R 3M . Note that M < N but still M >> 1. From now on, we will use the bar "¯" notation for objects related to the CG model. The probability that the CG system has configurationq is given bȳ The quantityŪ is the M−body potential of mean force (PMF). The corresponding conservative force is thusF PMF (q) = −∇Ū PMF (q). While the above formula is exact, the accurate calculation of the PMF for a realistic model of a complex molecular system is a challenging task. This challenge is due to the high dimensionality of the integral, and the M vector as well. Therefore, we develop methods to find an effective potential in a parameterized form, U e f f (q; θ), θ ∈ Θ, which best approximates the PMF, i.e.,: Moreover, we assume that the evolution of the particles is described by a continuous time process {X t } t≥0 = (q t , p t ) t≥0 , with path space distribution P [0,t] , and invariant measure the Gibbs probability, Equation (1). The approximate coarse space dynamics we adopt are described by a Markov process {X t } t≥0 in R m with a parametric path space distributionQ θ [0,t] , θ ∈Θ.

Information Theoretic Variational Inference: The Relative Entropy
Here we adopt the information theoretic variational inference approach as the methodology to derive optimal approximate coarse models bot at equilibrium and dynamical regimes. This variational approach encompasses the minimization of the Relative Entropy (RE) between probability measures.
The relative entropy (Kullback-Leibler divergence), [15], of two probability measures P(dω) and Q(dω) on a common measurable space (Ω, B) is given by provided P Q, i.e., P is absolutely continuous with respect to Q, and R (P|Q) = +∞ otherwise. The functional R (P|Q) defines a pseudo-distance between two measures as R (P|Q) ≥ 0 and R (P|Q) = 0 if and only if P = Q, P-a.s. In the case these probability measures have corresponding probability densities p(ω) and q(ω) Equation (2) where Π * µ denotes the push-forward of the microscopic measure µ. When the system is at equilibrium the optimization principle is When considering continuous time observations, in work [14] we prove that the path-space minimization principle (3) reduces to the path-space force matching (PSFM). In stationary dynamics the Relative Entropy Rate (RER) is defined by where P and Q denote the corresponding stationary processes.
For discrete time observations (a) from the microscopic Gibbs density D n t = {X 1 , . . . , X n t } , or (b) the path-space distribution P [0,T] at dynamical regimes, D n s ,n t = {X k 1 , . . . , X k n t } n s k=1 , consideringthe estimator for the RE, the optimal parameter estimate is given by [14], . p andq θ are the microscopic and coarse space transition probability densities of the Markov processes X t andX t , respectively. Note that if the time series are stationary, the RER optimization iŝ

Relative Entropy and Force-Matching
The Force-Matching (FM) method estimates an effective CG potential that reproduces best the potential at the reference all-atom system, by solving the optimization problem i.e., we minimize the average difference between the atomistic F(q) forces and the corresponding CG forcesF(Π(q); θ), where · denotes the Euclidean norm in R 3M and E µ [·] denotes the expectation with respect to the probability Gibbs measure µ(dq). The minimization problem for the discrete observations, and the linear parametric representation of the forceF(·; θ) = G(·)θ, is The path-space force matching optimization problem is, [14], for which the discrete optimization problem becomeŝ

Relative Entropy and Structural-Based Methods
The structural-based methods, such as the direct inverse Boltzmann (DBI), iterative Boltzmann inversion (IBI), and inverse Monte Carlo (IMC) methods, use the pair correlation function g (2) (q) and the assumption that the interactions depend only on the distance R between particles, that is g (2) (q) =ḡ(R).ḡ(R) is called the radial distribution function. Thus the CG effective interaction is given byŪ that is the average density of finding the CG particle 1 at a distance R from the particle 2.
The structural methods are thus based on the pair correlation function between CG particles, in contrast to the RE which is considering the joint probability distribution of the CG particles. In case the PMF can be exactly described by pair functions then the RE and structural methods coincide.

Results and Discussion
In the current section, we present the application of the variational inference methods, the RE, and the FM, for representative molecular systems: A simple fluid (bulk methane), a system of water molecules, and a polyethylene melt, at equilibrium conditions. We moreover study the bulk methane system out-of equilibrium, specifically we apply the PSFM at a transient time regime.

Bulk Methane
The molecular system consists of 666 methane molecules at temperature T = 100 K, and T = 80K. We employed molecular dynamics simulations to generate the microscopic space data based on which we applied the inference methods. Details on the atomistic simulations are given in [16]. For the coarse-grained representation of methane we have used a one-site representation with a pair potential. The pair potentials we have tested are (a) expansions with linear and cubic B-splines (with 48 parameters) and (b) the Lennard-Jones parametric form (with two parameters).
A comparison of the FM, and IBI methods is depicted in Figure 1 [17]. The result depicts slight difference of the FM method to the RE and IBI. Figure 2 presents the performance of the FM and PSFM methods at equilibrium verifying the validity of the PSFM and its reduction to the FM method. A study at transient time regimes is presented in work [16].

Water
The model system consists of 1192 molecules at ambient conditions (T = 300 K, P = 1 atm). Details on the atomistic simulations are given in [17]. For the coarse-grained representation of H 2 O, we have also used a one-site representation with a pair potential. Figure 3a depicts the resulting pair potential obtained with the RE and FM methods. The RE and FM potentials have a very similar structure with two minima, though the actual values of the potential are different. Figure 3b shows that the pair correlation function derived by CG simulations with the RE potential and the target one (from atomistic simulations) are very close. That is, the RE potential can reproduce with sufficient accuracy the pair correlation.

Polyethylene Melt
The model system consists of 96 polyethylene chains of 99 monomer units (−CH 2 − ), i.e., N = 9504. The simulations were performed under NVT conditions at temperature T = 450 K. For the coarse-grained representation we consider a 3:1 mapping representation, i.e., three monomer units form one CG particle. With this application we study the effect of the size of the available observations (system configurations), and quantify uncertainties due to the small number of observations. Figure 4 depicts the derived FM potential for a large set of observations. In addition, shows the 95% confidence set obtained with a statistical analysis resampling technique (bootstrap method) of a small observations set, which captures the large-set outcome.

Conclusions
In the current work we presented a short review of the information theoretic variational inference method for coarse-graining molecular systems, for systems at-and out-of-equilibrium. Moreover, we presented the connection to the Force Matching method and its relation to the structural based methods. The application of all methods to the methane system shows that the RE and IBI methods give similar results while the FM differs slightly. While for the water model the RE and FM resulting potentials differ substantially, which is not surprising as we know that the two methods are equivalent only asymptotically. We verify the validity of the PSFM, i.e., deriving the piar potential using time-series data, since it produces the same results to the FM, i.e., with identically distributed data. Finally, with the application to the polyethylene system, we show that when the availability of observations is limited the bootstrapping method can provide reliable confidence intervals to the pair potential.