Monte Carlo Simulation of Stochastic Differential Equation to Study Information Geometry

Information Geometry is a useful tool to study and compare the solutions of a Stochastic Differential Equations (SDEs) for non-equilibrium systems. As an alternative method to solving the Fokker–Planck equation, we propose a new method to calculate time-dependent probability density functions (PDFs) and to study Information Geometry using Monte Carlo (MC) simulation of SDEs. Specifically, we develop a new MC SDE method to overcome the challenges in calculating a time-dependent PDF and information geometric diagnostics and to speed up simulations by utilizing GPU computing. Using MC SDE simulations, we reproduce Information Geometric scaling relations found from the Fokker–Planck method for the case of a stochastic process with linear and cubic damping terms. We showcase the advantage of MC SDE simulation over FPE solvers by calculating unequal time joint PDFs. For the linear process with a linear damping force, joint PDF is found to be a Gaussian. In contrast, for the cubic process with a cubic damping force, joint PDF exhibits a bimodal structure, even in a stationary state. This suggests a finite memory time induced by a nonlinear force. Furthermore, several power-law scalings in the characteristics of bimodal PDFs are identified and investigated.


Introduction
Stochastic Differential Equations (SDEs) (Equation (5)) are used to model various phenomena in nature, including Brownian motion, asset pricing, population dynamics, COVID-19 spread and interaction [1][2][3][4][5][6], and various other non-equilibrium processes. Due to their stochasticity, SDEs do not have an unique solution, but a distribution of solutions. A Fokker-Planck Equation (FPE) [7] is a Partial Differential Equation (PDE) that describes how the probability density of solutions of a SDE evolves with time.
Comparing solutions of different SDEs can be achieved by looking at different statistics of the solutions like mean and variance. However, when we are interested in large fluctuations and extreme events in the solutions, simple statistics might not suffice. In such cases, quantifying and comparing the time evolution of probability density functions (PDFs) of solutions will provide us with more information [8]. The time evolution of PDFs can be studied and compared through the framework of information geometry [9], wherein PDFs are considered as points on a Riemannian manifold and their time evolution can be considered as a motion on this manifold. In general, in order to have a manifold structure on the probability space in information geometry, we need to define a metric. Several different metrics can be defined on a probability space [10][11][12][13].
Different metrics have different physical and mathematical significance. For example, the Wasserstein metric (also known as the Earth mover's distance) naturally comes up in optimal transport problems [14]; the Ruppeiner metric is based on the geometry of equilibrium thermodynamics [13]. In this work, we use a metric related to the Fisher Information [15], known as the Fisher Information metric [16,17]. (1) Here, p(x; {θ}) denotes a continuous family of PDFs parametrized by parameters {θ}. This metric was used to physically represent the number of statistically distinguishable states [11,18,19]. Note that two Gaussians with same standard deviation but different means are statistically indistinguishable, if the difference in their means are much smaller than their standard deviation. If a time-dependent PDF p(x, t) is considered as a continuous family of PDFs parameterized by a single parameter, time t, the scalar metric can be given by: However, time in classical mechanics is a passive quantity that cannot be changed by an external control. The infinitesimal distance dL on the manifold is then given by dL 2 = g(t)dt 2 . Here, L is the Information Length defined by: The Information Length L represents the dimensionless distance, which measures the total distance traveled on the manifold, or the total number of distinguishable states a system passes through during the course of its evolution. It has previously been used in optimization problems [20]. It was also used to study dynamical systems, thermodynamics, phase transitions, memory effects, and self-organization [21][22][23][24][25][26][27][28].
The gradient of L, lim dt→0 dL/dt ≡ Γ, then represents a velocity on this manifold Note that we use the notation Γ instead of √ g to make it clear that it is a quantity defined for a time-dependent PDF. Γ represents the rate of change of statistically distinguishable states in a time-evolving PDF, and is sometimes referred to in the literature [29][30][31] as the Information Rate. Note that in information theory [32], the term "Information Rate" is used for the rate at which information is passed over the channel [33]. As for the physical significance of Γ, in a non-equilibrium thermodynamic system, Γ is related to the entropy production [30].
Γ has also been used to study causality [29] and abrupt changes in the dynamics of a system [8]. Γ 2 is equivalent to the (symmetric) KL divergence of infinitesimally close PDFs, as shown in Appendix E. It should also be noted that Γ(t) defined by Equation (4) has the dimensions of t −1 , and the time-integral of Γ(t) gives a dimensionless distance L in Equation (3).
Due to the lack of general mathematical techniques to solve SDEs or its associated FPE, analytical study of SDEs using Information Geometry (Γ and L) has been limited to a few special cases [24,25,34,35]. To date, numerical studies have relied on solving the associated FPE [24,34], which has the advantage of generating smooth time-dependent PDFs and information diagnostics, but has the limitations outlined in Table 1. To overcome the limitations of a FPE solver, in this work, we develop a Monte Carlo (MC) method to study time-dependent PDFs and the Information Geometry of SDEs.

Grid-Based FPE Solver MC SDE Simulation
Accuracy Depends on grid size. No well-defined prescription on choosing grid-size.
Depends on number of samples(n) [36]. Typically less accurate for practical sample sizes.

Boundary condition
Requires carefully chosen non-trivial boundary conditions. Cannot handle discontinuous initial conditions such as Dirac delta function.
Requires only an initial distribution as boundary condition.
Here, n 1 , n 2 , . . . , n d are the number of grid points along each dimension.
Scales linearly with dimension d. O(nd). Here, n is the number of samples.

Correlation Study
Cannot study correlations and associated memory effects using FPE.
Can study correlations. See Section 4 for unequal time joint PDF estimates.
The main aims of this paper are twofold. The first aim is to develop a new MC SDE simulation method and validate it by recovering the previous results obtained using the FPE method. The second is to calculate unequal time joint PDFs and investigate the effect of nonlinear forces on PDF form and various power-scaling relations. The remainder of the paper is organized as follows: Section 2.1 gives a brief introduction of the theory of MC SDE simulation. Section 2.2 develops the methods to measure Information Geometry from the simulation. Using this method, we compare a linear and a nonlinear SDE in Section 3. In Section 4, we showcase the measurement of joint PDFs of the same variable but at unequal times, which is not possible using FPE solvers. We then study a type of phase transition in the joint PDF of the nonlinear SDE and numerically verify its theoretically calculated scaling relations. Discussions are found in Section 5.

SDE Simulation
A typical SDE for the variable x in d dimensions has the following form: Here, µ is known as the drift vector (drift coefficient in 1D), D := σ.σ T /2 the diffusion tensor (diffusion coefficient in 1D) and W t is the Weiner process [37]. d W t represents an infinitesimal random noise term, making the equation stochastic. Generalizations to SDEs can be achieved with more general noise terms and higher-order derivative terms, but are not pursued here. The associated FPE describes the time evolution of the PDF p( x, t) of solutions of the SDE.
Instead of numerically solving the FPE, in this work, we use a Monte Carlo (MC) method to estimate p( x, t) by simulating a large number of instances of an SDE. Explicit numerical solution of an SDE involves choosing an initial position x 0 and iteratively updating it to get the value at time t, x t . For the MC simulation, we start with a set of initial positions { x 0 } sampled from the desired initial distribution and numerically solve each of them independently. We can then use the set of samples at time t, { x t }, to compute the desired statistics. We can formally write this iteration step as follows: Here, M denotes an arbitrary update scheme. There are several methods [38] by which we can create this update. Throughout this work, we will be working with autonomous SDEs ( µ and σ do not explicitly depend on time) with diagonal noise (σ is a diagonal matrix). Therefore, we will use a simple update scheme called the Milstein scheme [39]: Here, ∆ W t is a random sample from N d (0, ∆t). Note that the Milstein scheme is only accurate up to O(∆t), and this error can accumulate through the course of the simulation. Therefore, to control the error in the numerical update scheme, we will adaptively choose the time step ∆t for updates by setting a local error tolerance. Local error at time t is defined as the deviation between a single step update made with time step ∆t and a two-step update made with the time step ∆t/2 each.
Here, we omitted the dependence of M on µ and σ for brevity. ∆ W 1 and ∆ W 2 are chosen in such a way that ∆ W 1 + ∆ W 2 = ∆ W t . To satisfy the local error tolerance Tol, we need to choose ∆t, such that err( x t , ∆t) < Tol. The exact prescription on how this choice is made can be found here [40].
Computing Γ requires computing the derivative of P( x t , t) of the numerical solutions. Since derivatives are sensitive to numerical noise, we need accurate estimates of the probability distribution. This is achieved by numerically solving a large number of SDEs (we use at least 2 × 10 7 samples in this work). This is an impractically large number for most computers. Numerically finding 2 × 10 7 solutions, with each requiring around 10,000 time steps (depending on the required accuracy) will require at least 1600 GB of memory, assuming 64-bit floating point values. Additionally, assuming a fast 2 ms per solution, the entire simulation will take around 11 h if carried out serially. To solve these problems, we use GPU computing. With GPU computing [41], we can perform updates on a set of values { x t }, simultaneously using Equation (7) to get { x t+∆t }. { x t } can then be removed from memory after computing the desired statistics, making it memory efficient. As for the runtime, a GPU-based parallel implementation [42] in Python takes around 4 min to simulate 2 × 10 7 samples for 10,000 time steps on a consumer laptop equipped with Nvidia RTX 2080 GPU. See Appendix A for detailed scaling relations of simulation runtime.

Estimating Γ
The form of Γ in Equation (4) makes it unsuitable for numerical computation due to the presence of p(x, t) term in the denominator, which can become zero. We therefore rewrite the equation using the redefinition q(x, t) := p(x, t).
To calculate Γ, we first estimate the PDF using histograms. (It would be more accurate to use kernel density estimators [43], but that is more computationally expensive). The derivative in Equation (10) is approximated as a finite difference and the integral as a Riemann sum. The specific methods and the error estimates are provided in Appendix B. After computing Γ, information length can be computed from the numerical integration of Equation (3).
A major source of error in the estimation of Γ is due to the compact support and overlap of numerically estimated probability densities. (This will also be a source of error with numerical FPE solvers). During the simulation, time steps are chosen adaptively by setting a tolerance for the local numerical error. However, in some regimes, this results in the system evolving too fast, so that there is little or no overlap between the probability densities at adjacent time steps ( Figure 1). Theoretically, these densities ( Figure 1) have the support of the entire real line, and the only error will be caused by approximating the derivative and the integral. However, in practice, we are running the MC simulation with a finite sample size, and computers have finite numerical precision; as such, the estimated densities have compact support. Figure 2 shows how the amount of overlap between the densities affects the error in Γ estimate. There are two main sources of sub-optimal overlap: changes in mean and changes in standard deviation.  The error Γ estimate is lowest when ratio of standard deviation is approximately 0.9 or 1.1 (≈0.9 −1 ). Discretized version of Equation (10) was used for the estimate and PDF was approximated by a histogram with 703 bins. We considered 2 × 10 7 samples for each distribution. dt is chosen to be 1. Estimates were repeated 40 times and mean of the error was taken.
Consider ∆t chosen adaptively to satisfy the local error tolerance. Now assume for some ∆t, we get optimal overlap. If ∆t > ∆t, we can wait for a few steps before estimating the Γ. However, if ∆t < ∆t, we can choose a ∆t as a temporary time step and generate a collection of points { x t+ ∆t } from { x t } using Equation (7), which results in optimal overlap between densities. The local error will still be smaller than the tolerance since ∆t < ∆t. In order to derive the value of ∆t, we use the Milstein update scheme and restrict ourselves to the one-dimensional case for simplicity. First, we look at the change in mean value.
Taking the expectation value of both sides, we get: Here, we use the fact that E[∆W t ] = 0 and E (∆W t ) 2 = ∆t. Note that x t and ∆W t are independent random variables. For X and Y, the independent random variables are In order to achieve optimal overlap (Figure 2 Now to derive the effect on change in standard deviation on ∆t, a similar calculation for Var[x t+∆t ] can be performed, which yields: Now, in order to achieve optimal overlap, we need Note that we have only considered the first and second moments here. It is potentially possible to improve the accuracy of the Γ estimate by considering higher moments. However, this improvement will be marginal, since overlap between two distributions is most affected by its first and second moments.
After calculating both the time steps ∆t mean and ∆t std , we take the minimum of the two to perform the update on {x t } to get {x t+ ∆t }, where ∆t = min ∆t mean , ∆t std . After the estimation, the Γ {x t+ ∆t } is discarded and we continue the simulation with {x t+∆t }. This prevents any significant slowing in the simulation if ∆t ∆t. Note that calculating ∆t mean and ∆t std using the entire set of points {x t } will be computationally expensive. Only a small subset is used to perform this calculation.
After Γ is estimated, we can calculate the information length by approximating the integral as a Riemann sum.
Looking at Figure 3, it can be seen than the percentage error in Γ blows up towards the end of the simulation. This is when the system approaches a stationary state and the probability density stops evolving. The exact Γ reaches zero, whereas the numerical estimate will have a small nonzero error (Figure 4). Even though the percentage error in Γ becomes large, the absolute error remains small (Figure 2 (left)) and will have minimal contribution to the error in Information Length calculation. However, when the initial distribution is 'closer' to the stationary distribution (small x 0 values in Figure 3), the absolute value of Information Length will be small, and the error in Γ estimate will have a more significant contribution to the Information Length calculation.  The Γ of (left) Linear SDE and (right) Cubic SDE for different initial conditions N (x 0 , 10 −10 ), as described in Section 3. Note that towards the end of the range of t, the values will be dominated by errors, as shown in Figure 3. Therefore instead of dropping to zero, they will have a finite nonzero value. For exact solution of Linear SDE, see Appendix D.
It is to be noted that measuring Γ of multi-dimensional problems is a significant computational challenge which requires further investigation. MC SDE simulation is better at handling such problems compared with FPE solvers due to its linear scaling with number of dimensions. However, this comes at the price of accuracy when estimating PDFs of higher dimensional problems. It is still possible to accurately study Γ from marginal PDFs of multi-dimensional problems using MC SDE simulation.
A python implementation of SDE Simulation, along with the Γ measurement used in this work, can be found here [42].

Linear vs. Cubic Statistics
In this section, we will use methods developed in the previous section and study nonlinear damping of the Information Geometry of a stochastic process. We will compare the Ornstein-Uhlenbeck process [44], a model for prototypical linear driven dissipative process, defined by linear SDE: with a cubic SDE defined by the equation: The cubic damping term has been previously used to model CO 2 emissions [45], phase transitions [24], and self-organized shear flows [46].
Throughout this section, unless otherwise stated, we restrict ourselves to the values θ = 1.0 and σ = 0.1. The initial distribution is always N x 0 , 10 −10 , and we run the SDE simulation from t = 0 to t = 100. The data in this section were generated using 2 × 10 7 samples, and PDF was approximated by a histogram with 703 bins, which was chosen using an empirical formula 2.59 3 √ n. The time steps were adaptively chosen by setting local error tolerance at 5 × 10 −4 .
Two of the simplest statistics that can be measured are the mean and the standard deviation of the distribution (Figures 5 and 6). The trends in mean can be readily seen by taking expectation value on both sides of Equations (16) and (17)   The approximate solution of the mean value of Cubic SDE is only valid when the standard deviation of the distribution is much smaller than the mean (Appendix C). Note that in Equation (20), when 1/x 2 0 θt, E[x t ] ≈ (2θt) −0.5 , making the trajectory independent of x 0 . This can be seen in Figure 5 (right), when all the lines merge into one. However, around t = 5, the approximation fails, since the value of the standard deviation (Figure 6 (right)) and the mean ( Figure 5 (right)) becomes comparable. Therefore, the curves fail to follow the same trajectory E[x t ] ≈ (2θt) −0.5 for t > 5.
The trends in standard deviation of the Linear SDE are readily explained by Equation (A14), which does not depend on initial mean position x 0 , but only on the initial standard deviation, the drift coefficient, and the diffusion coefficient. As we write this paper, no exact analytic solution exists for the Cubic SDE. However, an approximate analytic treatment can be found here [47].
We define the asymptotic Information Length (L ∞ ) to be the Information Length it took for the system to reach the stationary state of its PDF.
Analytically, a SDE reaches its stationary state as t → ∞. However, numerically, we see that the probability density stops evolving after a finite time. We can see this from Figure 7, as L becomes a constant. L will still continue to increase slightly due to numerical error, but this contribution will be negligible, as shown in Figure 3. L ∞ measures how many statistically distinguishable states the system passes through to reach its stationary state. From Figure 6, it is evident that, compared to the linear SDE, the PDF of the cubic SDE undergoes a lot more change before reaching its stationary state, for larger values of x 0 . For small values of x 0 , the trend in standard deviation is similar between linear and cubic SDEs, since the initial evolution of the cubic process is Gaussian. This is confirmed in Figure 7, which shows that for large values of x 0 , L ∞ is significantly larger for the cubic SDE for same values of x 0 and, for small x 0 values, the L ∞ values are comparable.

Information Length Scaling
In Figure 7, we have already seen that L ∞ depends on x 0 and has different behavior for linear and cubic SDEs. In [47], by numerically solving the FPE (also analytically for the linear SDE), it was shown that for large values of x 0 , L ∞ shows different scaling behavior for linear and cubic SDEs. For the linear SDE, L ∞ ∼ x 0 . For the cubic SDE, L ∞ ∼ x m 0 , where 1.5 < m < 1.9. We reproduce this result in Figure 8. Note that, since L ∞ is a dimensionless quantity, it is not possible to derive these values theoretically using dimensional analysis. In the absence of general analytic tools to study this property, numerical methods are indispensable.

Unequal Time Joint PDF: Bimodality for Cubic Force
A clear advantage of MC SDE Simulation over solving FPE is the ability to compute the unequal time joint PDFs P(x(t 1 ), x(t 2 )). Unequal time joint PDFs have been previously used for causality and to establish causal relations. In this section, we showcase the ability of MC SDE simulation to estimate P(x(t 1 ), x(t 2 )) and study its properties.

Unequal Time PDFs in the Stationary State
For the linear SDE P(x(t 1 ), x(t 2 )) is always a Gaussian, but with a covariance matrix that depends on t 1 and t 2 . However, for the cubic SDE, we see the emergence of a bimodal distribution (Figure 9 (right)) depending on t 1 and t 2 values. This behavior prevails even after the system has reached its stationary state, but now only depends on the difference ∆t := t 2 − t 1 . The bimodality indicates a finite memory induced by the non-linearity. In order to further study this behavior, we first ensure the system has reached a stationary state by simulating from t = 0 to t = 1500 with a sample size of 8 × 10 7 for the cubic process. After setting t 1 = 1500, t 2 values are chosen from the data generated by further simulating the system for 300 time units. The PDF was approximated by a histogram with 60 × 60 bins. The time steps were adaptively chosen by setting the local error tolerance as 0.01σ. Note that, because of the symmetry of diffusion term and the drift term, the bimodality of P(x(t 1 ), x(t 2 )) for cubic SDE is always symmetric with respect to the line x(t 1 ) = x(t 2 ). Therefore to make the numerical study of the bimodality easier, we restrict our attention to the diagonal of the joint probability density P(x(t 1 ) = x(t 2 )), which is then normalized to integrate to one.
To quantify the bimodality of P(x(t 1 ) = x(t 2 )), we use the following fitting function to perform a nonlinear fit and estimate the parameters.
To motivate this fitting function, note that for ∆t = 0, we get a purely quartic exponential, since it is nothing but the stationary state P s (x) of the cubic SDE. For ∆t → ∞, the correlation between points reaches zero, P(x(t 1 ), x(t 2 )) ∼ P s (x 1 (t 1 ))P s (x 2 (t 2 )), the product of two independent stationary distributions. The quadratic exponential terms are motivated by Gaussian distribution of the noise, and are found to describe the data accurately. There are two quadratic exponential terms, since the SDE is symmetric about the point x = 0 and the bimodal peaks occur symmetrically on opposite sides.
In the fitting function, Parameter a is a measure of the overall height of the density curve. Parameter b represents the ratio of contribution from the quadratic exponential to the quartic exponential, and denotes the degree of bimodality. Parameter e is the location of the peaks. Parameter c is a measure of the overall spread of the density function, while parameter d is a measure of the spread of bimodal peaks. Note that a nonzero value for parameter e and b denotes bimodality in the distribution.
From Figure 10, we can see that the joint PDF becomes bimodal for a range of ∆t values, since parameters b and e have nonzero values. The standard deviation of the quadratic term denoted by parameter d is almost a constant for a fixed σ, whereas the location of the bimodal peak denoted by parameter e changes slightly, but never becomes zero. That means that while transitioning from a bimodal to a unimodal distribution, the bimodal peaks do not continuously merge into one another, but slowly become less prominent and eventually disappear, as inferred from the value of parameter b. The artifacts towards the end of the curves of parameters b and e are due to the fact that it is not possible to consistently fit parameters b and e when there is negligible contribution from the quadratic exponential term as b → 0.   (22), obtained through nonlinear function fitting at each time point t 2 , expressed as a function of ∆t := t 2 − t 1 for different noise levels σ. To ensure the density function p(x, t) for the cubic process reached a stationary state, we chose t 1 = 1500. Note that for parameters b, d and e, the domain of the plots is restricted to the region where parameter b is nonzero. Noise starts dominating outside this region, since there is negligible contribution from the quadratic terms, and nonlinear fit cannot find a consistent unique value for these parameters.
The values of parameter a and c in Figure 10 can be understood by first considering the ∆t → 0 and ∆t → ∞ limits. For ∆t → 0, we have P(x(t 1 ), x(t 2 )) = δ(x(t 1 ) − x(t 2 ))P s (x), where P s (x) is the stationary state. The diagonal of joint PDF then becomes P(x(t 1 ) = x(t 2 )) = P s (x) := a s exp −(x/c s ) 4 . Here, after integrating x, we can find, a s ≈ 0.55/c s for all values of σ, which can be numerically verified. When ∆t → ∞, since there is no correlation, P(x(t 1 ), x(t 2 )) = P s (x(t 1 ))P s (x(t 2 )). The normalized diagonal then becomes P(x(t 1 ) = x(t 2 )) ∼ exp −2(x/c s ) 4 . The normalization factor can be derived by integrating out x. Therefore, when ∆t → ∞, we have a = 2 1/4 a s and c = c s /2 1/4 , which agrees with numerical results. For intermediate ∆t values, there will be correlation, and the behavior cannot be easily explained.
The trends in Figure 10 for different σ values can be explained by looking at scaling relations. For the cubic equation, we have dx t = −θx 3 t dt + σdW t . Looking at the individual terms, we can infer the dimensions, θ ∼ x −2 t −1 and σ ∼ x 1 t −1/2 . Therefore, we expect the following scaling relations: x ∼ θ −1/2 σ 1/2 (24) These relations are numerically verified in Figure 11 for a fixed value of θ, for parameters a, c, d, e, and the relationship between noise and ∆t, corresponding to peak value of parameter b. Note that a ∼ σ 0.5 in Figure 11, because parameter a is a normalization constant. ap(x)dx = 1 =⇒ a ∼ x −1 ∼ θ 1/2 σ −1/2 . The peak value of parameter b seems to be a constant with the value of 1/3. It cannot be derived from simple scaling arguments alone; further investigation is required to understand its origin.

Evolution of Bimodality in the Non-Stationary State
In this section, we will look at how the bimodality in the joint PDF evolves qualitatively before p(x, t) has reached a stationary state. Since bimodality occurs only for the cubic process, the following results are for the SDE dx t = −1.0x 3 t dt + 0.1dW t . To this end, we follow the same prescription from Section 4, but with a slightly modified version of the fitting function (Equation (22)).
This modification is made since for some t 1 and t 2 values, there is no contribution from the quartic exponential term, unlike in the case of stationary state, where there is always a quartic exponential contribution to the distribution.
The nonstationary phase exhibits ( Figure 12) rich behavior, which asymptotically transitions to the stationary state behavior as t 1 becomes large. For small t 1 values, Parameter a Parameter b, since P(x(t 1 ), x(t 2 )) has very little contribution from the quartic exponential term. This is because the initial distribution is a Gaussian distribution, and it is slowly evolving towards the quartic exponential distribution in the stationary state. For larger t 1 values, Parameter a starts dominating, since P(x(t 1 ), x(t 2 )) has predominant contribution from quartic exponential term, as expected of the system reaching its quartic exponential stationary state. For some intermediate t 1 values, depending on ∆t values, we see an interesting behavior where P(x(t 1 ), x(t 2 )) becomes purely a quartic exponential (Parameter b = 0) twice before becoming a mixture of quadratic and quartic exponential terms. Further investigation into this behavior is not undertaken at this time, and only serves to demonstrate the potential capabilities of GPU-accelerated MC SDE simulation for future work.

Discussion
In this work, we developed a method for fast and accurate study of the Information Geometry of SDEs using Monte Carlo simulation. We identified the computational challenges and overcame them by using GPU computing. Specifically, the major limitation with MC SDE simulation in the estimation of Γ was the sub-optimal overlap of PDFs at subsequent time points. We solved this problem by developing an interpolation method to compute PDFs with optimal overlap.
As an application of the new method, we compared the Information Geometry of SDEs with a linear and a cubic damping force. We were able to reproduce the analytic results for the linear SDE and the previous numerical results [34] for cubic SDE obtained using FPE solvers. This was particularly true of large values of x 0 , L ∞ ∼ x 0 for the linear case and L ∞ ∼ x m 0 for the cubic case, where m = 1.88 when θ = 1.0 and σ = 0.1. We further showcased the advantage of MC SDE simulation over FPE solver by computing the joint PDF P(x(t 1 ), x(t 2 )). Unlike the linear SDE, the cubic SDE led to an interesting bimodal PDF P(x(t 1 ), x(t 2 )), which was observed even after reaching a stationary state. After reaching the stationary state P(x(t 1 ), x(t 2 )) only depends on ∆t = t 2 − t 1 . In the stationary state, we further studied the bimodality by quantifying it and looking at the power-law scaling relations with respect to noise levels σ and provided theoretical scaling arguments. The maximum value of the ratio of quadratic to quartic contribution (Parameter b) is found to be a constant 1/3 irrespective of the noise levels, which requires further study. Finally, we qualitatively looked at P(x(t 1 ), x(t 2 )) in the nonstationary state. The MC SDE simulation can be an important tool for further studying this behavior.
It is important to note that the methods that we developed here for one stochastic variable are general, and can be extended for more than one variable, as well as for investigating different metrics or thermodynamic quantities. These will be addressed in future work. Furthermore, it will be of interest to investigate fast implementations of Kernel Density Estimators [48,49] which will improve the accuracy of joint PDF estimates. We note that in numerical experiments, it was seen that compared with histograms, using Kernel Density Estimators to estimate PDFs provided 2-5 times reduction in error with identical PDFs, albeit with a performance trade-off.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author, A.A.T., upon reasonable request.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Runtime Scaling
We use the linear O-U process dx t = −θx t dt + σdW t to study the runtime scaling relations of MC SDE simulation. Simulations start with n samples from an initial distribution N (x 0 , 10 −10 ). The simulation were run for 100 time units on a workstation with AMD EPYC 7451 24-Core CPU and Nvidia Titan RTX GPU, using the implementation here [42]. Milstein scheme was used with adaptive time steps with local error tolerance Tol. Unless otherwise stated, the parameters will have the values: x 0 = 10, σ = 0.1, θ = 0.1, Tol = 5 × 10 −4 and n = 2 × 10 6 . Note that computation only involves the MC SDE simulation. Calculating the statistics and probability density will require additional computation. This might increase the overall computational time, but the value of the power in the scaling relationship will remain the same.
For MC SDE simulation, as seen in Figure A1, initially the runtime weakly depends on n since computations are being done in parallel. But when n is significantly larger than the number of parallel processors, the computations need to be done in batches and the runtime scales linearly. Without adaptive time stepping, numerical error depends on the choice of step size ∆t. The runtime is expected to scale as ∆t −1 . But by setting a local error tolerance we see a better scaling Tol −0.67 as seen in Figure A1.
From Figure A2, the runtime of MC SDE simulation scales as x 0.53 0 with respect to initial position x 0 . This is better compared to fixed grid-size FPE solvers where at least an x 1 0 scaling is expected. We also see better than linear runtime scaling with respect to noise levels. Runtime ∼ D 0.34 ∼ σ 0.58 .
A detailed analysis of the problem can be found here [34].