Information Geometry of Nonlinear Stochastic Systems

We elucidate the effect of different deterministic nonlinear forces on geometric structure of stochastic processes by investigating the transient relaxation of initial PDFs of a stochastic variable x under forces proportional to -xn (n=3,5,7) and different strength D of δ-correlated stochastic noise. We identify the three main stages consisting of nondiffusive evolution, quasi-linear Gaussian evolution and settling into stationary PDFs. The strength of stochastic noise is shown to play a crucial role in determining these timescales as well as the peak amplitude and width of PDFs. From time-evolution of PDFs, we compute the rate of information change for a given initial PDF and uniquely determine the information length L(t) as a function of time that represents the number of different statistical states that a system evolves through in time. We identify a robust geodesic (where the information changes at a constant rate) in the initial stage, and map out geometric structure of an attractor as L(t→∞)∝μm, where μ is the position of an initial Gaussian PDF. The scaling exponent m increases with n, and also varies with D (although to a lesser extent). Our results highlight ubiquitous power-laws and multi-scalings of information geometry due to nonlinear interaction.


Introduction
There is increasing interest in a metric on probability from theoretical and practical considerations, with different metrics proposed depending on the question of interest (e.g., [1][2][3][4][5][6][7][8][9][10] and further references therein). Theoretically, the assignment of an appropriate metric to probability enables us to mathematically quantify the difference among different Probability Density Functions (PDFs), providing a beautiful conceptual link between a stochastic process and geometry. At a practical level, it can be utilized for optimising various desired outcomes. For instance, the Wasserstein metric has been extensively studied for the optimal transport problem to minimize transport cost, typically taken to increase quadratically with distance between two locations [1]. The Fisher (also called Fisher-Rao) metric has recently been used for optimization, including the minimization of entropy production [11], parameter estimation [12], controlling population [13], understanding the arrow of time [14], and analysing the convexity of the relative entropy [15].
However, compared with the Wasserstein metric, whose application has established itself as a branch of applied mathematics, the geometric structure associated with the information change in the Fisher metric and its utility have been explored much less. Unlike the Wasserstein distance, the Fisher metric yields a hyperbolic geometry in the upper half-plane (e.g., [9,10]) where the distance is measured in units of the width of the PDF. That is, the distance in the Fisher metric is dimensionless and represents the number of different statistical states. Consequently, for a Gaussian PDF, statistically distinguishable states are determined by the standard deviation; two PDFs that have the same standard deviation and differ in peak positions by less than one standard deviation are statistically indistinguishable (e.g., [11,[16][17][18][19][20][21][22]).
By extending the Fisher metric to time-dependent problems, we recently introduced a system-independent way of quantifying information change associated with time-evolution of PDFs [13,[23][24][25][26][27][28][29][30]. (Note that we use information for statistically different states, refraining from the debate on the exact definition of information, e.g., [5,31].) The key idea is to define an infinitesimal distance at any time by comparing two PDFs at adjacent times and sum these distances. The total distance gives us the number of statistically different states that a system passes through in time, and is called information length L. While the detailed derivation of L is given in [5,13,16,[23][24][25][26][27][28][29][30], it is useful to highlight that L is a measure of the total elapsed time in units of a dynamical timescale for information change. To show this, we define the dynamical time τ(t) as follows: Here, τ(t) is the characteristic timescale over which the information changes. Having units of time, τ(t) quantifies the correlation time of a PDF. Alternatively, 1/τ quantifies the (average) rate of change of information in time. A particular path which gives a constant valued E is a geodesic along which the information propagates at the same speed [13]. L(t) is then defined by measuring the total elapsed time t in units of τ as: L(t) is a Lagrangian quantity (unlike entropy or relative entropy), uniquely defined as a function of time t for a given initial PDF, and represents the total number of statistically distinguishable states that a system evolves through. It thus provides a very convenient methodology for measuring the distance between p(x, t) and p(x, 0) continuously in time for a given p(x, 0). One of the utilities of L(t) is to quantify the proximity of any initial PDF to a final attractor of a dynamical system. We note that for a linear process, we can express the information length in terms of a metric tensor g ij using the two parameters β = 1/2 (x − x ) 2 and x (e.g., [13]). However, even in the case where the control parameters are not known, we can still define L as long as we can compute time-dependent PDFs (e.g., from data [24]). Traditionally, concepts such as bifurcations, Lyapunov exponent or a distance to an equilibrium point are commonly used to understand dynamical systems (e.g., see [32]). An intriguing question arises as to how to define a distance between a point, say x, and an equilibrium which is not a point but a limit cycle, or chaotic attractor. One interesting possibility is to consider a narrow initial PDF around x and to measure the total information length L(t → ∞) = L ∞ as the initial PDF evolves toward the equilibrium PDF. L ∞ offers a Lagrangian distance that depends on the trajectory/history of the system (e.g., time-dependent PDF), being uniquely defined as a function of time for a given initial PDF. This enables us to map out the attractor structure by measuring L ∞ for different locations of a narrow initial PDF. In a chaotic system, L ∞ changes abruptly when a different initial PDF is used, as shown in Figure 4 in [24], where the very spiky curve represents a sensitive dependence of L ∞ on x(t = 0), the location of a very narrow initial PDF. This sensitive dependence of L ∞ on x(t = 0) means that a small change in the initial condition causes a large difference in a path that a system evolves through and thus L ∞ . This is quite similar to the sensitive dependence of the Lyapunov exponent on the initial condition. That is, our L provides a new methodology to test chaos. Furthermore, Figure 4 in [24] shows small L ∞ for unstable points, demonstrating that unstable points are more similar to chaotic attractors.
The purpose of this paper is to consider the effect of different orders of nonlinear interaction on the geometric structure by considering the case when the equilibrium is a stable point [26,33]. The remainder of this paper is organised as follows: Section 2 introduces the basic model. Section 3 derives exact analytic solutions in the absence of stochastic noise, as well as asymptotic scalings for the timescales, peak amplitudes and widths once noise plays a role. Section 4 presents numerical results, and shows how they compare with the analytic scalings. Section 5 summarizes the results.

Model
We consider the following nonlinear Langevin equation: Here, x is a random variable and ξ is a stochastic forcing, which for simplicity can be taken as a short-correlated (or δ-correlated, white-noise) random forcing as follows: where the angular brackets represent the average over ξ, ξ = 0, and D is the strength of the forcing. The parameter γ is a positive constant. n is the order of nonlinearity, which we take to be an odd integer to make x = 0 an attractor, and also preserve a reflectional symmetry (under x → −x) of the system. We note that there are two possible interpretations of Equation (3). Specifically, consider the linear case where ∂V ∂x ∝ x. The first interpretation is to view x as a velocity, in which case the force term would give a frictional force (e.g., Equation (1.2) in [34]) as dv dt = −γv + ξ, with γ as a frictional (damping) constant. The second interpretation is to take the overdamped limit of the coupled equations for x and v by dropping d 2 x/dt 2 (e.g., Equation (3.130) in [34]) to obtain one equation for x from the first line in Equation (3.130). In this case, x represents the position and γ −1 would be the frictional constant, and a harmonic potential would correspond to n = 1.
For a deterministic system with ξ = 0, the solution to Equation (3) is readily obtained as where x 0 is the initial condition x(t = 0). The corresponding Fokker-Planck (FP) equation is [34] ∂ t p(x, t) = ∂ x (γx n p) + D∂ xx p.
We first discuss some analytic limiting cases in Section 3, and then present numerical solutions in Section 4.

Exact Solutions for D = 0
In the absence of the stochastic noise D = 0, the diffusion term in the Fokker-Planck Equation (6) is zero. In this case, the PDF does not have a stationary solution, but continues to change in time due to the linear/nonlinear force. For instance, for n = 1, the PDF's width becomes exponentially narrow in time as the fluctuation (as well as the mean value) decreases exponentially due to γ > 0 (see below). To obtain an analytic solution to Equation (6), we use the method of characteristics. Specifically, we rewrite Equation (6) with D = 0 in terms of the total derivative along the characteristic as dp dt ≡ ∂p ∂t Here, the characteristic is given by which is Equation (3) without the stochastic noise ξ. Thus, the characteristic with the initial condition x(t = 0) = x 0 satisfies Equation (5), which can also be written as Along these characteristics, we rewrite Equation (7) as The integration of Equation (10) then gives us For simplicity, we consider an initial Gaussian PDF localised around µ as Then, we can find p(x, t) at a later time from Equations (9), (11) and (12) as where The maximum amplitude of p occurs at that x where xφ − µ = 0 in Equation (13), with the value The linear case n = 1 can be obtained by taking the limit n → 1 in Equations (13) and (14). One finds that φ = e γt so that where β = β 0 e 2γt . That is, the width of the PDF (∝ β −1/2 ) as well as the peak position decrease exponentially. However, the linear case can be solved analytically even when D = 0 [33], so we are here more interested in the nonlinear cases n = 3, 5, 7. Given p(x, t) in Equation (13), we can further calculate E (t) as follows where we use , and p(x 0 , 0) is the initial Gaussian PDF in Equation (12). Interestingly, Equation (17) is independent of time, with constant τ. That is, without a stochastic noise, the evolution of PDFs follows a geodesic due to the scaling relation satisfied along the characteristic. For example, we can show that, for n = 1, E = 2γ 2 (1 + β 0 µ 2 ) and for n = 3, E = γ 2 µ 4 2β 0 µ 2 + 24 + 99 2β 0 µ 2 + 21 2(β 0 µ 2 ) 2 . From these analyses, we can infer that for sufficiently large β 0 such that β 0 µ n− 1 1, E ∝ β 0 µ 2n to leading order.

Approximate Solutions for D = 0
According to the previous results, the peak amplitudes increase indefinitely in time. The corresponding widths also decrease, and eventually become so narrow that any non-zero D will start to play a significant role, and will start to broaden the PDFs again. However, during this phase of the evolution, the PDF width is still smaller than its mean position, so that we can use a quasi-linear analysis to approximate Equation (3) where (18) satisfies the Ornstein-Uhlenbeck process with the effective damping constant Thus, in this stage, the PDFs remain essentially Gaussian and evolve as where The term e −2G /2β 0 in Equation (22) represents the narrowing of the initial PDF width as discussed in the previous section. However, note that, when 1/2β 0 D/γ e , a PDF can maintain the same width set by the constant value D/γ e at the initial stage; that is, there is no nondiffusive evolution phase. See [33] for such a case with n = 3. For the 1/2β 0 > D/γ e case, we consider here, the transition from the nondiffusive evolution phase to the quasi-linear Gaussian evolution phase occurs when e −2G /2β 0 becomes comparable to D/γ e , leading to the following criterion for the first transition timescale t 1 : using γ e = nγ x n−1 . By using Equation (23) and For large time γ(n − 1)µ n−1 t 1 1, Equation (25) thus yields Note in particular how both exponents are negative for n = 3, 5, 7, so that smaller D and/or µ yield a larger t 1 . Next, for D = 0 and γ(n − 1)µ n−1 t 1, we approximate Equation (22) as using also γ e ∝ t −1 . Thus, the PDF width increases as (Dt) 1/2 , while the maximum amplitude decreases as typical of Brownian motion [34]. Using Equation (26) in either Equation (15) or (28)-t 1 after all is precisely the time when one scaling transitions to the other, so they should agree at that time-we obtain the scalings of the overall maximum amplitude p Max reached throughout the entire evolution as

Final Stationary Distribution
For even greater times, fluctuations become stronger while the mean values decrease. The quasi-linear analysis in the previous section is therefore eventually no longer applicable, and numerical solutions are crucial. The final stationary solution of Equation (6) can however be computed analytically from D∂ x p = −γx n p and becomes Thus, the maximum amplitude while the width of the PDF is proportional to D In the following section, we see how numerical solutions compare with some of these predictions such as the two timescales t 1 and t 2 , as well as explore other aspects of the solutions for which no analytic predictions are possible.

Numerical Solutions
For D = 0, an exact analytic solution to the Fokker-Planck Equation (6) only exists for the linear case n = 1 [33]. For the nonlinear cases n = 3, 5, 7 that are the focus of this paper, we must resort to numerical solutions. Without loss of generality, we set γ = 1. The interval in x is fixed to be [−1, 1], rather than the original [−∞, ∞]. This may seem drastic, but actually involves no real loss of generality either, since any finite interval can always be mapped to [−1, 1] by suitably rescaling x, t and D. As long as the initial condition (and D) are chosen such that p would be negligible outside [−1, 1] anyway, solving the FP equation only on [−1, 1], and with p = 0 boundary conditions is then equivalent in all essentials to the original infinite interval.
The numerical solution is done by second-order finite-differencing in x, with up to M = 2 × 10 6 grid points. The time-stepping is also second-order accurate, with increments as small as ∆t = 10 −6 . Both M and ∆t were varied to check the accuracy of the solutions. In the later stages, when the PDFs are evolving to increasingly broad profiles, M can also be decreased, and ∆t increased, while still preserving accuracy. Regridding the solutions in this way is indeed crucial, since the final adjustment timescale t 2 is extremely long for small D, and could not be reached if M and ∆t were kept fixed at their initial values.
In [33], we considered the initial condition and then compared numerical solutions for n = 3 and D = 10 −3 to 10 −7 with the corresponding analytical solutions for n = 1. That is, the initial peak was extremely narrow, and diffusion was greater than what we consider here. As a result of these choices, the peaks only became broader but never narrower; the nondiffusive regime in Section 3.1 simply does not exist in this case.
In contrast, in this work, we take the initial condition and D = 10 −6 to 10 −9 . By starting with broader peaks and having smaller D, we do have an initial nondiffusive regime here, and are able to observe the narrowing of the peaks predicted by Equations (13) and (14). We begin by fixing the initial peak position µ = 0.65; below, we also consider the range µ = [0.01, 0.75]. Figure 1 shows how the peak amplitudes evolve in time. We can clearly see the three regimes deduced in Section 3: the peaks initially increase, in excellent agreement with Equation (15), then they decrease in agreement with Equation (28), and finally they equilibrate to Equation (31). To more quantitatively compare the overall peak amplitudes p Max and the corresponding times t Max at which they occur with the analytic predictions given by Equations (26) and (29), we note first that D = 10 −6 is clearly not yet sufficiently small for there to be an initial nondiffusive regime at all (for this particular width of the initial condition). Even D = 10 −7 only follows the nondiffusive Equation (15) for a very brief time, not yet long enough to be in the γ(n − 1)µ n−1 t 1 regime where Equations (26) and (29) are expected to apply. Table 1 therefore compares the D = 10 −8 and 10 −9 cases, and uses them to extract scaling exponents of the form D −α . We see that the agreement of p Max with Equation (29) is virtually perfect. The corresponding times t Max are close to the expected scaling t 1 , but are not fully in the asymptotic limit yet. This is hardly surprising, since even for D = 10 −9 the t Max values in Table 1 only have γ(n − 1)µ n−1 t Max ∼ 8 (for all three n values).
To similarly test the scalings that are predicted for µ, further runs were done with fixed D = 10 −9 , and µ = 0.55 and 0.75. As Table 2 shows, the extracted exponents are again in very good (t Max ) and perfect (p Max ) agreement with the predicted asymptotic scalings. The interesting feature that larger initial positions µ have smaller times to reach the maximum peak amplitude is certainly very well reproduced. Note finally that the two µ values used to extract these exponents in Table 2 differ by less than a factor of two even, and would thus certainly not be enough to extract reliable scaling exponents if we did not already have robust analytic predictions. As we also show in more detail below, in principle, it would be possible to have the analytic predictions extend over an arbitrarily large range in µ, but that would require increasingly small D as well, which becomes numerically too time-consuming.
Returning to the µ = 0.65 runs in Figures 1 and 2 shows the expectation values x = xp dx, and how they compare with the nondiffusive result (Equation (5)). The agreement is close to perfect even after the first timescale t 1 is reached. It is only once the very long second timescale t 2 is reached that x approaches 0 exponentially, far more rapidly than the t − 1 n−1 power law scaling in (5).  (15)) that applies in the nondiffusive phase. The bottom row shows the equivalent widths at half-peak, which are inversely proportional to the peak amplitudes. The standard deviation (x − x ) 2 1/2 follows exactly the same pattern as the half-peak widths.  Table 1. The first two rows show the overall peak amplitudes p Max and the times t Max at which they occur, for the results in Figure 1. D = 10 −8 and 10 −9 , and n = 3, 5, 7 as indicated, and all results at µ = 0.65. The row labeled "Exponent" uses the ratios of the two D values to extract scaling exponents of the form D −α . The final row compares these numerically deduced exponents with the analytic predictions from Equations (26) and (29). That is, the exponent α should equal n−1 3n−1 for t Max , and n 3n−1 for p Max .  Table 2. The first two rows are as in Table 1, but now for fixed D = 10 −9 , µ = 0.55 and 0.75, and n = 3, 5, 7 as indicated. The row labeled "Exponent" uses the ratios of the two µ values to extract scaling exponents of the form µ δ . The final row compares these numerically deduced exponents with the analytic predictions from Equations (26) and (29). That is, the exponent δ should equal − 2n(n−1) 3n−1 for t Max , and n(n−1) 3n−1 for p Max .  Skewness is a measure of how asymmetric a PDF is about its peak, with a value of 0 indicating a symmetric peak. Kurtosis measures the flatness of a PDF, especially in comparison with a Gaussian, which has kurtosis = 3. We see that skewness ≈ 0 and kurtosis ≈ 3 is maintained all the way until the final equilibration timescale t 2 is reached, indicating that the PDFs remain largely Gaussian up to this time, as predicted in Section 3.2. For the final equilibrated profiles in Equation (30), the skewness is again 0, whereas the kurtosis has values that depend on n (the precise values can be evaluated analytically in terms of Γ functions, but are not particularly insightful). It is interesting though to note in Figure 3 that both skewness and kurtosis exhibit non-monotonic behavior during the final adjustment process, where the PDFs transition from being largely Gaussian to their final form. Note also how the maxima reached during this transition process are independent of D, and even broadly similar for the different values of n. Figure 4 compares the PDFs at the times when the skewness reaches its maximum (negative) value with the final equilibrated profiles in Equation (30). The entire evolution is then clear: as long as the PDFs are well outside their final profiles, they remain essentially Gaussian, with widths as in Figure 1 and positions as in Figure 2. The skewness and kurtosis start to deviate significantly from their Gaussian values when the PDFs start to reach their final positions, with the maximum skewness as in Figure 4. The final rearrangement for t > t 2 then merely adjusts to the final profiles (30), but with relatively little further movement of the PDFs.  (31), and t is shifted as in Figure 3 to consistently have the correct skewness values. Figure 5 shows the diagnostic quantities E (t) and L(t). As predicted by Equation (17), E remains constant in the initial nondiffusive phase. Once the first timescale t 1 is reached, E decreases as t − 3n−1 n−1 D −1 . Once the second timescale t 2 is reached, E decreases exponentially. (This is not included in Figure 5, however, as E ∝ D 2(n−1) n+1 at that point, and is thus already negligibly small.) The behavior of L is as expected: while E is constant, L increases linearly in time, and, once E starts decreasing, L levels off to its final value L ∞ . Based on the scaling for E , namely that E ∝ µ 2n initially, and the range t 1 for which this applies, we expect at least a lowest-order estimate for L ∞ to be given simply by Figure 6 shows numerically computed results, for the same D = 10 −6 to 10 −9 range considered throughout, and µ = [0.01, 0.75]. At least for sufficiently large µ, we do indeed see clear evidence of power-law behavior, with a negative exponent for D and a positive one for µ. Concentrating on the variation with µ, for n = 3, the slopes vary between 1.8 for D = 10 −6 and 1.6 for D = 10 −9 ; for n = 5, they vary between 2.7 and 2.3; and for n = 7, they vary between 3.5 and 2.9. By comparison, according to Equation (35), the exponent should be n(n+1) 3n−1 , which yields 1.5, 2.1, 2.8 for n = 3, 5, 7, respectively. The agreement is thus quite good, especially when we recall (Tables 1 and 2) that even D = 10 −9 was not quite small enough yet to obtain precise agreement with the asymptotic formula for t 1 , which in turn affects Equation (35) as well.
L ∞ as a function of µ provides a mapping from the physical distance µ (the distance between the peak position µ of an initial PDF and the final equilibrium point 0) to the information length (the total number of statistically different states of a system reaching a final stationary PDF from an initial PDF). From Equation (35), this mapping is linear for a linear force; that is, L ∞ is linearly proportional to the physical distance. This linear relation breaks down for nonlinear forces as L ∞ depends nonlinearly on the physical distance µ. Specifically, for n = 3, L ∞ ∝ µ 1.5 ; for n = 5, L ∞ ∝ µ 2.1 ; and, for n = 7, L ∞ ∝ µ 2.8 . Thus, we can envision that nonlinear forces affect the information geometry, changing a linear (flat) coordinate to a power-law (curved) coordinate. This is reminiscent of gravity changing a flat to a curved space-time. Furthermore, interestingly, Equation (35) shows that L ∞ is independent of D for the linear force, manifesting the information geometry as independent of the resolution (set by D). In contrast, L ∞ decreases with n = 3, 5, 7 as D −1/4 , D −2/7 , D −3/10 , suggesting that the information geometry is fractal, depending on the resolution. This would be equivalent to the I theorem of [14].
Finally, what about the small µ limit in Figure 6, which clearly does not follow the expected scaling (35)? The explanation is that the initial position µ is already within the O(D 1 n+1 ) width of the final distribution in Equation (30). In this limit, the behavior is different, and the peak merely spreads out, resulting in a small and relatively uniform L ∞ . For any given µ, D must therefore satisfy D µ n+1 to be in the regime where Equation (35) is expected to apply. That is, in principle, it would be possible to have Equation (35) apply over several orders of magnitude in µ, but D would have to be far smaller than is numerically feasible.

Conclusions
The motivation of this research was to elucidate the link between force and geometry in a stochastic system. To this end, we investigated the transient relaxation of initial PDFs of a stochastic variable x under different nonlinear forces proportional to −x n (n = 3, 5, 7) and different strength D of δ-correlated stochastic noise. We identified the three main stages consisting of nondiffusive evolution, quasi-linear Gaussian evolution and settling into stationary PDFs. The strength of the stochastic noise is shown to play a crucial role in determining these timescales t 1 ∝ D − n−1 3n−1 and t 2 ∝ D − n−1 n+1 , as well as the peak amplitude and width of PDFs. Note in particular how both timescales have (n-dependent) power-law scalings with D, in sharp contrast with the linear n = 1 case, where the entire evolution occurs on t = O(1) timescales, with no dependence on D.
We further computed the rate of information change, and the information length L(t) representing the number of distinguishable statistical states that the system evolves through in time. We identified a robust geodesic (where the information changes at a constant rate) in the initial nondiffusive stage, and mapped out a geometric structure of an attractor as L ∞ ∝ µ m , where µ is the peak position of the initial Gaussian PDF. For sufficiently small D the exponent m was shown to be n(n+1) 3n−1 , but still exhibits (moderate) variation with D even for values as small as 10 −9 . This suggests that the geometry is curved by the nonlinear interaction, in contrast with the linear geometry of the Ornstein-Uhlenbeck process which has m = 1 and no variation with D. Our results thus highlight ubiquitous power-laws and multi-scalings of information geometry.
This work will form a basis for future investigation of a more general case of the superposition of nonlinear forces with different ns. Multiplicative noise would also be interesting, but the situation becomes somewhat more complicated as a deterministic force could appear as −x n as here, but the noise ξ could appear as ξx m , with m not necessarily the same as n. Fractional noise would also be worth pursuing, although this would be challenging both analytically and numerically. Finally, it will also be interesting to extend our work to analyse a heterogeneous, linear system [35][36][37], and to explore applications of our work to estimators.