Monte Carlo Simulation of a Modiﬁed Chi Distribution Considering Asymmetry in the Generating Functions: Application to the Study of Health-Related Variables

: Random variables in biology, social and health sciences commonly follow skewed distributions. Many of these variables can be represented by exGaussian functions; however, in practice, they are sometimes considered as Gaussian functions when statistical analysis is carried out. The asymmetry can play a fundamental role which can not be captured by central tendency estimators such as the mean. By means of Monte Carlo simulations, the effect of a small asymmetry in the generating functions of the chi distribution is studied. To this end, the k generating functions are taken as exGaussian functions. The limits of this approximation are tested numerically for the practical case of three health-related variables: one physical (body mass index) and two cognitive (verbal ﬂuency and short-term memory). This work is in line with our previous works on a physics-inspired mathematical model to represent the reaction times of a group of individuals.


Introduction
The chi and chi-squared distributions are well-known continuous probability distributions widely used in Applied Statistics [1][2][3][4][5]. The fact that they can be generated by a set of Gaussian-distributed random variables makes them amenable to simulations. We devoted a previous work to study the percentile ratios in a chi distribution [6].
A chi distribution of k = 3 degrees of freedom is found in physics to model the velocities of the independent particles of an ideal gas in thermodynamic equilibrium. Similarly, a chi-squared distribution models the energies of the particles in the same physical system. Another typical case from physics is the Rayleigh distribution (chi of k = 2 degrees of freedom).
In one of our previous works [7], we found a new interesting application of the chi distribution of k = 3 degrees of freedom, that is, of the Maxwell-Boltzmann (MB) distribution. In reference [8] we proved that the reaction times (RTs) of children responding independently to visual stimuli in a short time (hundreds of milliseconds) and without exchanging information are correlated. We interpreted this fact as an experimental evidence for the existence of a system of individuals (or collective). In order to gain insights into this correlation, we developed a physics-inspired mathematical model in reference [7] to represent these correlations. In fact, we could elucidate a correspondence between a system of particles and a group of correlated individuals.
We are rather interested in the conceptual modelling of different situations using the chi distribution. In this respect, in our recent work we have been studying the limits within which the chi distribution can still represent well probability distributions originated from generating functions which are not necessarily Gaussians of equal variances. For example, in reference [9], we studied the limits of the chi modelling for the case of unequal variances in the generating Gaussians. We also proposed a discrete model and an ansatz to calculate the only parameter of this distribution as a function of the unequal variances of the generating Gaussian.
In line with our previous works [7,9], we have extended here the simulation study of the chi distribution for the case of asymmetric generating functions. In this respect, the exGaussian function is a simple, flexible and intuitive function that can be used to represent a skewed distribution as it results from the convolution between the Gaussian and exponential decay functions. The convolution between two functions can be easily simulated as the sum of the respective randomly generated variables. Two practical examples which can be represented by exGaussians are the reaction time distributions in Experimental Psychology [10][11][12][13][14][15][16][17] or the peaks in Chromatography [18].
In this paper we will carry out Monte Carlo simulations to study the distribution Z that originates from combining k generating functions with certain asymmetry (Z j ), as , and will evaluate the result by means of a fit to a chi distribution. The level of asymmetry is considered by using exGaussians as generating functions. Our aim is to explore the level of asymmetry for which the fit to chi distribution can be still considered reasonably good for practical applications. Our approach is useful to model multiple situations in health and social sciences where random variables commonly follow asymmetrical distributions. In this respect, an example involving health-related variables is also included in this work.

Generalities on the Chi Distribution
The chi distribution is a continuous probability distribution of a random variable defined as where each of the X j , j = 1, . . . , k, is a Gaussian-distributed independent random variable. Each one of the k variables X j follows a Gaussian distribution with mean zero and variance to achieve unity. In Ref. [9] we analysed the case of different values of the variance for each Gaussian component j. In the present paper we propose to study the case in which the X j components deviate from exact Gaussianity, instead exhibiting a certain degree of asymmetry. We propose to model the deviation of each component from pure Gaussianity, considering them as exGaussian distributions. The exGaussian distribution [10] is given by where erfc is the complementary error function. The above f (x; µ, σ, τ) is the result of convoluting the pure Gaussian with the exponential distribution where µ and σ are the mean and standard deviation of the Gaussian component and τ the decay constant of the exponential. Let us recall that, if a random variable X is distributed according to (3), and a random variable Y is distributed according to (4), then the sum X + Y is distributed according to the exGaussian (2). In this way, the parameters µ, σ are not the mean and the standard deviation of the exGaussian distribution (2), but those of its Gaussian component (3). Thus the parameter τ is a measure of the skewness of the exGaussian distribution, i.e., of its deviation from pure Gaussianity. The value τ = 0 corresponds to a pure symmetric Gaussian [9]. On the other hand, the probability density function corresponding to (1) is given by In the case when all the variances of the k generating Gaussians take a value different from one, then the above f (x; k) generalises to [9] where B is related to the variance of distribution (6). The chi distribution as stated in (1) describes a k-dimensional ideal gas of free, independent particles. In this latter case, the k variances are all equal. The cumulative distribution function corresponding to (6) is given by where Γ(a, b) is the upper incomplete gamma function [19]. A particularly interesting case of the above appears in the statistical mechanics of ideal gases [20]. This is the case of a chi distribution with k = 3 degrees of freedom. Then the random variables X j are the three components v x , v y , v z of the velocities of the particles. These components are Gaussian-distributed, and their modulus (v 2 x + v 2 y + v 2 z ) 1/2 is distributed according to (6). This special case, called the Maxwell-Boltzmann distribution [21,22], is such that all three component distributions are centered around v j = 0, and the three variances are all equal (and proportional to the temperature of the gas). A k-dimensional ideal gas would be represented by (6).

Monte Carlo Simulations for Non-Gaussian Generating Distributions Z j
In this paper we will perform Monte Carlo simulations to generate one random variable Z obtained as Z = ∑ k j=1 Z 2 j . Each one of the Z j is a random variable whose probability density function resembles a Gaussian but however has some degree of asymmetry γ = 2τ 3 /(σ 2 + τ 2 ) 3/2 (for the exGaussian [10]). In this work we considered γ < 1.7 (see Figure 1, and Tables 1 and 2). This asymmetry will be implemented considering distributions such as the exGaussians presented in (2). As we have stated above, a random variable following an exGaussian can be simulated by summing a random variable following a Gaussian distribution and another random variable following an exponential decay distribution. We will first simulate k generating exGaussians, each one with a vanishing value of the mean and standard deviations σ j all equal to one. A total of 10 6 random numbers were generated to obtain the probability distribution of the variable Z. The generating exGaussian random variables (Z j ) will be chosen to have different levels of asymmetry. All fittings are performed by using the non-linear fitting algorithm of Levenberg-Marquardt [23,24]. We used the FORTRAN 90 programming language to make all calculations. The machine epsilon is 2.220446 × 10 −16 for the "double precision" real type. The same methodology as in our previous article [9] to study the modified chi distribution for the case of unequal variances in the generating Gaussians was followed. The cases of k = 3 and k = 5 degrees of freedom are developed hereafter in a general way.

Measurement of Goodness of Fit between Data and Models
Statistical analysis was performed to measure the quality of the proposed model with respect to the data, including both simulations and real health-related variables. The goodness of fit of the model with respect to the data was first studied by the coefficient of determination R 2 . It quantifies the percentage of data variance that can be explained by the model [25] with values in the range [0, 1] representing from a null fit to a perfect fit.
In addition, a non-parametric test was assessed to quantify the equality between the two continuous probability distributions under comparison in each case: the ex-Gaussians, and the distributions for the simulated or real data. Kolmogorov-Smirnov (KS) distance is the maximum vertical distance between the cumulative density functions (CDFs) of the simulated/real data, and the model [26,27]. This statistic is sensitive to differences in both location and shape of the CDFs [28,29]. We also checked the associated p-value to check whether both distributions can be considered to follow the same distribution (i.e., the null hypothesis is true).
Finally, quantile-quantile (Q-Q) plots are also depicted to study the goodness of fit between data and models [30]. Each point of the plot (x, y) corresponds to one of the quantiles of the first distribution (simulated/real data) which is compared against the same quantile of the second distribution (the model). Thus, points in the Q-Q plot lie approximately on the line y = x when data follow the same distribution. We used these probability plots to confirm that both probability distributions (simulated/real data and the model) had good fitting agreement. Figure 1 shows the probability densities of the exGaussian random variables (Z j ) obtained for four different values of the parameter τ and therefore of the asymmetry γ in the generating exGaussians. The coefficient of determination (R 2 ) of a Gaussian fit (red solid line) has also been included. The larger the value of γ the lower the values of R 2 , as expected. For the sake of clarity, quantile-quantile plots of these probability densities are also shown in Figure 2. The good quality of the fittings can be observed since the points lie approximately in a line. This is also remarked with the low values of Kolmogorov-Smirnov distances, and the p-values higher than 0.5 which show strong acceptance of the null hypothesis: data follow the same distribution.  Figure 3 includes the asymmetry of the generating exGaussian random variables (Z j ) as a function of τ for µ = 0 (mean value) and σ = 1 (standard deviation). It can be seen that almost a constant value of the asymmetry γ is reached when τ increases. Values of asymmetry in the linear region of this curve were chosen for this work (0.18 < γ < 1.67). A linear fit for 0.18 < γ < 1.67 yields a coefficient of determination of 0.97. The resulting distributions of the variable Z for three and five generating exGaussians, and for different values of the asymmetry were fitted by using chi distributions of k = 3 and k = 5 degrees of freedom, respectively. All the generating exGaussians were centered at and divided by the corresponding mode in order to standardise the resulting distribution. The mean asymmetry is calculated over the asymmetry values within each of the sets shown in Tables 1 and 2. The values of the coefficient of determination R 2 are represented in Figure 4 versus the mean asymmetry for several increasing values of asymmetry. A total of 10 6 random numbers were used to obtain the probability distributions to which chi distributions are fitted. Table 1. Results for different levels of asymmetry when considering k = 3 generating ex-Gaussians. The columns in order show: the set number, the values of τ for each of the generating exGaussians (τ 1 , τ 2 , and τ 3 ), the corresponding percentage difference between the smallest and the largest value of τ (e τ ), the mean asymmetry among the three generating exGaussians ( γ ), the calculated parameter (B calc ), the fitted parameter (B fit ), the mean coefficient of determination ( R 2 ), and the mean percentage difference between B calc and B fit over the values within the set ( e B ).  Table 2. Results for different levels of asymmetry when considering k = 5 generating exGaussians. The columns in order show: the set number, the values of τ for each of the generating exGaussians (τ 1 , τ 2 , τ 3 , τ 4 , and τ 3 ), the corresponding percentage difference between the smallest and the largest value of τ (e τ ), the mean asymmetry among the five generating exGaussians ( γ ), the calculated parameter B calc , the fitted parameter (B fit ), the mean coefficient of determination ( R 2 ), and the mean percentage difference between B calc and B fit over the values within the set ( e B ). A change in the τ values of the generating exGaussians leads to a change in their variances (S 2 ) as they depend on this parameter as S 2 = σ 2 + τ 2 [10]. In Figure 5, the values of B f it (the fitted B parameter of the chi distribution (6)) are compared with the B value as calculated from B calc = [(S 1 + S 2 + S 3 )/3] 2 − γ 2 . This expression is an extended version of the ansatz defined in our previous reference [9], B = [(S 1 + S 2 + S 3 )/3] 2 , but for the case when a small asymmetry is present. It should be noticed that the exGaussian parameters involved in the calculation of B calc (i.e., σ and τ) should be divided by the corresponding mode of each generating Gaussian.

Set
The results shown in Figures 4 and 5 are summarised in Tables 1 and 2, respectively. For very low values of asymmetry (<0.8), the difference between B calc and B f it (i.e., the error e B ) remains very small (<6%) and R 2 is reasonably good (>0.97). However, for asymmetry values larger than 1.2 the results for e B are higher and R 2 lower, especially for k = 5, where R 2 is below 0.9 and e B higher than 18%.   In order to illustrate the presented work with real data, three health-related variables were chosen from the seventh wave of SHARE (Survey of Health, Ageing and Retirement in Europe) (released 17 December 2020) [31][32][33]. The three variables in this example were: one related to physical health (body mass index-BMI), and two related to cognitive frailty (verbal fluency, measured as the number of animals named within a minute; and shortterm memory, measured as the number of words the participant was able to repeat from a 10-word list). We considered a sample formed by 1503 participants from several European countries. We chose those three variables to illustrate results since they were numeric and not categorical, in which case the proposed representation may be compromised.
The empirical probability distributions of these variables were fitted by exGaussian functions (Figure 6A-C) with a good coefficient of determination. For the sake of clarity, quantile-quantile plots of these probability densities are also shown in Figure 7, where it is depicted that both probability distributions match to a straight line. In addition, the good quality of the fittings can be visually checked with the small Kolmogorov-Smirnov distance values, while p-values are higher than 0.05, showing no-significant differences. Figure 6D shows the probability distribution of the random variable Z = ∑ 3 j=1 (Z j ) 2 1/2 and the corresponding fit to a MB distribution (see (6) for k = 3). The variables Z j stand for body mass index, verbal fluency, and short-term memory, which were standardised [7,9]. The parameters, uncertainties, and coefficients of determination (R 2 ) from the exGaussian and MB fittings are included in Table 3. This new variable, Z, combines the values of the three health-related variables in a unique value for each individual, which can be considered as a new index able to characterise each individual in the sample. The MB-like distribution in Figure 6D models the probability distribution of Z in the sample. Thus, the entire sample can be modelled by only one parameter, namely, the parameter B of the MB distribution. We illustrated this methodology for three variables but it can be extended for any number of k variables. The methodology developed in this work can have potential applications in diverse areas, for instance, to model health-related [34] and psychological variables [16,35].  Figure 6: (a) BMI, (b) verbal fluency, (c) short-term memory, and the corresponding fit to a MB distribution (d). Each panel also shows the Kolmogorov-Smirnov distance and the associated p-value. Table 3. Parameters (µ, σ, and τ), uncertainties (∆µ, ∆σ, and ∆τ), and coefficient of determination (R 2 ) from the exGaussian fitting of the analysed variables. In the last two rows, the results for the MB fitting are included. The fitted parameter (B f it ) is compared with the calculated parameter (B calc , with the ansatz introduced in this work) yielding a percentage difference of e B = 7.17 %.

Conclusions
The influence of the asymmetry in the chi distribution was investigated by means of fitting this function to the distribution resulting by taking three and five generating exGaussian functions. The results indicate that, for very small asymmetries (γ < 0.8) in the generating functions, good values for the coefficient of determination (R 2 > 0.97) are still obtained when the simulated distribution is fitted with a chi function for both k = 3 and k = 5. The results for k = 3 are also good for asymmetries larger than 1.2 while they worsen for k = 5. We also extend the ansatz proposed in [9] to include small asymmetries. As a practical example to illustrate the results of the Monte Carlo simulations, three health-related non-dichotomic variables (body mass index, verbal fluency and short-term memory) were studied. These variables were combined by taking the square root of the sum of their squares. The resulting new variable can be fitted by a Maxwell-Boltzmann (MB) distribution. Thus, the entire sample can be characterised by a one-parameter distribution, namely, B. The values of the MB variable can be considered as a new index Z able to characterise each individual in the sample. In this article, we chose three variables but this methodology can be extended to any number of variables that can be combined into a single scalar which is the variable of the resulting chi distribution.