A Bayesian Analysis of Plant DNA Length Distribution via κ-Statistics

We report an analysis of the distribution of lengths of plant DNA (exons). Three species of Cucurbitaceae were investigated. In our study, we used two distinct κ distribution functions, namely, κ-Maxwellian and double-κ, to fit the length distributions. To determine which distribution has the best fitting, we made a Bayesian analysis of the models. Furthermore, we filtered the data, removing outliers, through a box plot analysis. Our findings show that the sum of κ-exponentials is the most appropriate to adjust the distribution curves and that the values of the κ parameter do not undergo considerable changes after filtering. Furthermore, for the analyzed species, there is a tendency for the κ parameter to lay within the interval (0.27;0.43).

The first three species cited are the most economically important as a popular food resource [4]. The fruits of the species are incredibly diverse, differing greatly in shape, surface topography, color, size, and color pattern [5]. Among them, C. pepo is the genus' most phenotypically variable species and has eight cultivar groups with edible fruits (groups) [6]. The second most diversified species in the genus is thought to be C. moschata [7].
There are few estimates of genome size in the genus Cucurbita. However, studies have shown relatively small genome sizes. The genome sizes of C. maxima and C. moschata were estimated to be 271. 40 and 269.90 Mb, respectively, [9], while the genome size in C. pepo was estimated to be 263.0 Mb [10]. Concerning the number of genes, the estimated values for C. maxima, C. moschata, and C. pepo were 32.076; 32.205 [9]; and 27.868 genes [10], respectively.
On the other hand, numerous models based on statistical physics consistently attempt to represent statistical features, such as long-range and short-range correlations, in light of the large DNA sequence data. Some approaches used statistical tools in connection with random-walk simulations [12][13][14], wavelet transforms [15,16], 1D Ising models [17] (see e.g., [18] and references therein), and Tsallis' statistics together with Machine Learning [19]. Many live creatures' coding and non-coding sequence length distributions have been studied by some models in relation to long-and short-range correlations [20][21][22][23]. Non-additive entropy-based statistical physics methods have recently been actively advocated for use in complex system research [24,25]. In this case, the Kaniadakis entropy yields a power-law distribution rather than an exponential one and depends on a free parameter (the κ parameter) [26][27][28]. The κ-statistics arose as a useful statistical tool for many systems (see [29] and references therein). For problems associated with human DNA, see e.g., [30,31].
Additionally, the Bayesian inference has been effectively applied as a useful tool to investigate a number of issues in physics [32] and biophysics [33]. Which DNA models should be valid from the perspective of Bayesian inference is an intriguing subject. Additionally, the challenge in the context of this work would be to investigate an expansion of a model from Ref. [31], but this time in the context of other living structures, such as vegetables.
More recently in [34], statistical models of the Tsallis type provided the distribution of nucleotide chain lengths, successfully capturing the statistical correlations between the parts of the plant (for both coding and non-coding) DNA strands for two species of the Cucurbitaceae family. We expand the paradigm proposed in [31] in the context of vegetables in this article. We especially evaluate the distribution of nucleotide chain lengths measured in base pairs for Cucurbita maxima, Cucurbita moschata, and Cucurbita pepo utilizing κ-deformed statistics in light of the social and economic significance of cucurbits. The most practical model is then chosen using a Bayesian statistical analysis based on the κ-distributions. To the best of our knowledge, this is the first time the size distribution of plant DNA has been realized using a κ-statistical analysis.

Materials and Methods
We use the κ-statistics, developed by Kaniadakis [26][27][28], to analyze the correlations between the DNA length distributions of some species of the Cucurbitaceae family. There are some works in this direction using the Tsallis q-statistics [34][35][36]. The κ-entropy and power-law distribution functions naturally arise from the kinetic foundations of κ-statistics. Formally, the κ-framework is based on the κ-exponential and κ-logarithm functions (see Ref. [26]), defined as The parameter κ is restricted to values belonging to the range |κ| < 1; for κ = 0, these expressions reduce to the usual exponential and logarithmic functions. From the optimization of entropy S κ (see Ref. [37]),we can obtain the probability distributions (P κ,1 (l)) associated with the quantities of base pairs (bp) for each of the chromosomes of Cucurbita maxima, Cucurbita moschata, and Cucurbita pepo. Mathematically, the Kaniadakis entropy S κ is given by The optimization process is well described in Refs. [26,[37][38][39][40][41] and gives us P κ,1 (l) Rewriting (4) with the explicit form of exp κ (−βl) given by (1), and using constraints as in Ref. [41], we get Here, L κ is an adjustable parameter that is related to the mean value of the length distribution, κ is the model's free parameter which measures the interaction between the nucleotides in the sample, and l is the chain of nucleotides' length, expressed in number of base pairs.
We employ the cumulative probability distribution because the probabilities for lengthy lengths l of the nucleotide chain are subject to significant fluctuations.
We employ the cumulative probability distribution because the probabilities for lengthy lengths l of the nucleotide chain are subject to significant fluctuations, (5) can be found by solving Φ(l) = p(l < l) = l 0 p(l )dl , which provides where Here, Φ(l) denotes the probability of finding the sizes of the bases between 0 and l. In Ref. [34], it was proposed a comparison between the q-exponential and a sum of q-exponentials to explain the DNA length distribution of two species of cucurbits, Cucumis melo and Cucumis sativus. Based on this work, we propose an analysis of the same type but using the κ-statistics. We assume that the sum of Kaniadakis-type generalized probabilities (already normalized) is given by where κ, γ 1 , and γ 2 are adjustable parameters and l is the length of the nucleotides, respectively. By employing the identical steps as those leading to (6), the cumulative probability distribution is found to be where Initial analyses indicate that, as occurred for the Tsallis' q-statistics [34], the κ-exponential sum model best fits the DNA length distributions of the species studied here. Therefore, we chose to make a comparison between the sum of κ-exponentials (9) and the κ-Maxwellian model (11) below, proposed in [31] to explain the length distribution of human DNA.
The best model to describe the length distributions of the nucleotides for three species of the Cucurbitaceae family is obtained by comparing, via Bayesian analysis, the distributions Φ κ,2 (l) and Φ κ,3 (l), which are represented by Equations (9) and (11), respectively.

Results
We use the public database of the National Center for Biotechnology Information (NCBI) [42] and the Comparative Genomics (CoGe) [43]. They are databases that give users access to genetic and biological data. In our analysis, we considered only the coding bases (exons). We define a nucleotide sequence's length in terms of the l (bp) base pairs. All graphical and data modeling was written in R, a free statistical software [44].
By plotting the cumulative probability distribution function (CDF) and a box plot for chromosome 02 of one of the species studied here (Figure 1), we can see that some points are very far from the distribution and can be considered outliers. There are various techniques for defining, spotting, and dealing with outliers [45]. In this work, we decided to use the box plot approach. Outliers in this approach are points that are below the region Q1 − 1.5 × IQR and above Q3 + 1.5 × IQR, where Q1, Q2, and Q3 are first, second, and third quartile, respectively, and IQR is the interquartile region defined as IQR = Q3 − Q1.
To prevent these points from influencing the behavior of the proposed models, we decided to remove them. The cut was made around 1% of the cumulative distribution, designated by the hatched square in the lower right corner of Figure 1a. A similar approach has been proposed in [46] to analyze the length distribution of human DNA. Table A1 describes the statistical characteristics of some chromosomes of the three species of Cucurbitaceae after removing these outliers. We decided to analyze the impact this action had on the value of κ, taking into account the cumulative distribution functions (9) and (11). In Tables A2 and A3, we have the number of nucleotides (N) and the best fit values per κ. The subscripts 0 and f represent the values before and after the outliers are removed, and (RD) represents the relative difference between them. The values of RD are smaller than the errors associated with the values of κ in Tables A4-A6. This work deals with a statistical analysis of the distribution of DNA lengths in plants. Possible biological effects caused by removing nucleotides with large amounts of base pairs were not taken into account.
In Figures 2-4, we show the cumulative distributions, for exons, for some chromosomes of Cucubita maxima, Cucurbita moschata, and Cucubita pepo, with the other chromosomes behaving similarly. To get the best fit values for κ, the distribution functions (9) and (11) were fitted to the lengths (l). Tables A4-A6 show all numerical results for the parameters κ, γ 1 and γ 2 for distribution (9) in addition to κ and σ κ for distribution (11).
Chromosome numbers are displayed in the first column (CHR), and the number of nucleotide chains is displayed in the second column (N) (exons). The correlations between the values of l are measured by the values of κ [26][27][28]39]. According to [36,47], the coding part of human DNA tends to present short-range correlations. The same behavior for plant DNA can be observed in [34]. This implies κ values close to zero. It is worth remembering that in the limit κ → 0, we return to the well-known Boltzmann-Gibbs-Shannon statistics [26].  (9) and (11). The other chromosomes follow the same pattern.
The models that fit the length distribution Φ(l) the best are determined via Bayesian statistics. By taking into account the probability distribution of the hypotheses, conditioned on the evidence, Bayesian inference describes the relationship between the model and the data, and enables a rational and effective selection of one or more hypotheses [48]. The Bayes' theorem, offers us the likelihood that, given the data D, a posterior model Φ will be correct. For this, the probability of the prior model P(Φ|M) is multiplied by the likelihood function L(D|Φ, M) and divided by the Bayesian evidence E (D|M). Here, we assume the pattern χ 2 = (P(l obs ) − P(l the )) 2 /σ 2 obs for the likelihood function, where P(l obs ), P(l the ) and σ obs are the cumulative probabilities associated with the observed and the theoretical nucleotide lengths, and observed errors, respectively.
The input parameters used in the prior uniform distribution were obtained from the best fit found by the R-code. This approach, which defines the model parameters' potential range and significantly affects the Bayesian evidence, is a crucial phase in the study. This condition ensures that the parameters will fall inside the previously identified optimal adjustment range.  11, 13, 15, 18, and 19. The blue and red curves are, respectively, the distributions (9) and (11). The other chromosomes follow the same pattern.
In order to compare the models, we make use of the Bayes factor, which is given by Here, E j is the evidence of the base model, which is used as a reference. In our case, this is the distribution (9), and E i is the evidence of the model we want to compare, given by distribution (11). We employ the Bayes factor interpretation provided by Jeffrey's theory [35,[54][55][56] to measure whether a model has favorable evidence in comparison to the base model. Table A7 contains the findings for each chromosome.
The Bayesian analysis is performed from each model's range of definite parameters. Therefore, the better we understand the behavior of the parameters, the more accurate our analysis will be, and we can guarantee that the evidence found will represent the curve with the best fit [48]. In Figures 5-7, we have scatter plots for the parameters of the models (9) (a) and (11) (b). For all chromosomes of all species analyzed here, we found strong correlations between the parameters γ 1 and γ 2 present in the distribution (9). This was expected, as this model appears as a variation of the model (6), as carried out in [34]. These two adjustable constants together (γ 1 and γ 2 ) have an inverse role to what L κ has in the distribution (6), and when γ 1 = γ 2 , we obtain the model (6) again. This implies that these parameters are related to the κ parameter in the same way, resulting in similar images for scattering but with different ranges. This behavior was repeated for all chromosomes.
The κ S parameter (that is, the κ value that provides the best fit, when using the sum of κ-exponentials, Equation (9)) in Tables A4-A6, measures the correlation between lengths l, and belongs to the range (0.27(4); 0.37 (2)) in the case of Cucubita maxima, (0.28(3); 0.40(4)) for Cucubita moschata, and (0.32(3); 0.43(3)) for Cucubita pepo. It can be seen in Figure 8 that the values of κ, for different species, seem to specify a universal behavior. Therefore, all of these findings lead us to the conclusion that for all the species under study, the model (9) (sum of κ-exponentials) is strongly preferred over the distribution model (11) (κ-Maxwellian).    Cucurbita Maxima Cucurbita Moschata Cucurbita Pepo Figure 8. κ values, from the best fit model, Equation (9), for different species. In red, blue, and black, we have, respectively, Cucurbita maxima, Cucurbita moschata, and Cucurbita pepo.

Conclusions
A statistical model based on non-additive statistics was developed to describe the size distribution of nucleotide chains in the DNA of species belonging to the Cucurbitaceae family, namely Cucurbita maxima, Cucurbita moschata, and Cucurbita pepo [26][27][28]31]. Specifically, the proposed distribution, Equation (9), expands on a distribution studied in [41] through the sum of the κ-exponentials, which added the parameters γ 1 and γ 2 to capture the statistical correlations between the DNA strands. Another model investigated was the κ-Maxwellian distribution, Equation (11), proposed in [31] for human DNA. We tested the statistical feasibility of models, as well as methods based on Bayesian statistical analysis using the NCBI project database. The cumulative distribution function (9) best fitted the nucleotide base for all chromosomes, of the three species, with the parameter κ belonging to the range (0.27(4); 0.37(2)) for Cucurbita maxima, (0.28(3); 0.40(4)) for Cucurbita moschata, and (0.32(3); 0.43 (3)) in the case of Cucurbita pepo. It can be seen in Figure 8 that the values of κ for different species of the coding parts (exons) of the DNA appear to be within a common and relatively narrow range.
Regarding the Bayesian analysis, we compared the κ-exponential-sum distribution with the κ-Maxwellian model. We demonstrated that the first has solid and favorable evidence compared to the κ-Maxwellian distribution. This was reasonably expected given that the distribution (9) has a free parameter for potential future adjustments. A general task should be to expand the model presented in this study to include additional species, determining whether they fall within the same range of κ for exons (0.35 ± 0.08) discovered for the species investigated here.

Data Availability Statement:
The DNA code data that support the findings of this study are available in NCBI [42].

Acknowledgments:
The authors thank William J. da Silva for technical discussions on the Bayesian analysis.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A Table A1. Statistical characteristics of the data after outliers are removed. The first column indicates the chromosome, the second the number of exons, the third and fourth the minimum and maximum lengths. Finally, the quartiles Q1, Q2, and Q3 are in the fifth, sixth, and seventh columns, respectively. In the later columns, we have indicated the same parameters for the other species of cucurbitaceae.

CHR
Cucurbita maxima Cucurbita moschata Cucurbita pepo  Table A3. The same as Table A2, but for model (11).  Table A4. The average of the best fit parameters for the Cucurbita maxima species. The sub-index S and M represent the κ-exponential sum function (9) and the κ-Maxwellian function (11), respectively. σ κ , γ 1 , and γ 2 are free parameters related to the length of the nucleotide chain. The numbers in parenthesis denote the calculated errors.    Table A7. Bayesian analysis for exons of each chromosome. The column ln(E ) gives us the Baysian evidence for each of the models, Equation (11) for (i) and (9) for (j). The indices max, mos, and pep represent, respectively, the species Cucurbita maxima, Cucurbita moschata, and Cucubita pepo. The numbers in parenthesis indicate the calculated errors.