# Size of the Whole versus Number of Parts in Genomes


## Abstract


## 1. Introduction

**against any special multiscale link between genome-level and chromosome-level patterns.** (boldface is ours)” [5]. By examining various groups of organisms from different kingdoms (fungi, plants, and animals), it will be shown here that dependencies between chromosome number and genome size are not a peculiarity of flowering plants, as might be concluded from the pioneering work of Vinogradov [1]. It will also be shown that, as genome size increases, the number of chromosomes increases in some groups while it decreases in others. Evidence that these dependencies are not simply linear will be provided.

## 2. Results

G and L_{g} are defined, respectively, as the length of a genome in million base pairs (Mb) and the size of the genome in number of chromosomes.

#### 2.1. Correlations between Genome Size and Chromosome Number

Figure 1 and Figure 2 show G versus L_{g} for the major groups of organisms analyzed in [6]. Certain groups of organisms, such as reptiles, birds and fungi, cluster in different regions of the space defined by G and L_{g}. For certain groups of organisms (e.g., reptiles), a dependency between G and L_{g} can be seen. However, a rigorous statistical correlation test is necessary. Separate plots of the relationship between G and L_{g} for each group are provided in Appendix A. Table 1 shows that a significant correlation between G and L_{g} is found in 9 out of 11 groups of organisms at a significance level of 0.05. The only groups where no significant correlation is found are birds and cartilaginous fishes. Therefore, G is indeed not a constant function of L_{g} for the majority of groups.
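The correlation analysis summarized in Table 1 can be sketched in a few lines. This is a minimal illustration and not the authors' code: it computes Spearman's ρ as the Pearson correlation of average ranks, using only the Python standard library. The group names and values in `toy_groups` are hypothetical and are not taken from the dataset of [6], and the p-value of ρ is deliberately left out.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (not from the paper): (chromosome number, genome size in Mb).
toy_groups = {
    "group_A": [(4, 120.0), (6, 150.0), (8, 310.0), (10, 290.0), (12, 500.0)],
    "group_B": [(5, 900.0), (7, 700.0), (9, 650.0), (11, 400.0), (13, 420.0)],
}

for name, pairs in toy_groups.items():
    L_g = [p[0] for p in pairs]
    G = [p[1] for p in pairs]
    print(name, round(spearman_rho(L_g, G), 3))
```

A positive ρ corresponds to “the more parts, the larger the whole” and a negative ρ to “the more parts, the smaller the whole”. Whether a given ρ is significant still requires a test such as the two-sided one reported in Table 1.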

#### 2.2. Non-Linearity

Further light on the relationship between G and L_{g} can be shed. If the relationship were purely linear, the point estimation of the slope should not show any dependency with either G or L_{g}. Table 2 shows that this linearity test (see Methods for further details) rejects the null hypothesis that G is a purely linear function of L_{g} for all groups (p-value < 10^{−7}). Non-linearity is consistent with the plots in Figure 1 and Figure 2 and in Appendix A, where it can easily be seen that the slope of a linear approximation in double logarithmic scale deviates, in many cases, clearly from one, the expected slope if the relationship were linear. However, our test cannot exclude that linearity is present in some part of the series, even though pure linearity has been rejected for the whole series.

**Figure 2.** Genome size G (in Mb) versus the number of chromosomes L_{g} (in 1n) for the major groups of plants analyzed in [6].

**Table 1.** Summary of the correlation analysis between genome size G (in Mb) and genome size L_{g} in number of chromosomes. N, ρ, and p are defined, respectively, as the number of different organisms, the value of Spearman’s rank correlation statistic for G versus L_{g}, and the p-value of ρ within a group of organisms. The values of ρ were rounded to leave only three decimals and the p-values were rounded to leave only one significant digit.

Group | N | ρ | p |
---|---|---|---|
Fungi | 56 | 0.280 | 0.04 |
Angiosperm plants | 4706 | −0.038 | 0.008 |
Gymnosperm plants | 170 | 0.315 | 3 × 10^{−5} |
Insects | 269 | 0.220 | 0.0003 |
Reptiles | 170 | 0.243 | 0.001 |
Birds | 99 | 0.008 | 0.9 |
Mammals | 371 | 0.297 | 5 × 10^{−9} |
Cartilaginous fishes | 52 | −0.129 | 0.4 |
Jawless fishes | 13 | −0.744 | 0.004 |
Ray-finned fishes | 647 | 0.487 | <10^{−17} |
Amphibians | 315 | 0.446 | 9 × 10^{−17} |

**Table 2.** Summary of the correlation analysis between genome size G (in million base pairs) and a = (G − c)/L_{g}, where L_{g} is the genome size in number of chromosomes and c is the intercept of a linear approximation of the dependency between G and L_{g} by a non-parametric linear regression method. N, ρ, and p are defined, respectively, as the number of different organisms, the value of Spearman’s rank correlation statistic for G versus a, and the p-value of ρ within a group of organisms. The values of ρ were rounded to leave only three decimals and the p-values were rounded to leave only one significant digit.

Group | N | ρ | p |
---|---|---|---|
Fungi | 56 | 0.666 | 2 × 10^{−8} |
Angiosperm plants | 4706 | 0.925 | <10^{−17} |
Gymnosperm plants | 170 | 0.992 | <10^{−17} |
Insects | 269 | 0.802 | <10^{−17} |
Reptiles | 170 | 0.791 | <10^{−17} |
Birds | 99 | 0.771 | <10^{−17} |
Mammals | 371 | 0.278 | 5 × 10^{−8} |
Cartilaginous fishes | 52 | 0.886 | <10^{−17} |
Jawless fishes | 13 | 0.951 | <10^{−17} |
Ray-finned fishes | 647 | 0.813 | <10^{−17} |
Amphibians | 315 | 0.983 | <10^{−17} |

## 3. Discussion

The dependencies between G and L_{g} can be classified into three qualitative types:

At least two simple mechanisms could produce a purely increasing linear dependency between G and L_{g}, i.e., G = aL_{g} + c with a > 0. First, imagine that all chromosomes are of about the same size a (and that a does not depend on the number of chromosomes). Then genome size G would be proportional to L_{g}, i.e., G = aL_{g}. Second, consider the case of genome duplication. Imagine that a new species is produced by adding k copies of the genome of an origin species (with k = 1 for genome duplication). The genomes generated by this mechanism would satisfy the relationship G = aL_{g}, where a = G^{0}/L_{g}^{0} is the ratio between G^{0} and L_{g}^{0}, respectively, the genome size and the chromosome number of the origin species. However, it has been shown here that a purely increasing linear dependency between G and L_{g} is not supported for any group in our dataset. This has an important biological implication: simple genome duplication is unlikely to be the only force shaping the class of organisms where “the more parts, the larger the whole”.
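The genome duplication argument can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical origin species (the values of `G0` and `Lg0` are made up for illustration) and shows that adding k copies of its genome leaves the ratio G/L_{g} unchanged, which is what G = aL_{g} with a = G^{0}/L_{g}^{0} states.

```python
def add_genome_copies(G0, Lg0, k):
    """Genome size and chromosome number after adding k copies of the origin genome."""
    return (k + 1) * G0, (k + 1) * Lg0


# Hypothetical origin species: 300 Mb spread over 6 chromosomes.
G0, Lg0 = 300.0, 6

for k in range(4):
    G, Lg = add_genome_copies(G0, Lg0, k)
    # The last column is the ratio a = G / L_g, which stays equal to G0 / Lg0.
    print(k, G, Lg, G / Lg)
```

Under this mechanism alone, all descendants of one origin species would fall on a straight line through the origin in a plot of G versus L_{g}.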

Consider the definition of L_{c} as a mean, i.e., L_{c} = G/L_{g}, which leads to L_{c} = a/L_{g} where a is a constant. However, L_{c} = a/L_{g} holds if and only if G is a constant function of L_{g}. In other words, the relationship between the mean size of the parts and the number of parts is trivial if and only if G is constant. In contrast, here it has been shown that G and L_{g} are significantly correlated in many groups of organisms. The classes “the more parts, the larger the whole” and “the more parts, the smaller the whole” violate the constancy assumption of [5]. Furthermore, it has been shown that, when such significant correlation is not found, the possibility that this is due to the small size of the group sample cannot be denied. Notice that [5] evaluates the goodness of the fit of L_{c} = a/L_{g} to actual data with a flawed test, which consists of fitting L_{c} = aL_{g}^{b} to actual data. If b = −1 is obtained, this implies that the hypothesis L_{c} = a/L_{g} is correct, according to [5]. However, obtaining b = −1 from data is a necessary but not a sufficient condition for L_{c} = a/L_{g}. In contrast, here we have investigated a sufficient condition for L_{c} ≠ a/L_{g}: if G is not a constant function of L_{g} then L_{c} = a/L_{g} cannot be true, at least in some region.

- G is chosen uniformly at random within the interval (G^{m}, G^{M}).
- L_{g} (the number of chromosomes of the organism) is chosen uniformly at random within the interval (L_{g}^{m}, L_{g}^{M}).

G and L_{g} are chosen independently in this model. Such independence is totally unrealistic, as our analyses and previous research [1] have revealed. Notice that the independence between G and L_{g} requires (if genomes with chromosomes of length zero are considered not allowed or totally unrealistic) that the condition L_{g}^{M} ≤ G^{m} + 2 is satisfied, so that all chromosomes can have length greater than or equal to one. This condition follows from L_{g} ≤ L_{g}^{M} − 1, G^{m} + 1 ≤ G and the condition for non-empty chromosomes, i.e., L_{g} ≤ G.
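The condition L_{g}^{M} ≤ G^{m} + 2 can be verified by brute force. The sketch below assumes, as the derivation above implicitly does, that G and L_{g} take integer values strictly inside their intervals; the function names and the numerical bounds are hypothetical and serve only as illustration.

```python
import random


def independence_is_feasible(G_min, Lg_max):
    """Condition from the text: L_g^M <= G^m + 2."""
    return Lg_max <= G_min + 2


def all_chromosomes_nonempty(G_min, G_max, Lg_min, Lg_max):
    """Brute-force check that L_g <= G for every integer pair in the open intervals."""
    return all(
        Lg <= G
        for G in range(G_min + 1, G_max)
        for Lg in range(Lg_min + 1, Lg_max)
    )


def sample_null_genome(G_min, G_max, Lg_min, Lg_max, rng):
    """Draw G and L_g independently and uniformly from the open integer intervals."""
    G = rng.randint(G_min + 1, G_max - 1)
    Lg = rng.randint(Lg_min + 1, Lg_max - 1)
    return G, Lg


# Hypothetical bounds, chosen only for illustration.
rng = random.Random(0)
print(independence_is_feasible(G_min=10, Lg_max=12))   # condition holds
print(all_chromosomes_nonempty(10, 40, 1, 12))         # agrees with the condition
print(independence_is_feasible(G_min=10, Lg_max=13))   # condition violated
print(all_chromosomes_nonempty(10, 40, 1, 13))         # some pair has L_g > G
print(sample_null_genome(10, 40, 1, 12, rng))
```

When the condition is violated, independent sampling can produce a genome with more chromosomes than units of length, so some chromosome would have to be empty.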

## 4. Methods

#### 4.1. Data

#### 4.2. A Test of Pure Linearity between G and L_{g}

G is said to be a purely linear function of L_{g} if and only if G = aL_{g} + c, where a and c are constants. If G was a purely linear function of L_{g}, one would have that a = (G − c)/L_{g} is a constant function of G. A two-sided Spearman rank correlation test was used to determine if there is a correlation between (G − c)/L_{g} and G. Notice that here the term ‘pure’ or ‘purely’ is not used to mean that the relationship between G and L_{g} is deterministically linear but to mean that E[G|L_{g}], the expectation of G given L_{g}, is exactly linear, i.e., E[G|L_{g}] = aL_{g} + c. The general assumption of regression (and also ours) is that G = E[G|L_{g}] + ε, where ε is an error that is typically assumed to be normally distributed with mean zero and constant standard deviation [15]. However, a non-parametric linear regression method, Theil’s incomplete method [16], was used to estimate the intercept c. This method has the following advantages over a simple parametric least squares linear regression [16]:

- It does not assume that all the errors are only in the y-direction.
- It does not assume that either the x- or y-direction errors are normally distributed.
- It is robust in the sense that it is not affected by the presence of outliers.
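The linearity test can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation. In particular, `theil_incomplete` follows our understanding of textbook descriptions of Theil’s incomplete method (pair each point in the lower half of the x-sorted data with its counterpart in the upper half, take the median of the pairwise slopes, then the median of the resulting intercepts), and the toy data are hypothetical. The p-value of ρ is omitted.

```python
from math import sqrt
from statistics import median


def theil_incomplete(x, y):
    """Slope and intercept by Theil's incomplete method (as understood here)."""
    pts = sorted(zip(x, y))
    half = len(pts) // 2
    # With an odd number of points the middle one is skipped for the slopes.
    lower, upper = pts[:half], pts[len(pts) - half:]
    slopes = [(yu - yl) / (xu - xl)
              for (xl, yl), (xu, yu) in zip(lower, upper) if xu != xl]
    slope = median(slopes)
    intercept = median([yi - slope * xi for xi, yi in pts])
    return slope, intercept


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of ranks; returns 0.0 by convention if a variable is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(var_x * var_y)


def linearity_statistic(G, Lg):
    """Spearman's rho between G and a = (G - c) / L_g, with c from Theil's method."""
    _, c = theil_incomplete(Lg, G)
    a = [(g - c) / l for g, l in zip(G, Lg)]
    return spearman_rho(G, a)


# Hypothetical data: an exactly linear series and a scattered one.
Lg_linear = [2, 4, 6, 8, 10, 12]
G_linear = [50 * l + 10 for l in Lg_linear]
Lg_scattered = [4, 4, 6, 6, 8, 8, 10, 10]
G_scattered = [100, 300, 150, 500, 200, 700, 260, 900]

print(linearity_statistic(G_linear, Lg_linear))
print(linearity_statistic(G_scattered, Lg_scattered))
```

For the exactly linear series, a is the same for every point, so no dependency between a and G can be detected. For the scattered series, a varies with G and the statistic moves away from zero.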

## Acknowledgments

## References and Notes

1. Vinogradov, A.E. Mirrored genome size distributions in monocot and dicot plants. Acta Biotheoretica **2001**, 49, 43–51.
2. Trivers, R.; Burt, A.; Palestis, B.G. B chromosomes and genome size in flowering plants. Genome **2004**, 47, 1–8.
3. Sankoff, D.; Ferretti, V. Karyotype distributions in a stochastic model of reciprocal translocation. Genome Res. **1996**, 6, 1–9.
4. De, A.; Ferguson, M.; Sindi, S.; Durrett, R. The equilibrium distribution for a generalized Sankoff-Ferretti model accurately predicts chromosome size distribution in a wide variety of species. J. Appl. Probab. **2001**, 38, 324–334.
5. Solé, R.V. Genome size, self-organization and DNA’s dark matter. Complexity **2010**, 16, 20–23.
6. Ferrer-i-Cancho, R.; Forns, N. The self-organization of genomes. Complexity **2009**, 15, 34–36.
7. DeGroot, M.H. Probability and Statistics, 2nd ed.; Addison-Wesley: Reading, MA, USA, 1989; p. 215.
8. Gregory, T.R. Genome size evolution in animals. In The Evolution of the Genome; Gregory, T.R., Ed.; Elsevier: San Diego, CA, USA, 2005; pp. 4–71.
9. Altmann, G. Prolegomena to Menzerath’s law. Glottometrika **1980**, 2, 1–10.
10. Boroda, M.G.; Altmann, G. Menzerath’s law in musical texts. Musikometrika **1991**, 3, 1–13.
11. Fuquan, K.; Kui, Z.; Yong, Z.; Tianguang, C.; Meinan, N.; Li, S.; Minghui, C.; Yizhong, Z. Analysis of length distribution of short DNA fragments induced by ^{7}Li ions using the random-breakage model. Chin. Sci. Bull. **2005**, 50, 841–844.
12. Becker, T.S.; Lenhard, B. The random versus fragile breakage models of chromosome evolution: A matter of resolution. Mol. Genet. Genomics **2007**, 278, 487–491.
13. Schubert, I. Chromosome evolution. Curr. Opin. Plant Biol. **2007**, 10, 109–115.
14. Li, X.; Zhu, C.; Lin, Z.; Wu, Y.; Zhang, D.; Bai, G.; Song, W.; Ma, J.; Muehlbauer, G.J.; Scalon, M.J.; et al. Chromosome size in diploid eukaryotic species centers on the average length with a conserved boundary. Mol. Biol. Evol. **2011**.
15. Ritz, C.; Streibig, J.C. Nonlinear Regression with R; Springer: New York, NY, USA, 2008.
16. Miller, J.C.; Miller, J.N. Statistics for Analytical Chemistry, 3rd ed.; Prentice Hall: London, UK, 1993; pp. 159–161.
17. Pearl, J. Causality: Models, Reasoning, and Inference; Cambridge University Press: Cambridge, UK, 2000.

## Appendix A

**Figure 4.** Genome size G (in Mb) versus the number of chromosomes L_{g} (in 1n) for angiosperm plants.

**Figure 5.** Genome size G (in Mb) versus the number of chromosomes L_{g} (in 1n) for gymnosperm plants.

**Figure 10.** Genome size G (in Mb) versus the number of chromosomes L_{g} (in 1n) for cartilaginous fishes.

**Figure 12.** Genome size G (in Mb) versus the number of chromosomes L_{g} (in 1n) for ray-finned fishes.

## Appendix B

The correlation between G and L_{g} for certain groups of organisms (Table 1) could change when these groups are subdivided using taxonomic information. Subdividing could yield paradoxical results such as (a) that a group of organisms shows no significant dependency but its subgroups do show a significant correlation or, conversely, that the significant correlation of the group is lost in the subgroups [7], or (b) that the sign of the significant correlation of the original group is the opposite of that of its subgroups [17].

**Table 3.** Summary of the correlations between genome size (G) and chromosome number (L_{g}) at different taxonomic levels. Boldface is used to indicate the taxonomic groups that are the target of our main analysis. +, −, ? are attached to the name of each target group to indicate, respectively, that the correlation between G and L_{g} was significant and positive, significant and negative, or not significant (at a significance level of 0.05). Next to each target group of organisms, the total number of organisms in our dataset is shown. In each cell for which taxonomic data is available, a triple of numbers is shown, followed by a pair of numbers. The triple follows the format x,y,z, where x and y are, respectively, the number of subgroups with significant positive and significant negative correlations, and z is the total number of subgroups. The pair follows the format x’,y’, where x’ and y’ are the number of organisms involved in significant positive and significant negative correlations, respectively.

| Kingdom | Phylum/Division | Class | Order | Family | Genus |
|---|---|---|---|---|---|
| **Fungi** + 56 | 0,3,5 0,55 | 0,4,5 0,34 | 0,1,40 0,5 | | |
| Plants | **Angiosperm** − 4706 | 22,7,194 2374,965 | 66,8,1114 1608,186 | | |
| | **Gymnosperm** + 170 | 0,4,14 0,122 | 0,2,52 0,13 | | |
| Animals | Arthropoda | **Insects** + 269 | 3,1,7 189,56 | 0,1,26 0,13 | |
| | Chordata | **Reptiles** + 170 | 0,0,4 0,0 | 1,1,34 14,18 | |
| | | **Birds** ? 99 | 0,0,17 0,0 | 0,0,33 0,0 | |
| | | **Mammals** + 371 | 2,1,17 162,54 | 5,0,63 89,0 | |
| | | **Cartilaginous fishes** ? 52 | 0,1,9 0,24 | 1,0,20 7,0 | |
| | | **Jawless fishes** − 13 | 0,0,2 0,0 | 0,0,2 0,0 | 0,0,2 0,0 |
| | | **Ray-finned fishes** + 647 | 4,0,30 262,0 | 3,0,115 214,0 | |
| | | **Amphibians** + 315 | 1,0,3 185,0 | 3,1,26 42,72 | |

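Two definitions of Simpson’s paradox are considered here: (a) the reversal of the sign of the significant correlation between G and L_{g} when splitting a group into subgroups; (b) the emergence or the loss of significant correlations between G and L_{g} when splitting a group into subgroups. Table 3 shows that, after splitting,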

- The sign of the significant correlations was totally reversed, in full agreement with definition (a) of Simpson’s paradox, only in fungi and gymnosperm plants.
- The sign of the significant correlation was totally maintained only in ray-finned fishes.
- The significant correlation was lost in jawless fishes, in agreement with definition (b) of the paradox.
- Significant correlations became a mixture of positive and negative correlations in angiosperm plants, insects, reptiles, mammals and amphibians.
- Non-significant correlations remained totally for birds.
- Significant correlations emerged only exceptionally in cartilaginous fishes (the number of significant correlations was very small with regard to the total number of subgroups), consistent with definition (b) of the paradox, but the sign of the correlation was not coherent.
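A sign reversal of the kind described by definition (a) is easy to reproduce with synthetic data. The sketch below uses two hypothetical subgroups (the numbers are invented and unrelated to Table 3) in which the correlation between chromosome number and genome size is perfectly negative within each subgroup but positive once the subgroups are pooled. For brevity it uses the shortcut formula for Spearman's ρ, which is valid only when there are no ties.

```python
def spearman_rho_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical subgroups: (chromosome numbers, genome sizes in Mb).
sub1 = ([2, 3, 4], [30, 20, 10])
sub2 = ([10, 11, 12], [130, 120, 110])
pooled = (sub1[0] + sub2[0], sub1[1] + sub2[1])

print("subgroup 1:", spearman_rho_no_ties(*sub1))
print("subgroup 2:", spearman_rho_no_ties(*sub2))
print("pooled:    ", round(spearman_rho_no_ties(*pooled), 3))
```

The pooled correlation is driven by the difference between the subgroups, not by the trend within them, which is why the sign at one taxonomic level need not carry over to another.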

© 2011 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Hernández-Fernández, A.; Baixeries, J.; Forns, N.; Ferrer-i-Cancho, R. Size of the Whole *versus* Number of Parts in Genomes. *Entropy* **2011**, *13*, 1465-1480.
https://doi.org/10.3390/e13081465
