# Clustering of Giant Virus-DNA Based on Variations in Local Entropy

## Abstract

## 1. Introduction

The Shannon entropy of a random variable $X$ is

$$H(X) = -\sum_{i=1}^{N} p(x_i)\,\log_2 p(x_i),$$

where $X$ takes values $\{x_1, x_2, \ldots, x_N\}$ drawn from a discrete sampling space $S$, and $p(x_i)$ is the probability that $x_i$ occurs. In the case of genomes, $S$ is the nucleotide alphabet of the DNA. When applied at the nucleotide level to genomic sequences, the sequence entropy for the full genome reduces to the question of GC content [1,2]. This is a global property related to the mutation rate, while selective advantage reveals itself locally, e.g., in regions coding for gene products such as proteins.
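As a minimal sketch of the nucleotide-level entropy discussed above, the following Python snippet estimates $p(x_i)$ from symbol counts and evaluates the entropy in bits. The function names `shannon_entropy` and `gc_content` are illustrative choices, not taken from the article, and the base-2 logarithm is an assumption based on the entropy values being reported in bits.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the symbol frequencies in `sequence`."""
    if not sequence:
        return 0.0
    n = len(sequence)
    counts = Counter(sequence.upper())
    # Each symbol contributes -p * log2(p), with p estimated as count / n.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def gc_content(sequence: str) -> float:
    """Fraction of G and C symbols in `sequence` (all symbols count in the length)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy sequence (hypothetical, not real genomic data).
toy = "ACGT" * 25
toy_entropy = shannon_entropy(toy)
toy_gc = gc_content(toy)
```

A uniform four-letter composition gives the maximum of 2 bits, while a homopolymer gives 0 bits.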

## 2. Superinformation Revisited

The $i$-th block has entropy $H(X_i)$. By definition, $H(X_i)$ is a non-negative quantity. Then, a probability measure can be derived by the following algorithm. First, bin the $H(X_i)$ values, i.e.,

$$\{p_j(X_i, M)\} = \mathrm{Histogram}(\{H(X_i)\}), \quad i = 1, 2, \ldots, N,$$

where Histogram sorts the $\{H(X_i)\}$ into $M$ equally spaced bins and returns the number of elements in each bin. Second, normalize the counts, so that $p_j(X_i, M)$ is the frequency of $H(X_i)$ in the $j$-th bin.
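The binning procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: blocks are taken as non-overlapping with any trailing partial block dropped, the bins span the fixed range 0 to 2 bits (the possible range for a four-letter alphabet), and the superinformation $H_s$ is taken to be the Shannon entropy of the resulting histogram. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols: str) -> float:
    """Shannon entropy (bits) of the symbol frequencies in `symbols`."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(symbols).values())


def block_entropies(sequence: str, block_size: int) -> list:
    """Entropy H(X_i) of each non-overlapping block of length `block_size`.

    Assumption: a trailing block shorter than `block_size` is discarded.
    """
    n_blocks = len(sequence) // block_size
    return [
        shannon_entropy(sequence[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


def entropy_histogram(entropies, n_bins: int, lo: float = 0.0, hi: float = 2.0) -> list:
    """Relative frequencies p_j of the entropies in `n_bins` equal bins on [lo, hi].

    Assumption: a fixed range keeps histograms of different genomes comparable.
    """
    if not entropies:
        raise ValueError("need at least one block entropy")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for h in entropies:
        # Clamp so that h == hi falls into the last bin.
        j = min(int((h - lo) / width), n_bins - 1)
        counts[j] += 1
    return [c / len(entropies) for c in counts]


def superinformation(p) -> float:
    """Entropy (bits) of the block-entropy histogram; assumed definition of H_s."""
    return sum(-pj * math.log2(pj) for pj in p if pj > 0)


# Toy sequence (hypothetical): two maximally mixed blocks, two homopolymer blocks.
toy = "ACGT" * 50 + "AAAA" * 50
h_blocks = block_entropies(toy, block_size=100)
p_hist = entropy_histogram(h_blocks, n_bins=4)
h_s = superinformation(p_hist)
```

For the toy sequence the block entropies split evenly between the lowest and highest bin, so the histogram carries exactly one bit.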

## 3. Clustering of Genomic Sequences

**Figure 1.** Sensitivity of the probability density function (pdf) of sequence entropies with respect to block size B for (**a**) the viral genome of PBCV1; and (**b**) Chlorella NC64A. We also show the superinformation $H_s$.

The single value $H_s$ is, however, unsuitable for clustering. Owing to Chargaff's second parity rule, G% ≈ C% and A% ≈ T%, together with GC% + AT% = 1, the four base compositions in a block (three of which are independent) reduce to one variable: GC%. The entropy of a block therefore has, more or less, a one-to-one correspondence with its GC%, and the distribution of block entropies is similar to the distribution of a function of GC%. Thus the superinformation for the whole sequence corresponds to some measure of the window-GC% distribution (e.g., its variance).
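The link between block entropy and GC% can be made explicit. If the parity held exactly within a block, with G = C = g/2 and A = T = (1 − g)/2 for a GC fraction g, the block entropy would be $H = 1 + h_2(g)$ bits, where $h_2$ is the binary entropy function. The sketch below checks this identity numerically; exact parity inside every block is an idealizing assumption, and the function names are hypothetical.

```python
import math


def binary_entropy(g: float) -> float:
    """Binary entropy h2(g) in bits, with h2(0) = h2(1) = 0."""
    if g <= 0.0 or g >= 1.0:
        return 0.0
    return -g * math.log2(g) - (1.0 - g) * math.log2(1.0 - g)


def entropy_from_gc(g: float) -> float:
    """Block entropy under exact parity G = C = g/2, A = T = (1 - g)/2."""
    return 1.0 + binary_entropy(g)


def entropy_from_composition(freqs) -> float:
    """Shannon entropy (bits) computed directly from base frequencies."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


# GC fraction 0.4 under exact parity: G = C = 0.2, A = T = 0.3.
direct = entropy_from_composition([0.2, 0.2, 0.3, 0.3])
via_gc = entropy_from_gc(0.4)
```

Because $h_2(g) = h_2(1 - g)$, blocks with GC fractions g and 1 − g have the same entropy, which is one reason the correspondence is only "more or less" one-to-one.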

**Table 1.** Superinformation values $H_s$ for various genomic sequences. Note the tendency of increased $H_s$ for organisms in comparison to viruses.

| Genomic Sequence | $H_s$ [bit] |
|---|---|
| AR158 | 4.21 |
| ATCV | 3.62 |
| CVM1 | 3.87 |
| FR483 | 3.83 |
| MT325 | 3.77 |
| NY2A | 4.22 |
| PBCV1 | 4.38 |
| TN603 | 3.65 |
| EsV-1 | 3.54 |
| Arabidopsis thaliana | 5.27 |
| Chlorella NC64A | 5.10 |
| Oryza sativa | 4.74 |
| Ectocarpus siliculosus | 4.10 |

We therefore use the distributions $p_j(X_i, M)$ instead. These distributions describe, in detail, the local variability of the sequences, and thus any distance between such $p_j(X_i, M)$ clusters entities based on the variability of their entropies. The distributions of the block entropies of the genomic sequences are shown in Figure 2. Upon visual inspection, there appears to be a distinct pattern for viruses and for plant genomes: the distributions for plant genomes show a larger variance than those of the viruses. This motivates us to explore clustering based on the distributions of block entropies. We note that the characterization of sequences based on the distribution of their sub-sequences has been explored earlier [12,13,14].

**Figure 2.** Plot of the probability density function (pdf) of the entropy for 8 plant viral genomes (AR158, ATCV, CVM1, FR483, MT325, NY2A, PBCV1, TN603), Arabidopsis thaliana, Chlorella NC64A, Ectocarpus siliculosus and EsV-1. The plots were obtained for a block size of B = 100.

We compare the distributions $p_j(X_i, M)$ by generalized Kullback-Leibler divergences [15], in particular the Jensen-Shannon distance [16], as discussed in [17,18]. The Jensen-Shannon divergence $D_{JS}(p,q)$ between the entropy distributions $p_j(X_i, M)$ and $q_j(X_i, M)$ of two data sources is given by

$$D_{JS}(p,q) = \tfrac{1}{2}\,D_{KL}(p\,\|\,m) + \tfrac{1}{2}\,D_{KL}(q\,\|\,m),$$

where $m_j(X_i, M) = \tfrac{1}{2}\left[p_j(X_i, M) + q_j(X_i, M)\right]$ is an "intermediate" distribution and $D_{KL}(p\,\|\,m)$ and $D_{KL}(q\,\|\,m)$ are the Kullback-Leibler divergences between $p$ and $m$ and between $q$ and $m$, respectively. A suitable metric for clustering is then $d(p,q) = \sqrt{D_{JS}(p,q)}$ [16]. The Jensen-Shannon distance has been applied to DNA before [19].
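A minimal Python sketch of this distance follows. It assumes that the intermediate distribution is the equal-weight mixture of the two inputs and that the metric $d(p,q)$ is the square root of the Jensen-Shannon divergence, which is the usual form of the Jensen-Shannon distance. The function names and the toy histograms are hypothetical.

```python
import math


def kl_divergence(p, q) -> float:
    """Kullback-Leibler divergence D_KL(p || q) in bits.

    Bins with p_j = 0 contribute nothing; q_j must be positive wherever p_j is.
    """
    return sum(pj * math.log2(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    m = [0.5 * (pj + qj) for pj, qj in zip(p, q)]
    # m_j > 0 whenever p_j > 0 or q_j > 0, so both KL terms are finite.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q) -> float:
    """Square root of the JS divergence; clamped against tiny negative rounding."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


def distance_matrix(histograms: dict) -> dict:
    """Pairwise JS distances for a mapping name -> normalized histogram."""
    names = list(histograms)
    return {
        (a, b): js_distance(histograms[a], histograms[b])
        for a in names
        for b in names
    }


# Toy histograms over M = 4 bins (hypothetical values, not taken from the article).
toy = {
    "virus_a": [0.0, 0.1, 0.6, 0.3],
    "virus_b": [0.0, 0.2, 0.5, 0.3],
    "host_c": [0.2, 0.3, 0.3, 0.2],
}
d = distance_matrix(toy)
```

With base-2 logarithms the distance lies between 0 (identical histograms) and 1 (histograms with disjoint support), and the resulting symmetric matrix can be passed to any hierarchical clustering routine.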

**Figure 3.** Clustering of viral genomes, annotated by their respective hosts, in this case variants of the Chlorella algae. The branch lengths are proportional to the phylogenetic Jensen-Shannon distance of Equation (4); the scale is indicated at the bottom.

**Figure 4.** Clustering by the entropy-variability distance of Equation (1) for the extended genomic sequence set. The center plane shows the distance matrix $d(p,q)$ for every pairing $(p,q)$ of genomic sequences (red = small values, green = high values). The bar on the left indicates whether a sequence is of viral origin (blue) or from a living organism (orange). Note that the tree is not rooted.

## 4. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References and Notes

1. Aïssani, B.; Bernardi, G. CpG islands, genes and isochores in the genomes of vertebrates. Gene **1991**, 106, 185–195.
2. Pozzoli, U.; Menozzi, G.; Fumagalli, M.; Cereda, M.; Comi, G.P.; Cagliani, R.; Bresolin, N.; Sironi, M. Both selective and neutral processes drive GC content evolution in the human genome. BMC Evol. Biol. **2008**, 8, 1–12.
3. Bose, R.; Chouhan, S. Alternate measure of information useful for DNA sequences. Phys. Rev. E **2011**, 83, 051918.
4. Van Etten, J.L.; Lane, L.C.; Dunigan, D.D. DNA viruses: the really big ones (giruses). Ann. Rev. Microbiol. **2010**, 13, 83–99.
5. Cock, J.M.; Sterck, L.; Rouzé, P.; Scornet, D.; Allen, A.E.; Amoutzias, G.; Anthouard, V.; Artiguenave, F.; Aury, J.M.; Badger, J.H.; et al. The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature **2010**, 465, 617–621.
6. Yoon, H.S.; Hackett, D.J.; Ciniglia, C.; Pinto, G.; Bhattacharya, D. A molecular timeline for the origin of photosynthetic eukaryotes. Mol. Biol. Evol. **2004**, 21, 809–819.
7. Iyer, L.M.; Burroughs, A.M.; Aravind, L. The prokaryotic antecedents of the ubiquitin-signaling system and the early evolution of ubiquitin-like beta-grasp domains. Genome Biol. **2006**, 7, R60.
8. Raoult, D.; Audic, S.; Robert, C.; Abergel, C.; Renesto, P.; Ogata, H.; La Scola, B.; Suzan, M.; Claverie, J.M. The 1.2-megabase genome sequence of Mimivirus. Science **2004**, 306, 1344–1350.
9. Villarreal, L.P.; DeFilippis, V.R. A hypothesis for DNA viruses as the origin of eukaryotic replication proteins. J. Virol. **2000**, 74, 7079–7084.
10. Bernardi, G. Isochores and the evolutionary genomics of vertebrates. Gene **2000**, 241, 3–17.
11. Zhang, R.; Zhang, C.T. Isochore structures in the genome of the plant Arabidopsis thaliana. J. Mol. Evol. **2004**, 59, 227–238.
12. Herzel, H.; Ebeling, W.; Schmitt, A.O. Entropies of biosequences: The role of repeats. Phys. Rev. E **1994**, 50, 5061–5071.
13. Schmitt, A.O.; Herzel, H. Entropies of biosequences: The role of repeats. J. Theor. Biol. **1997**, 188, 369–377.
14. Karlin, S.; Burge, C.; Campbell, A.M. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucl. Acids Res. **1992**, 20, 1363–1370.
15. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2007.
16. Li, M.; Chen, X.; Li, X.; Ma, B.; Vitanyi, P.M.B. The similarity metric. IEEE Trans. Info. Theory **2004**, 50, 3250–3264.
17. Hoffgaard, F.; Weil, P.; Hamacher, K. BioPhysConnectoR: Connecting sequence information and biophysical models. BMC Bioinformatics **2010**, 11, 199.
18. Hamacher, K. Protein Domain Phylogenies—Information Theory and Evolutionary Dynamics. In Biomedical Engineering Systems and Technologies; Springer: Berlin/Heidelberg, Germany, 2010; pp. 114–122.
19. Bernaola-Galván, P.; Román-Roldán, R.; Oliver, J.L. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys. Rev. E **1996**, 53, 5181–5189.
20. Murtagh, F. COMPSTAT Lectures No. 4; Physica-Verlag: Würzburg, Germany, 1985.
21. Fitzgerald, L.A.; Graves, M.V.; Li, X.; Feldblyum, T.; Hartigan, J.; Van Etten, J.L. Sequence and annotation of the 314-kb MT325 and the 321-kb FR483 viruses that infect Chlorella Pbi. Virology **2007**, 358, 459–471.
22. Blanc, G.; Duncan, G.; Agarkova, I.; Borodovsky, M.; Gurnon, J.; Kuo, A.; Lindquist, E.; Lucas, S.; Pangilinan, J.; Polle, J.; et al. The Chlorella variabilis NC64A genome reveals adaptation to photosymbiosis, coevolution with viruses, and cryptic sex. Plant Cell **2010**, 22, 2943–2955.

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Bose, R.; Thiel, G.; Hamacher, K.
Clustering of Giant Virus-DNA Based on Variations in Local Entropy. *Viruses* **2014**, *6*, 2259-2267.
https://doi.org/10.3390/v6062259
