Novel Graphical Representation and Numerical Characterization of DNA Sequences

Modern sequencing technique has provided a wealth of data on DNA sequences, which has made the analysis and comparison of sequences a very important but difficult task. In this paper, by regarding the dinucleotide as a 2-combination of the multiset t8 ̈A,8 ̈G,8 ̈C,8 ̈Tu, a novel 3-D graphical representation of a DNA sequence is proposed, and its projections on planes (x,y), (y,z) and (x,z) are also discussed. In addition, based on the idea of “piecewise function”, a cell-based descriptor vector is constructed to numerically characterize the DNA sequence. The utility of our approach is illustrated by the examination of phylogenetic analysis on four datasets.


Introduction
The rapid development of DNA sequencing techniques has resulted in explosive growth in the number of DNA primary sequences, and the analysis and comparison of biological sequences has become a topic of considerable interest in Computational Biology and Bioinformatics.The traditional measure for similarity analysis of DNA sequences is based on multiple sequence alignment, which uses dynamic programming techniques to identify the globally optimal alignment solution.However, the sequence alignment problem is NP-hard (non-deterministic polynomial-time hard), making it infeasible for dealing with large datasets [1].To overcome the limitation, a lot of alignment-free approaches for sequence comparison have been proposed.
The basic idea behind most alignment-free methods is to characterize DNA by certain mathematical models derived for DNA sequence, rather than by a direct comparison of DNA sequences themselves.Graphical representation is deemed to be a simple and powerful tool for the visualization and analysis of bio-sequences.The earliest attempts at the graphical representation of DNA sequences were made by Hamori and Ruskin in 1983 [2].Afterwards, a number of graphical representations were well developed by researchers.For instance, by assigning four directions defined by the positive/negative x and y coordinate axes to the four nucleic acid bases, Gates [3], Nandy [4,5], and Leong and Morgenthaler [6] introduced three different 2-D graphical representations, respectively.While Jeffrey [7] proposed a chaos game representation (CGR) of DNA sequences, in which the four corners of a selected square are associated with the four bases respectively.In 2000, Randic et al. [8] generalized these 2-D graphical representations to a 3-D graphical representation, in which the center of a cube is chosen as the origin of the Cartesian (x,y,z) coordinate system, and the four corners with coordinates (+1,´1,´1), (´1,+1,´1), (´1,´1,+1), and (+1,+1,+1) are assigned to the four bases.Some other graphical representations of bio-sequences and their applications in the field of biological science and technology can be found in [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24].
Numerical characterization is another useful tool for sequence comparison.One way to arrive at the numerical characterization of a DNA sequence is to associate the sequence with a vector whose components are related to k-words, including the single nucleotide, dinucleotide, trinucleotide, and so on [25][26][27][28][29][30].In addition, the numerical characterization can be accomplished by associating with a graphical representation given by a curve in the space (or a plane) structural matrices, such as the Euclidean-distance matrix (ED), the graph theoretical distance matrix (GD), the quotient matrix (D/D, M/M, L/L), and their "higher order" matrices [8][9][10][11][12][13][14][15][16][17][18][31][32][33].Once a matrix representation of a DNA sequence is given, some matrix invariants, e.g. the leading eigenvalues, can be used as descriptors of the sequence.This technique has been widely used in the field of biological science and medicine, and different types of matrices are defined to construct various invariants of DNA sequences.However, the order of these matrices is equal to n, the length of the DNA sequence considered.A problem we must face is that the calculation of these matrix invariants will become more and more difficult with larger n values [17,24,32].
In this paper, based on all of the 2-combinations of the multiset t8¨A, 8¨G, 8¨C, 8¨Tu, we propose a novel graphical representation of DNA sequences.Then, according to the idea of "piecewise function", we describe a particular scheme that transforms the graphical representation of DNA into a cell-based descriptor vector.The introduced vector leads to more simple characterizations and comparisons of DNA sequences.

The 3-D Graphical Representation
As we know, the four nucleic acid bases A, G, C, and T can be classified into three categories: R " tA, Gu{Y " tC, Tu; M " tA, Cu{K " tG, Tu; W " tA, Tu{S " tG, Cu.
In fact, these groups are just all of the non-repetition 2-combinations of set {A,G,C,T}.If repetition is allowed, in other words, if we consider multiset t8¨A, 8¨G, 8¨C, 8¨Tu instead of the set {A,G,C,T}, then the number of 2-combinations equals 10 (see Table 1).Let V be a regular tetrahedron whose center is at the origin O " p0, 0, 0q.V 1 = (+1,+1,+1), V 2 = (´1,´1,+1), V 3 = (+1,´1,´1), and V 4 = (´1,+1,´1) are its four vertices.To each of the vertices we assign one of the four nucleic acid bases A, C, G and T.Moreover, to the midpoint of the line segment AC we assign M, and K to the midpoint of the line segment GT, R to that of the line segment AG, Y to that of the line segment CT, W to that of the line segment AT, and S to that of the line segment CG.We thus obtain ten fixed directions: Obviously, the ten unit vectors are ten points on a unit sphere.An idea arises naturally: each of the ten 2-combinations can be associated with one of the ten unit vectors.In detail, we have To obtain the spatial curve of a DNA sequence, we move a unit length in the direction that the above assignment dictates.Taking sequence segment ATGGTGCACCTGACTCCTGATCTGGTA as an example, we inspect it by stepping two nucleotides at a time.Starting from the origin O " p0, 0, 0q, we move in the direction dictated by the first dinucleotide AT, r W , and arrive at P 1 , the first point of the 3-D curve.From this point, we move in the direction dictated by the second dinucleotide TG, r K , and arrive at the second point P 2 .From here we move in the direction dictated by the third dinucleotide GG, r G , and come to the third point P 3 .Continuation of this process is illustrated in Table 2, and the corresponding 3-D graphical representation is shown in Figure 1.As the characterization of a research object, a good visualization representation should allow us to see a pattern that may be difficult or impossible to see when the same data is presented in its original form.In order to provide a direct insight into the local and global characteristics of a DNA sequence, the proposed 3-D curve can be projected on planes (x,y), (y,z) or (x,z), and thus three different 2-D graphical representations will be yielded.Figure 2 shows the projections of 3-D curves of 18 different DNA sequences listed in Table 3.It is easy to see that, in each projection, the trend of curves of the two non-mammals (Gallus, Muscovy duck) is distinguished from that of the mammals.On the other hand, the Primates species are similar to one another, so it is with the curves of bovine, sheep, goat, and mouflon.Also, the curves of rabbit and European hare show their great similarity.In addition, both Figure 2b, the projection on yz-plane, and Figure 2c, the projection on xz-plane, show opossum has relatively low similarity with the remaining mammals, while mouse and rat look similar to each other because both of their curves wind themselves into a mass and need a relatively small space.

Numerical Characterization of DNA Sequences
The graphical representations not only offer the visual inspection of data, helping in recognizing major differences among DNA sequences, but also provide with the numerical characterization that facilitates quantitative comparisons of DNA sequences.One way to arrive at the numerical characterization of a DNA sequence is to convert its graphical representation into some structural matrices, and use matrix invariants, e.g., the leading eigenvalues, as descriptors of the DNA sequence [8][9][10][11][12][13][14][15][16][17][18]31,32].It is expected that effective invariants will emerge and enable to uniquely characterize the sequences considered.However, the difficulties associated with computing various parameters for very large matrices that are natural for long sequences have restricted the numerical characterizations, for instance, leading eigenvalues and the like [17,24].The search for novel descriptors may be an endless project.The art is in finding useful descriptors, and those that have plausible structural interpretation, at least within the model considered [8].In this section, we bypass the difficulty of calculating the invariants like the leading eigenvalue and propose a novel descriptor to numerically characterize a DNA sequence.
As described above, the pattern, including shape and trend, of curves for the 18 DNA sequences provides useful information in an efficient way.This inspires us to numerically characterize a DNA sequence with an idea of "piecewise function" as below.
For a given 3-D graphical representation with n vertices, by the order in which these vertices appear in the curve, we partition it into K parts, each of which is called a cell.All the cells contain m " Y n K ] vertices except the last one.For the i-th cell, i = 1,2,...,K, the geometric center U i " px i , y i , z i q is viewed as its respective.Then we have where U 0 " p0, 0, 0q.It is not difficult to find that Ñ U i´1 U i reflects a certain "growing trend" of these cells.For convenience, we call Ñ U i´1 U i the trend-point.On the basis of the K trend-points, a DNA sequence can be characterized by a 3K-dimensional vector V tp : In this paper, K is determined by ˇˇs j ˇˇ, N is the cardinality of the dataset Ω considered, and ˇˇs j ˇˇstands for the length of sequence s j P Ω. Taking for example the two non-mammals of the 18 species, the corresponding vectors can be calculated as

Results and Discussion
In this section, we will illustrate the use of the proposed cell-based descriptor V tp of a DNA sequence.For any two sequences S a and S b , suppose their descriptor vectors are a " pa 1 , a 2 , ¨¨¨, a 3k q and b " pb 1 , b 2 , ¨¨¨, b 3k q, respectively.Then, their similarity can be examined by the following Euclidean distance.Clearly, the smaller the Euclidean distance is, the more similar the two DNA sequences are.
Firstly, we give a comparison for CDS (Coding DNA Sequence) of β-globin gene of 18 species listed in Table 3.The lengths of the 18 sequences are about 434 bp.Thus K is taken to be 4, and each of these sequences is converted into a 12-D vector.According to Equation (7), we calculate the distance between any two of the 18 DNA sequences.Then an 18 ˆ18 real symmetric matrix D 18 is obtained.On the basis of D 18 , a phylogenetic tree (see Figure 3) is constructed using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) program included in MEGA4 [34].Observing Figure 3, we find that the CDS are more similar for Primate group {Gorilla, Chimpanzee, Human, Homo, CebusaPella, LagothrixLagotricha, Lemur}, Cetartiodactyla group {bovine, sheep, goat, mouflon}, Lagomorpha group {Rabbit, European hare}, and Rodentia group {mouse, rat}, respectively.On the other hand, CDS of the two kinds of non-mammals {Gallus, Muscovy duck} are very dissimilar to the mammals because they are grouped into an independent branch.This is analogous to that reported in the literature [8,12,14,31], and the relationship of these species detected by their graphical representations as well.From this result, a conclusion one can draw is that the cell-based descriptors of the new graphical representation may suffice to characterize DNA sequences.In order to further illustrate the effectiveness of our method, we test it by phylogenetic analysis on other three datasets: one consists of mitochondrial cytochrome oxidase subunit I (COI) genes of nine butterflies, another includes S segments of 32 hantaviruses (HVs), and the last is composed of 70 complete mitogenomes (mitochondrial genomes).For convenience, we denote the three datasets by COI, HV and mitogenome, respectively.In the COI dataset (see Table 4), which is taken from Yang et al. [12], eight belong to the Catopsilia genus and one belongs to Appias genus, which is used as the out-group.The average length of these COI gene sequences is 661 bp, and thus K, the number of cells, is calculated as 4. According to the method mentioned above, a distance matrix is constructed, and then a phylogenetic tree (see Figure 4) is generated.Figure 4 shows that the five pomona sub-species have relatively high similarity with each other, while the two pyranthe sub-species cluster together.In addition, scylla sub-species is situated at an independent branch, whereas the Appias lyncida stays outside of all the Catopsilia.This result is consistent with that reported in [12,35].The hantavirus (HV), which is named for the Hantan River area in South Korea, is a relatively newly discovered RNA virus in the family Bunyaviridae.This kind of virus normally infects rodents and does not cause disease in these hosts.Humans may be infected with HV, and some HV strains could cause severe, sometimes fatal, diseases in humans, such as HFRS (hantavirus hemorrhagic fever with renal syndrome) and HPS (hantavirus pulmonary syndrome).The later occurred in North and South America, while the former mainly in Eurasia [12,36].In Eastern Asia, particularly in China and Korea, the viruses that cause HFRS mainly include Hantaan (HTN) and Seoul (SEO) viruses, while Puumala (PUU) virus is found in Western Europe, Russia and northeastern China.The HV dataset analyzed in this paper includes 32 HV sequences.Phlebovirus (PV) is another genus of the family Bunyaviridae.Here, two PV strains KF297911 and KF297914 are used as the out-group.The name, accession number, type, and region of the 34 sequences are described in Table 5.The lengths of these sequences are in the range of 1.30-1.88kbp.Thus K is calculated as 5, and each of the 34 viruses is converted into a 15-D vector.The phylogenetic tree constructed by our method is shown in Figure 5.  From Figure 5, we find that the two PV strains form an independent branch, which can be distinguished easily from the HV strains, while the 32 HVs are grouped into three separate branches: the strains belonging to PUUV are clearly clustered together, the strains belonging to SEOV appear to cluster together, and so do the ones belonging to HTNV.A closer look at the subtree of HTNV, all CGRn strains whose host is Rattus norvegicus tend to cluster together, so it is with the CGHu strains whose host is Homo sapiens.In addition, all the four CGAa strains whose host is Apodemus agrarius are grouped closely.Needless to say, the phylogeny is not only closely related to the isolated regions, but also has certain relationship with the host.This result is similar to that reported in [12,37].
The mitogenome dataset comprises 70 complete mitochondrial genomes of Eukaryota.The name, accession number, and genome length are listed in Table 6.Among them, two species (Argopecten irradians irradians and Argopecten purpuratus) belong to family Pectinidae are used as the out-group.Four species belong to the Order Caudata under the Class Amphibia, while four species belong to the Order Anura under the same Class.The remaining belongs to the Class Actinopterygii.The average length of the 70 genome sequences is about 16817 bp.Thus, K is calculated as 6, and each of these genome sequences is converted into an 18-D vector.The phylogenetic tree constructed by our method is shown in Figure 6.It is easy to see from Figure 6 that the two Pectinidae species stay outside of the others, while the four Hynobiidae species and four Ranidae species form an independent branch.In the subtree of the Class Actinopterygii, the 60 genomes are separated into six groups: group 1 corresponds to genus Anguilla under family Anguillidae; group 2 includes genera Bangana and Acrossocheilus under family Cyprinidae; group 3 includes genera Brachymystax and Hucho under family Salmonidae; group 4 is genus Alepocephalus under family Alepocephalidae; group 5 is the family of Clupeidae; group 6 includes genera Trichiurus, Amphiprion and Apolemichthys under Acanthomorphata.This result agrees well with the established taxonomic groups.In addition, we make a comparison for the 70 genome sequences by using ClustalX2.1 [38], and the corresponding tree is shown in Figure 7. Observing Figure 7, we find that the tree includes four branches: the outside is the Argopecten branch, the following is Babina, then Batrachuperus, and the subtree consisting of the other 60 species.A closer look at the subtree shows that Trichiurus is distinguished from the remaining, which seems to be a disappointing phenomenon in the evolutionary sense.

Concluding Remarks
By means of a regular tetrahedron whose center is at the origin, we associate the ten 2-combinations of multiset t8¨A, 8¨G, 8¨C, 8¨Tu with ten unit vectors (points on a unit sphere), and then a novel 3-D graphical representation of a DNA sequence is proposed.Moreover, we partition the graph into K cells, and then a 3K-dimensional cell-based vector is used to numerically characterize a DNA sequence.The proposed method is tested by phylogenetic analysis on four datasets.In comparison with other methods, our approach does not depend on multiple sequence alignment, and avoids the complex calculation as in the calculation of invariants for higher order matrices.Nevertheless, K, the number of cells, is dataset specific, which may restrict our approach.We will make efforts in our future work to find a possible formula for K that is independent of the dataset.

Figure 1 .
Figure 1.3-D graphical representation of the sequence ATGGTGCACCTGACTCCTGATCTGGTA.

Figure 2 .
Figure 2. (a) The projection on the xy-plane of 3-D curves of 18 DNA sequences; (b) The projection on the yz-plane of 3-D curves of 18 DNA sequences; (c) The projection on the xz-plane of 3-D curves of 18 DNA sequences.

Figure 3 .
Figure 3.The relationship tree of 18 species.

Figure 4 .
Figure 4.The relationship tree of nine COI (cytochrome oxidase subunit I) gene sequences.

Figure 5 .
Figure 5.The relationship tree of 34 viruses.

Figure 6 .
Figure 6.The tree of 70 genome sequences constructed with the current method.

Figure 7 .
Figure 7.The tree of 70 genome sequences constructed with multiple alignment.

Table 3 .
The CDS (Coding DNA Sequence) of β-globin gene of 18 species.

Table 4 .
The COI (cytochrome oxidase subunit I) genes of nine butterflies.

Table 5 .
Sequence information of S segment of hantavirus.