The matrix method of representation, analysis and classification of long genetic sequences

The article is devoted to a matrix method for the comparative analysis of long nucleotide sequences, in which each sequence is represented by three binary sequences. The method uses biochemical attributes of nucleotides and the possibility of representing every whole set of n-mers as one of the members of a Kronecker family of genetic matrices. With this method, a long nucleotide sequence can be visually represented as an individual fractal-like mosaic or another regular mosaic of binary type. In contrast to natural nucleotide sequences, artificial random sequences give non-regular patterns. Examples of binary mosaics of long nucleotide sequences are shown, including cases of human chromosomes and penicillins. An interpretation of the binary representations of nucleotide sequences from the point of view of the Gray code is also tested. Possible reasons for the genetic meaning of Kronecker multiplication of matrices are analyzed. The results obtained are discussed.

1. Introduction
2. Matrix representations of whole sets of n-plets (or n-mers)
3. The description of the matrix method for long nucleotide sequences
4. Long random sequences
5. Kronecker multiplication, fractal lattices and the problem of coding an organism on different stages of its ontogenesis
6. Patterns of human chromosomes
7. Patterns of penicillin
8. About 3D-representations
9. Gray code and binary representations of long nucleotide sequences
10. Some concluding remarks

Introduction.
The article is devoted to a matrix method, proposed by I. Stepanyan, for studying long genetic sequences on the basis of the approaches of "matrix genetics" from the works [Petoukhov, 2001, 2008, 2011, 2012a,b, 2013a; Petoukhov, He, 2010]. Matrix genetics studies matrix representations of natural ensembles of molecular-genetic elements in order to reveal hidden regularities in cooperative genetic structures and to model inherited biological phenomena, whose features must agree with the structural organization of the genetic code if they are to be transferred along a chain of generations.
The notion of "a long nucleotide sequence" usually means a sequence with more than 50000 nucleotides [Prahbu, 1993]. To obtain more representative visual patterns from the described method, longer nucleotide sequences are preferable.
Known scientific methods for studying nucleotide sequences usually concentrate on those fragments (n-mers, or n-plets) that exist inside the sequences. In contrast to them, our method concentrates on those fragments that are missing in the nucleotide sequences. In other words, the described method investigates a deficit of different types of n-plets (or n-mers) in nucleotide sequences. The authors suppose that this method can be useful for the comparative analysis and classification of long genetic sequences and also for a deeper understanding of genetic phenomena.
It should also be emphasized that this method introduces into molecular genetics and bioinformatics the important theme and notion of binary fractals, which are already known in mathematics, physics, informatics and technology. The theme of binary fractals can serve as a new useful bridge between biological and non-biological fields of science and technology.

Matrix representations of whole sets of n-plets (or n-mers)
The genetic code system is based on sets, or alphabets, of n-plets (or n-mers):
• the set of $4^1 = 4$ monoplets (A, C, G, T/U);
• the set of $4^2 = 16$ duplets (AA, AC, AG, AT, …);
• the set of $4^3 = 64$ triplets (AAA, AAC, ACA, ACG, ACT, …);
• etc.
Each whole set of $4^n$ n-plets coincides with the whole set of $4^n$ entries of a $(2^n \times 2^n)$-matrix that belongs to the Kronecker family of genetic matrices $[A\ G;\ C\ T]^{(n)}$, where $(n)$ means the Kronecker (or tensor) power. Figure 1 shows the first three members of this Kronecker family, for n = 1, 2, 3. Figure 1 also shows that, inside such a matrix $[A\ G;\ C\ T]^{(n)}$, each n-plet has its individual binary coordinates (or the corresponding coordinates in decimal notation) due to biochemical attributes of n-plets; this requires a special explanation, which is given in the text that follows.
Figure 1. The first three members of the Kronecker family of genetic matrices $[A\ G;\ C\ T]^{(n)}$, for n = 1, 2, 3. Each row and each column has its individual binary numeration due to genetic sub-alphabets (see the explanation in the text). Correspondingly, each n-plet, which is located at a row-column crossing, has two binary coordinates in such a matrix. The decimal equivalents of these binary numbers are shown in red.
The four nitrogenous bases, adenine A, guanine G, cytosine C and thymine T (or uracil U in RNA), represent specific poly-nuclear constructions with special biochemical properties. The set of these four constructions is not absolutely heterogeneous: it bears a substantial symmetric system of distinctive-uniting attributes (or, more precisely, pairs of "attribute-antiattribute"). This system of pairs of opposite attributes divides the genetic four-letter alphabet into pairs of letters in all three possible ways; the letters of each such pair are equivalent to each other with respect to one of these attributes or its absence.
The system of such attributes divides the genetic four-letter alphabet into three different pairings of letters, in which the letters of each pair are equivalent from the viewpoint of one of these attributes or its absence:
1) C=T and A=G (according to the binary-opposite attributes "pyrimidine" or "non-pyrimidine", that is, purine);
2) A=C and G=T (according to the binary-opposite attributes "keto" or "amino" [Karlin, Ost, Blaisdell, 1989]);
3) C=G and A=T (according to the attributes: three or two hydrogen bonds are materialized in these complementary pairs).
The possibility of such a division of the genetic alphabet into three binary sub-alphabets is known from the work [Karlin, Ost, Blaisdell, 1989]. We will utilize these known sub-alphabets by means of the following approach in the field of matrix genetics. We will attach the appropriate binary symbol "0" or "1" to each of the genetic letters from the viewpoint of each of these sub-alphabets. Then we will use these binary symbols for the binary numbering of the columns and rows of the genetic matrices of the Kronecker family.
Let us mark the mentioned three kinds of binary-opposite attributes by numbers N = 1, 2, 3 and ascribe to each of the four genetic letters the symbol "$0_N$" (or the symbol "$1_N$") in the case of presence (or, correspondingly, absence) of the attribute number N in this letter. As a result we obtain the following representation of the genetic four-letter alphabet in the system of its three "binary sub-alphabets to attributes" (Figure 2).
The table on Figure 2 shows that, on the basis of each kind of the attributes, each of the letters A, C, G, T/U possesses three "faces", or meanings, in the three binary sub-alphabets. On the basis of each kind of the attributes, the genetic four-letter alphabet is reduced to a two-letter alphabet. For example, on the basis of the first kind of binary-opposite attributes we have (instead of the four-letter alphabet) the alphabet of the two letters $0_1$ and $1_1$, which one can name "the binary sub-alphabet to the first kind of the binary attributes".
Accordingly, any genetic message, as a sequence of the four letters C, A, G, T, consists of three parallel and different binary texts, that is, three different sequences of zeros and ones (such binary sequences are used for the storage and transfer of information in computers). Each of these parallel binary texts, which are based on objective biochemical attributes, can provide its own genetic function in organisms.
In view of these three binary sub-alphabets, any nucleotide sequence can be represented as three binary sequences. For example, the sequence ATGGC... is represented as:
• 10110... (in accordance with the first sub-alphabet; its decimal equivalent can be located on the "X" axis of a Cartesian system of coordinates);
• 01110... (in accordance with the second sub-alphabet; its decimal equivalent can be located on the "Y" axis of a Cartesian system of coordinates);
• 11000... (in accordance with the third sub-alphabet; its decimal equivalent can be located on the "Z" axis of a Cartesian system of coordinates).
For an unambiguous determination of the nucleotide sequence it is sufficient to know its binary representations in any two of the three sub-alphabets [Petoukhov, 2001, 2008, 2010]. In particular, in this example of the sequence ATGGC..., to get its third binary representation 11000... (in accordance with the third sub-alphabet) it is enough to add its two other representations 10110... and 01110... (obtained in accordance with the first two sub-alphabets) by means of modulo-2 addition.
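The three readings of ATGGC and the modulo-2 relation between them can be checked with a short Python sketch. The 0/1 assignments here are not copied from Figure 2 (which is not reproduced in this text); they are inferred from the ATGGC example itself, and all names (`SUB1`, `to_bits`, `DECODE`) are illustrative.

```python
# Binary sub-alphabets inferred from the ATGGC example in the text
# (assumption: 0 marks pyrimidine, amino, and three hydrogen bonds).
SUB1 = {"A": 1, "G": 1, "C": 0, "T": 0}  # purine = 1, pyrimidine = 0
SUB2 = {"A": 0, "C": 0, "G": 1, "T": 1}  # amino = 0, keto = 1
SUB3 = {"C": 0, "G": 0, "A": 1, "T": 1}  # three H-bonds = 0, two = 1


def to_bits(seq, table):
    """Read a nucleotide string as a list of bits in one sub-alphabet."""
    return [table[base] for base in seq]


def bits_to_str(bits):
    """Join a list of bits into a string such as '10110'."""
    return "".join(str(b) for b in bits)


seq = "ATGGC"
x_bits = to_bits(seq, SUB1)
y_bits = to_bits(seq, SUB2)
z_bits = to_bits(seq, SUB3)

# The third reading is the modulo-2 sum (XOR) of the first two.
xor_bits = [a ^ b for a, b in zip(x_bits, y_bits)]

# Two readings are enough to restore the letters: each pair of bits
# (first sub-alphabet, second sub-alphabet) names exactly one nucleotide.
DECODE = {(SUB1[b], SUB2[b]): b for b in "ACGT"}
restored = "".join(DECODE[pair] for pair in zip(x_bits, y_bits))

print(bits_to_str(x_bits), bits_to_str(y_bits), bits_to_str(z_bits), restored)
```

The dictionary `DECODE` has four distinct keys, which is why any two of the three binary texts determine the nucleotide sequence.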
In genetic matrices of the Kronecker family (see Figure 1), each row has its individual binary number, which is connected with the fact that all n-plets inside this row have an identical binary representation from the point of view of the first sub-alphabet on Figure 2. On Figure 1, the third column has the binary numeration 010 because each of its triplets (AGA, AGC, ATA, ATC, CGA, CGC, CTA, CTC) is a sequence "amino-keto-amino", which corresponds to the binary number 010 from the point of view of the second sub-alphabet on Figure 2. Respectively, each n-plet, which is located in an appropriate genetic matrix at a "column-row" crossing, obtains its individual 2-dimensional coordinates on the basis of the binary numeration of its column and row. For example, the triplet AGC, which is located at the crossing of the mentioned column and the row with binary numeration 110 (Figure 1), obtains its individual binary coordinates (010, 110), or (2, 6) in decimal notation.
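The row and column numeration described above can be reproduced with a minimal sketch that builds the third Kronecker power of the symbolic matrix [A G; C T], treating multiplication of letters as concatenation. The helper names (`kron_power`, `number`) and the 0/1 tables are assumptions made for illustration; the tables are inferred from the examples in the text, not copied from Figure 2.

```python
# Sub-alphabets inferred from the examples in the text (assumption).
SUB1 = {"A": 1, "G": 1, "C": 0, "T": 0}  # numbers the rows
SUB2 = {"A": 0, "C": 0, "G": 1, "T": 1}  # numbers the columns

KERNEL = [["A", "G"], ["C", "T"]]


def kron_power(kernel, n):
    """n-th Kronecker power of a matrix of strings (product = concatenation)."""
    result = kernel
    for _ in range(n - 1):
        result = [
            [a + b for a in row_a for b in row_b]
            for row_a in result
            for row_b in kernel
        ]
    return result


def number(nplet, table):
    """Decimal value of the binary reading of an n-plet in one sub-alphabet."""
    return int("".join(str(table[base]) for base in nplet), 2)


triplets = kron_power(KERNEL, 3)          # 8 x 8 matrix of the 64 triplets
third_column = [row[2] for row in triplets]

# Binary coordinates (column, row) of AGC, written in decimal notation.
agc_coords = (number("AGC", SUB2), number("AGC", SUB1))

print(third_column)
print(agc_coords)
```

Every triplet of one row shares the same reading in the first sub-alphabet, and every triplet of one column shares the same reading in the second, which is what makes the pair of numbers usable as matrix coordinates.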
Any long nucleotide sequence can be divided into equal pieces of arbitrary length, and the binary record of each of these fragments can be read in decimal notation. Any long nucleotide sequence is then represented in the form of three different sequences of decimal numbers, and for its unique identification it is sufficient to know its decimal representation in any two sub-alphabets.
If one divides a long nucleotide sequence into equal fragments of length n (n-mers or n-plets), then each of these fragments is defined by its two binary representations (from the points of view of two sub-alphabets) or by their decimal equivalents. For example, the 5-mer ATGGC is represented as 10110 (in accordance with the first sub-alphabet) and as 01110 (in accordance with the second sub-alphabet); the corresponding decimal values are 22 and 14. In this way, the 5-mer ATGGC can be represented not only as the cell with coordinates (22, 14) inside the genomatrix $[A\ G;\ C\ T]^{(5)}$ but also as the point with decimal coordinates (22, 14) in the orthogonal Cartesian system of coordinates (x, y). Taking into account the chosen connection (Figure 2) between each sub-alphabet and one of the axes X, Y, Z of the Cartesian system of coordinates, the following correspondence exists between Kronecker families of genomatrices and the 2-dimensional planes (x, y), (x, z) and (y, z) of the Cartesian system:
• the plane (x, y) corresponds to the matrices $[A\ G;\ C\ T]^{(n)}$, whose rows and columns are binary numerated from the point of view of the first sub-alphabet and the second sub-alphabet respectively;
• the plane (x, z) corresponds to the matrices $[G\ A;\ C\ T]^{(n)}$, whose rows and columns are binary numerated from the point of view of the first sub-alphabet and the third sub-alphabet respectively;
• the plane (y, z) corresponds to the matrices $[G\ T;\ C\ A]^{(n)}$, whose rows and columns are binary numerated from the point of view of the second sub-alphabet and the third sub-alphabet respectively.
Taking into account this 2-dimensional representation of each n-plet, one can introduce the notion of the Euclidean distance R between any pair of n-plets $V(a_1, b_1)$ and $W(a_2, b_2)$:
$$R = \left[(a_2 - a_1)^2 + (b_2 - b_1)^2\right]^{0.5}.$$
Distances of other types can also be introduced.
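As an illustration of the decimal coordinates and of the distance R, the following sketch computes the (x, y, z) coordinates of an n-plet and the Euclidean distance between two n-plets on the plane (x, y). The function names `coords` and `distance` are hypothetical, and the sub-alphabet tables are again inferred from the ATGGC example.

```python
import math

# Sub-alphabets inferred from the ATGGC example in the text (assumption).
SUB1 = {"A": 1, "G": 1, "C": 0, "T": 0}  # X axis
SUB2 = {"A": 0, "C": 0, "G": 1, "T": 1}  # Y axis
SUB3 = {"C": 0, "G": 0, "A": 1, "T": 1}  # Z axis


def coords(nplet):
    """Decimal coordinates (x, y, z) of an n-plet in the three sub-alphabets."""
    return tuple(
        int("".join(str(table[base]) for base in nplet), 2)
        for table in (SUB1, SUB2, SUB3)
    )


def distance(v, w):
    """Euclidean distance R between two points (a1, b1) and (a2, b2)."""
    return math.hypot(w[0] - v[0], w[1] - v[1])


x, y, z = coords("ATGGC")
x2, y2, _ = coords("ATGGA")

print((x, y, z))
print(distance((x, y), (x2, y2)))
```

For ATGGC the z coordinate equals the bitwise XOR of x and y, in agreement with the modulo-2 relation between the three sub-alphabets.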
The method described below uses many variants of the division of a nucleotide sequence into fragments of equal length (n-plets). Each whole set of n-plets, which contains $4^n$ members, is located inside one of the matrices of a Kronecker family such as $[A\ G;\ C\ T]^{(n)}$. Correspondingly, this method is closely connected with the Kronecker multiplication of matrices, which is widely used in mathematics, informatics, physics, etc. and which is one of the main mathematical operations in the field of matrix genetics [Petoukhov, 2008-2013; Petoukhov, He, 2010]. Kronecker multiplication of matrices is used when one needs to pass from spaces of smaller dimension to associated spaces of higher dimension. If one uses the mathematical language of vector spaces for modeling the ontogenetic complication of a living organism, it is natural to apply the idea of a gradual transition from spaces of low dimension to spaces of higher dimension. Such a gradual transition is described by a series of Kronecker multiplications of matrices.

The description of the matrix method for long nucleotide sequences
In the general case the proposed method includes the following algorithmic steps:
1. Any long nucleotide sequence, which contains K nucleotides, is divided into equal fragments of length n (n-plets or n-mers), where n takes different values: n = 1, 2, 3, …, K; as a result, a set of different symbolic representations of this sequence as a chain of n-plets appears;
2. Each n-plet in each of these representations of the sequence is transformed into three kinds of n-bit binary numbers by reading it from the point of view of the three sub-alphabets (Figure 2). Each of these binary numbers is transformed into its decimal equivalent. As a result, a set of different decimal representations of the initial symbolic sequence appears in the form of three kinds of sequences of decimal numbers, used respectively as positive integer coordinates on the Cartesian axes X, Y, Z (or for the numeration of rows and columns of the appropriate genetic matrices).
As a result of these algorithmic steps, different black-and-white mosaics arise as representations of any long nucleotide sequence in different cases of its division into n-plets. Figure 3 shows examples of fractal-like and other visual patterns that have been obtained with the described method for some long nucleotide sequences.
Figure 3. Examples of visual patterns that have been obtained with the described method for different nucleotide sequences (see the explanation in the text). Two symbols are shown to the right of each pattern to indicate which of the sub-alphabets from Figure 2 were used to construct the pattern.
The numbered patterns on Figure 3 correspond to the following sequences:
1. Homo sapiens contactin associated protein-like 2 (CNTNAP2), RefSeqGene on chromosome 7 (n=63) (http://www.ncbi.nlm.nih.gov/nuccore/163954933)
2. Homo sapiens contactin associated protein-like 2 (CNTNAP2), RefSeqGene on chromosome 7 (n=63) (http://www.ncbi.nlm.nih.gov/nuccore/163954933)
3. Sorangium cellulosum So0157-2, complete genome (n=63) (http://www.ncbi.nlm.nih.gov/nuccore/CP003969.1)
4. Burkholderia multivorans ATCC 17616 genomic DNA, complete genome, chromosome 2 (n=63) (http://www.ncbi.nlm.nih.gov/nuccore/AP009386.1)
Such a mosaic pattern shows the phenomenology of the "presence-and-absence" of different n-plets. One should note that the division of a long nucleotide sequence according to only a single variant of its equal fragmentation (for example, a division into 16-plets) does not give an unambiguous definition of this sequence: such a separate division represents the sequence as a set of fragments but does not reflect their order in the sequence (any permutation of these fragments gives a new sequence with the same set of n-plets). To get an unambiguous definition of the sequence, one should take into consideration all (or many) possible variants of its equal partition (n = 1, 2, 3, …). In practice, for many tasks of the comparative analysis and classification of different long nucleotide sequences it is enough to consider some chosen variants of fragmentation of these sequences, for example, the variants with n = 16, 32, 64.
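The algorithmic steps and the remark on permutations can be sketched in Python as follows. This is only an illustrative reading of the method, not the authors' program: the names are hypothetical, the sub-alphabet tables are inferred from the ATGGC example, and two details are assumptions of this sketch (an incomplete tail fragment is dropped, and fragments containing symbols other than A, C, G, T are skipped).

```python
# Sub-alphabets inferred from the ATGGC example in the text (assumption).
SUB1 = {"A": 1, "G": 1, "C": 0, "T": 0}
SUB2 = {"A": 0, "C": 0, "G": 1, "T": 1}
ALPHABET = set("ACGT")


def number(nplet, table):
    """Decimal value of the binary reading of an n-plet in one sub-alphabet."""
    return int("".join(str(table[base]) for base in nplet), 2)


def split_into_nplets(seq, n):
    """Step 1: consecutive fragments of length n (incomplete tail dropped)."""
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, n)]


def occupied_cells(seq, n, table_a=SUB1, table_b=SUB2):
    """Step 2: set of decimal coordinates of the n-plets present in seq."""
    cells = set()
    for frag in split_into_nplets(seq, n):
        if set(frag) <= ALPHABET:  # skip fragments with N or other symbols
            cells.add((number(frag, table_a), number(frag, table_b)))
    return cells


def render(cells, n):
    """Text mosaic: '#' = n-plet present, '.' = missing; origin bottom-left."""
    size = 2 ** n
    return [
        "".join("#" if (x, y) in cells else "." for x in range(size))
        for y in reversed(range(size))
    ]


cells = occupied_cells("ATGGCATTGACC", 2)
print("\n".join(render(cells, 2)))

# Permuting the 2-plets of a sequence leaves the n = 2 mosaic unchanged,
# while another fragmentation (n = 3) distinguishes the two sequences.
same_for_n2 = occupied_cells("ATGGCA", 2) == occupied_cells("GGATCA", 2)
same_for_n3 = occupied_cells("ATGGCA", 3) == occupied_cells("GGATCA", 3)
print(same_for_n2, same_for_n3)
```

Storing only the set of occupied cells keeps the memory use proportional to the number of fragments, which matters for large n, where the full grid of $4^n$ cells cannot be allocated; `render` is therefore meant only for small n.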
Another possible way of obtaining an unambiguous representation of a long nucleotide sequence for a division with a certain value of n (for example, n = 8) is connected with the construction of additional visual patterns that reflect the order of n-plets in the sequence. Figure 4 shows two examples of such mosaic patterns, for Homo sapiens chromosome 22 genomic scaffold and for Arabidopsis thaliana mitochondrion, in the case of their representation as sets of 16-mers. On these mosaics, white places correspond to the positions, on the corresponding 2-dimensional plane, of those 16-mers that are missing in such representations of the sequences. The mosaic pattern depends on the concrete choice of two of the sub-alphabets from Figure 2. Figure 4 shows two mosaic patterns on the 2-dimensional Cartesian planes (x, y) and (y, z), which are identical to the black-and-white mosaics of the genetic matrices $[A\ G;\ C\ T]^{(16)}$ and $[G\ T;\ C\ A]^{(16)}$ respectively, where cells with existing 16-plets of the sequence are shown in black and cells with missing 16-plets are shown in white.
Figure 4. Left side: the visual pattern of the nucleotide sequence Homo sapiens chromosome 22 genomic scaffold, which has 648059 nucleotides (http://www.ncbi.nlm.nih.gov/nuccore/NW_004078110.1?report=genbank) and which is divided into a sequence of 16-mers; these 16-mers are transformed into 16-bit binary numbers on the basis of the first sub-alphabet and of the second sub-alphabet (Figure 2); then their decimal equivalents are plotted on the axes "x" and "y" respectively. Right side: the visual pattern of the nucleotide sequence Arabidopsis thaliana mitochondrion, which has 366924 nucleotides (http://www.ncbi.nlm.nih.gov/nuccore/NC_001284.2) and which is divided into a sequence of 16-mers; these 16-mers are transformed into 16-bit binary numbers on the basis of the second sub-alphabet and of the third sub-alphabet (Figure 2); then their decimal equivalents are plotted on the axes "y" and "z" respectively.
Binary representations of n-mers are expressed in the form of n-bit binary numbers, of which there are $2^n$ kinds.
For example, the set of 3-bit binary numbers contains $2^3 = 8$ members: 000, 001, 010, 011, 100, 101, 110, 111 (their equivalents in decimal notation are 0, 1, 2, 3, 4, 5, 6, 7). The decimal equivalent of the biggest member of a set of n-bit binary numbers is equal to $2^n - 1$. Such sets of n-bit binary numbers are named "dyadic groups" (see details in [Petoukhov, 2013]).
The most interesting application of this matrix method is to long nucleotide sequences that are divided into relatively long n-mers (n = 8, 9, 10, …). The reasons for this are the following (see Figure 6):
• a long nucleotide sequence that is divided into relatively short n-mers (n = 1, 2, 3, 4) usually contains all possible kinds of such short n-mers; correspondingly, its visual pattern is trivial because it contains all possible points with positive integer coordinates (x, y) inside the appropriate numeric range;
• a long nucleotide sequence that is divided into relatively long n-mers (n = 8, 9, 10, …) usually generates a regular non-trivial mosaic of a fractal-like or other character. This was detected with a special computer program in the course of initial investigations of different long nucleotide sequences by means of the described method.
Fractal patterns obtained by means of the described matrix method sometimes resemble the fractal patterns of long nucleotide sequences and amino acid sequences that were previously obtained by means of the known method "Chaos Game Representation" (CGR-method) in the works [Jeffrey, 1990; Petoukhov, He, 2010; etc.], though the two methods are quite different in their algorithmic essence. In particular, the CGR-method represents nucleotide sequences or other long sequences by means of the four numbers 0, 1, 2, 3 and not by means of the binary numbers 0, 1. In addition, our new method seems to be simpler for biologists to understand and use.
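The difference between short and long n-mers noted above can be related to a simple counting bound: a sequence of K nucleotides yields at most K // n fragments, while the mosaic has $4^n$ cells. The sketch below computes this upper bound on the filled fraction of the mosaic; it says nothing about which cells are filled, and the function name `max_fill_fraction` is hypothetical.

```python
def max_fill_fraction(k, n):
    """Upper bound on the share of the 4**n cells that K nucleotides can fill."""
    cells = 4 ** n
    fragments = k // n  # number of complete n-mers in the sequence
    return min(cells, fragments) / cells


K = 100000
for n in (1, 2, 3, 4, 8, 16):
    print(n, K // n, 4 ** n, max_fill_fraction(K, n))
```

For n = 4 the bound equals 1, so a full (trivial) mosaic is possible, whereas for n = 8 at most 12500 of the 65536 cells can be occupied, so white places must appear whatever the sequence is.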

Long random sequences
What kinds of visual patterns are produced by the described method for long random sequences of nucleotides? To answer this question, different random sequences were generated by a computer program. Their study revealed that the corresponding visual patterns have a non-regular character, in contrast to the cases of real genomes. Figure 7 shows examples of visual patterns for a random sequence with 100000 nucleotides divided into n-plets with n = 8, 16, 28 (this sequence is available at the website pentagramon.com for possible additional testing). Each of the visual patterns of this random sequence for the two other 2-dimensional planes, (x, z) and (y, z), has a similar non-regular character.
Figure 7. Examples of visual patterns of a random nucleotide sequence with 100000 nucleotides (pentagramon.com) in cases of its division with n = 8, 16, 28.
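A comparable experiment can be sketched as follows. The text does not specify how the random sequences were generated, so this sketch assumes independent, equiprobable nucleotides drawn with Python's standard `random` module; the seed, the helper names and the sub-alphabet tables (inferred from the ATGGC example) are likewise assumptions.

```python
import random

# Sub-alphabets inferred from the ATGGC example in the text (assumption).
SUB1 = {"A": 1, "G": 1, "C": 0, "T": 0}
SUB2 = {"A": 0, "C": 0, "G": 1, "T": 1}


def number(nplet, table):
    """Decimal value of the binary reading of an n-plet in one sub-alphabet."""
    return int("".join(str(table[base]) for base in nplet), 2)


def occupied_cells(seq, n):
    """Coordinates (x, y) of the n-plets present in seq (tail dropped)."""
    frags = (seq[i:i + n] for i in range(0, len(seq) - n + 1, n))
    return {(number(f, SUB1), number(f, SUB2)) for f in frags}


rng = random.Random(0)  # fixed seed, only so that the sketch is reproducible
random_seq = "".join(rng.choice("ACGT") for _ in range(100000))

for n in (8, 16, 28):
    cells = occupied_cells(random_seq, n)
    print(n, len(cells), "occupied cells out of", 4 ** n)
```

The resulting sets of points can be passed to any plotting tool to compare them visually with the patterns of real sequences.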

Kronecker multiplication, fractal lattices and the problem of coding an organism on different stages of its ontogenesis
The previous Sections have shown that the described method gives very different types of visual patterns for random nucleotide sequences (where non-regular patterns arise, as on Figure 7) and for real nucleotide sequences (where fractal-like patterns arise frequently, as on Figures 3-6). The authors note that in many cases these fractal-like patterns of long nucleotide sequences resemble fractal lattices, which are automatically generated for matrices of Kronecker families. This needs to be explained in more detail.
Let us take a square $(k \times k)$-matrix M whose entries are equal only to 0 or 1. Any integer Kronecker power (n) of this matrix generates a new $(k^n \times k^n)$-matrix $M^{(n)}$ with a fractal location of the entries 0 and 1 inside it (Figures 8, 9). These fractal mosaics inside such matrices of Kronecker families are called "fractal lattices". The theme "Kronecker multiplication and fractal lattices" is accurately described in the book [Gazale, 1999, Section X]. Such fractal lattices (Figure 8) are generated due to the general definition of Kronecker multiplication of matrices as a special mathematical operation.
Figure 8. Fractal lattices of the matrices $M^{(2)}$ and $M^{(3)}$, which are a $(16 \times 16)$-matrix and a $(64 \times 64)$-matrix respectively. Here black color corresponds to matrix cells with entries 1 and white color corresponds to cells with entries 0.
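How a fractal lattice appears under Kronecker exponentiation can be seen in a minimal pure-Python sketch. The (2*2) kernel used here is a hypothetical example chosen for brevity; it is not the matrix M of Figure 8.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(m, n):
    """n-th Kronecker power M^(n) of a square matrix M, for n >= 1."""
    result = m
    for _ in range(n - 1):
        result = kron(result, m)
    return result


M = [[1, 1],
     [1, 0]]  # hypothetical 0/1 kernel, used only for illustration

M3 = kron_power(M, 3)  # an (8 x 8) matrix
for row in M3:
    print("".join("#" if value else "." for value in row))
```

Each entry 1 of the kernel is replaced by a copy of the previous lattice and each entry 0 by an empty block, so the pattern repeats itself at every scale; for this kernel the entry of $M^{(n)}$ at row i and column j equals 1 exactly when the binary records of i and j share no common 1.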
One should note that, in many cases, significant features of the fractal-like patterns of real nucleotide sequences can be simulated by means of fractal lattices of matrices of a Kronecker family, if the matrix kernel of the Kronecker family is adequately chosen. For example, let us take the pattern (from Figure 4) of the nucleotide sequence Homo sapiens chromosome 22 genomic scaffold, which has 648059 nucleotides (http://www.ncbi.nlm.nih.gov/nuccore/NW_004078110.1?report=genbank) and which is divided into a sequence of 16-mers. If this pattern is covered by a uniform $(8 \times 8)$-grid, 8 cells of this grid will be almost white, in contrast to the remaining 56 cells (Figure 9, upper level, left side). In such a case this black-and-white $(8 \times 8)$-mosaic is identical to the mosaic of the genetic $(8 \times 8)$-matrix $[A\ G;\ C\ T]^{(3)}$ of 64 triplets in which those 8 triplets are missing that are located in this matrix on the same places and that are marked in red on Figure 9. The corresponding numeric $(8 \times 8)$-matrix S, with entries 0 on the places of these 8 cells and entries 1 elsewhere, is
$$S = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}$$
Kronecker exponentiation of the matrix S generates the matrices $S^{(2)}$, $S^{(3)}$, …, whose visual patterns demonstrate the corresponding fractal lattices, one of which, for the matrix $S^{(2)}$, is shown on Figure 9 (bottom level, right side). The numeric matrix $S^{(16)}$ contains the whole set of 16-plets with an appropriate fractal lattice, which resembles the visual pattern of the real nucleotide sequence Homo sapiens chromosome 22 genomic scaffold on Figure 4. One should note that the visual pattern of this real sequence contains more white places than the matrix $S^{(16)}$ because many additional 16-plets are absent, since the sequence has a finite length of 648059 nucleotides.
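The density of such a lattice follows directly from the definition of the Kronecker product: the number of entries 1 in a product of 0/1 matrices is the product of the numbers of entries 1 in the factors. As a worked note (the symbol $N_1$ for the number of entries 1 is introduced here only for illustration), with 56 entries 1 among the 64 entries of S:

```latex
\[
N_1\!\left(S^{(n)}\right) = 56^{\,n},
\qquad
\frac{N_1\!\left(S^{(n)}\right)}{64^{\,n}} = \left(\frac{7}{8}\right)^{n},
\qquad
\text{e.g. } N_1\!\left(S^{(2)}\right) = 3136 \text{ of } 4096 \text{ entries}.
\]
```

The black part of the lattice therefore becomes sparser with every Kronecker multiplication, while its arrangement remains self-similar.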
Fractal-like lattices in the visual patterns of long nucleotide sequences testify in favor of the significance of Kronecker multiplication for the structuring of these genetic sequences. This is not an isolated fact about the genetic significance of Kronecker multiplication. Previously we have obtained other evidence of the biological significance of Kronecker multiplication of matrices in the phenomenology of natural ensembles of molecular-genetic alphabets [Petoukhov, 2008, 2012a,b, 2013a; Petoukhov, He, 2010] and also in the structure of Punnett squares in Mendelian genetics, in connection with the Mendelian law of independent inheritance of traits [Petoukhov, 2011].
What possible reasons for the genetic significance of Kronecker multiplication can be named? One of the main tasks of the genetic encoding system of an organism is to encode a combined ensemble of biostructures that becomes more and more complex in the course of ontogenesis. The ontogenesis of an organism is a process in which the organism receives a great number of new degrees of freedom.
From the mathematical point of view, this means that the dimension n of the internal space of the organism increases step by step in the course of ontogenesis. Encoding the embryonic body and encoding the adult organism that has grown from it are obviously very different things. The mathematics of genetic coding should correspond to this ontogenetic development of the organism with its increasing degrees of freedom, which means an increasing dimension of its internal vector space. (A similar conception of the increasing dimension of internal spaces can also be applied to the phylogenetic increase in the complexity of organisms in biological evolution.) Respectively, such mathematics of multidimensional spaces should model the corresponding facts of the ontogenetic and phylogenetic increase in the complexity of organisms. This aspect seems to be one of the main reasons for the biological significance of Kronecker multiplication. The increasing complexity and evolution of the genetic system itself can take place on similar principles: they are also associated with the increasing dimension of the corresponding genetic spaces, which can be simulated by means of Kronecker multiplication. We suppose that the connection of the genetic system with Kronecker multiplication reflects the ability of this system to encode inherited structures inside the associated multi-dimensional internal spaces of developing organisms. The authors suppose that it is impossible to understand the system of genetic coding deeply enough without taking into account the fundamental task of genetic encoding inside an individual set of multidimensional spaces, which accompanies the ontogenetic development of an organism.

Patterns of human chromosomes
What kinds of binary mosaics are generated by the described method for all 23 pairs of human chromosomes, the data for which can be taken from the website ftp://ftp.ncbi.nih.gov//genomes/H_sapiens/April_14_2003/? The results of our testing show that they are represented by binary mosaics of similar types. Figure 10 shows mosaics of the first 15000000 (fifteen million) nucleotides of the following sequences in the case of their division into 63-plets: Homo sapiens chromosomes X and Y together with Homo sapiens chromosome 1 (they have 152634166, 50961097 and 245203898 nucleotides respectively).