Note on DNA Analysis and Redesigning Using Markov Chain

The paper contains a discussion on mathematical modifying and redesigning DNA with the use of Markov chains. We give a simple mathematical technique for overwriting missing parts of DNA. With a certain probability (without even knowing the function of the missing codon) we can find a synonymous codon, so that there is no frequency change in amino acid sequences of proteins. We use Markov Chain to analyze the dependencies in DNA sequence of the human gene Alpha 1,3-Galactosyltransfe rase 2. We include a theoretical introduction which facilitates the understanding of the paper for non-mathematicians, especially for biologists not familiar with the theory of Markov chains.


Motivation and Methods
Modeling and analyzing DNA sequences using statistical methods have been a challenge for statisticians and biologists for many years. For many years, the most common approach was based on the theory of Markov chains. Well known simple models appeared in the mid-1980s in the papers by B.E. Blaisdell [1] and V. Brendel, J.S. Beckmann, E.N. Trifonov [2]. This was later followed by more advanced models developed to study different biological aspects of DNA (see [3][4][5]) that have been also using Markov chains (both homogeneous and non-homogeneous ones, possibly of higher orders). There are some statistical results showing that first-order Markov chains are not an adequate model for DNA sequences (see, e.g., [6]). Moreover, the latest comparison studies (see [7]) show that in general DNA sequencing by models based on even higher order Markov chains does not fit perfectly. Thus the latest research in DNA sequencing in bioinformatics is focusing on deep learning methods (see [8]). The methods presented in this paper, however, are local rather than statistical. We apply our model to DNA sequences less than 55 base pairs long, which is not enough for statistical methods. Markov chain theory is being applied to modeling music and literature. For example, a random song generated by a Markov chain based on some given piece of music, can achieve a similarity level comparable to this piece (see [9][10][11]). The question arises whether DNA can be handled similarly to a piece of music. Numerous attempts to write an understandable text as the realization of a low-order Markov chain have not proved successful. For high-order Markov chains, however, such a text becomes understandable as it contains complete sentences from the original text (see [12][13][14]). Therefore, the question whether Markov chains of any order are suitable to study DNA sequencing becomes the question about the complexity of the DNA structure. In this paper we show methods for filling in short gaps in DNA sequences. The results obtained by our method are then compared with the original DNA sequence.

Notations
We consider a probability space (Ω, F , P), where Ω = {S 1 , S 2 , . . . , S n } denotes a finite state space, i.e., the set of all possible results of an experiment, F is a σ-field of events, and P : F → [0, 1] denotes a probabilistic measure. By random variable we denote a function Z : Ω → R such that for any a ∈ R a preimage Z −1 ((−∞, a]) is in the σ-field F . In our case a random variable will allow us to assign natural numbers to states from the state space Ω: let E be a single experiment with possible results forming Ω, if in t-th repetition (i.e., in time t) of experiment E we obtain result S j ∈ Ω, then we put Z t = j, where Z t is a random variable with natural values.

Stochastic Matrices
..,n} is called stochastic if all p ij are non-negative and for any i we have ∑ n j=1 p ij = 1 (right stochastic matrix) or for any j we have ∑ n i=1 p ij = 1 (left stochastic matrix). A doubly stochastic matrix is both left stochastic and right stochastic. A vector with non-negative real elements is a stochastic vector if its elements sum to one.

Remark 1.
Rows of right (columns of left) stochastic matrix are row (vertical) stochastic vectors.
Using basic algebra one can prove the following remark.

Remark 2.
The product of two right (two left) stochastic matrices is a right (left) stochastic matrix.
Definition 3. Let A be a square matrix. An eigenvalue of A is a complex number λ such that det(A − λI) = 0 (i.e., λ is a zero with multiplicity k ≥ 1 of the characteristic polynomial p A (λ) = det(A − λI)), where I denotes the identity matrix. The set of all eigenvalues is called spectrum.

Definition 4.
The eigenvector of a of a square matrix A is a column vector v such that Av = λv, where λ is an eigenvalue of A. The left eigenvector of a square matrix A is a column vector w such that w T A = λw T , where λ is an eigenvalue of A. Some authors define the left eigenvector as a row vector w T .

Markov Chains
A Markov chain is a sequence of random variables forming a probabilistic model describing a memoryless type of dependency: the future may depend only on the present and must be independent of the past.
Z t is a random variable defined as follows: if in t-th repetition of experiment E we obtain result S j ∈ Ω, then we put Z t = j ∈ S.

Definition 5.
A sequence of random variables (Z t ) ∞ t=0 with values in a state space S is a Markov chain if for all t ∈ N and all j 0 , j 1 , . . . , j t ∈ S

Remark 3.
In the general case a state space S can be an arbitrary countable subset of N (positive integers). (1) is called Markov property and its right-hand side is called a transition operator i.e., the probability of moving from state j t−1 to state j t in one step.

Remark 4. The Equation
for all i such that P(Z t−1 = i) > 0. Since the total of transition probability from one state to all other states must be equal to one, thus this matrix is a right stochastic matrix.
is independent of t, then we call this Markov chain homogeneous.

Remark 5.
For our convenience we index states by t, but we remind the reader of the following property of homogeneous Markov chains t=0 is homogeneous then there exists a stochastic matrix Π = (p ij ) i,j∈S such that for all t ≥ 1 transition matrix Π(t) = Π, where each value p ij is the probability of moving from state i to state j in one step. Now we are going to consider probabilities of moving from state to state in larger number of steps.
for all i such that P(Z 0 = i) > 0.
Proof. For m = 1 theorem is true because Π (1) = Π. For m ≥ 2 from the law of total probability we obtain Example 1 (Random walk on a complex plane). Identify adenine with −i, cytosine with 1, guanine with −1, and thymine with i. Let (U n ) ∞ n=0 be a sequence of independent random variables such that for any n ≥ 1: where p 1 + p 2 + p 3 + p 4 = 1, and P(U 0 = 0) = 1.
For n ≥ 1 define Z n = U 0 + U 1 + . . . + U n which is a random variable with values in state space of Gaussian integers Z[i] = {a + bi : a, b ∈ Z}. Gaussian integers form a commutative ring and 2-dimensional integer lattice. We show that (Z n ) ∞ n=0 is a Markov sequence. Let j 0 , j 1 , . . . , j n ∈ Z[i] be such that P(Z 0 = j 0 , Z 1 = j 1 , . . . , Z n−1 = j n−1 ) > 0. Then, from definition of Z n , P(Z n = j n |Z 0 = j 0 , Z 1 = j 1 , . . . , Z n−1 = j n−1 ) Observe that variables U n and Z n−1 are also independent, thus One can show that E(|Z n |)-the expected translation distance after n steps is of order √ n, more precise, lim n→∞ E(|Z n |) √ n = π 2 . With our identification of adenine (−i), cytosine (1), guanine (−1), and thymine (i) using Markov chain (Z n ) ∞ n=0 we can consider probability that from n-th to mth place in DNA strand, n < m, number of adenine equals number of thymine and, simultaneously, number of cytosine equals number of guanine. This situation means that our random walk made a loop, that is Z n = Z m (see Figure 1). If one needs to research other pairwise equalities it suffices to change the identification. Denote Z[n 1 , n 2 ] = {Z n : n 1 ≤ n ≤ n 2 }. One can show that for all k ≥ 2 we have P(Z[0, k] ∩ Z[2k, 3k] = ∅) > 0. That means that with positive probability there exists n ∈ [0, k] and m ∈ [2k, 3k] such that Z n = Z m , i.e., we have a loop.

Classification of States and Chains
In this subsection, we will give some mathematical background and also state some well known results, see [7,15].

Definition 9.
A state i is called accessible from state j if there exists n ≥ 0 such that P(Z n = i|Z 0 = j) > 0. If state i is accessible from state j and vice versa we say that states i and j communicate.
Observe that communication is an equivalence relation that divides states into equivalence classes called communicating classes.

Definition 10. A Markov chain is called irreducible if its state space forms a single communicating class.
In other words, in irreducible Markov chain it is possible to get from any state to any state (every two states communicate).

Definition 11.
A state i is called inessential if there exists a state j and n ≥ 1 such that P ij is a probability that we access state j from state i exactly in n steps, F ij is a probability that we ever access state j from state i, P i a trace of the transition matrix in n steps, N i is a random variable that counts how many times state i is accessed and M i is a random variable that counts how many steps are needed to reappear in state i for the first time.

Remark 7.
where E denotes an expected value. Thus P i is an average time a Markov chain is in state i (averagely how many times Markov chain is in state i). If we define Thus µ i is an average number of steps needed to reappear in state i.
We will need the following result.

(a)
A state i is recurrent if and only if P( A state i is recurrent if and only if P i = ∞.
Proof. Note that for all states i, j and natural n we have It follows from the fact that accessing state j from state i after n steps means that we access state j for the first time after exactly m steps (for some m ≤ n) and then after next n − m steps we return to it (perhaps reaching state j a few times on the way). Thus we have Hence we get an inequality Since n is arbitrary, as n tends to infinity we obtain Assume that state i is transient. Then from (6) (6) we get a contradiction. ii from the matrix Π n . Then P i = ∞ for each i ∈ {a, c, g, t}. From Theorem 3 we once again obtain that each state is recurrent.

Definition 15.
A state which is aperiodic, recurrent, and positive recurrent is called ergodic.
Remark 11. In case of our matrix Π all states are ergodic.
For irreducible matrices we have the following property. independent of i, where p j , j ∈ {1, . . . , n}, form a unique solution of the following system of equations ∑ n i=1 p i p ij = p j , j ∈ {1, . . . , n} ∑ n j=1 p j = 1.
A special case of irreducible Markov chain is a regular Markov chain.
Definition 16. A irreducible Markov chain is called regular if there exists k ∈ N such that all entries of the matrix Π k are positive. In other words there exists k ∈ N such that from any state we can reach any state in exactly k steps.
Definition 17. Let Π be a transition matrix of a Markov chain. A stationary probability vector is a stochastic vector (see Definition 1) such that π = πΠ. In other words π is a stochastic eigenvector associated with eigenvalue λ = 1 of matrix Π.
Theorem 5 (see [15]). Let Π (m) = (p a vector π = (π j ) j∈S is a unique stationary probability vector, moreover, where µ j is an average number of steps needed to reappear in state j.
For regular Markov chains we have the following result.
Theorem 6. Let Π be a transition matrix of an irreducible aperiodic Markov chain with finite state space. Then matrix Π (m) converges to a positive stochastic matrix W such that if π is a row of matrix W, then π = πΠ.

Analysis of Alpha 1,3-Galactosyltransferase 2
We show that a time homogeneous Markov chain is an appropriate simple model of Alpha 1,3-Galactosyltransferase 2 (A3GALT2). We use transition matrices as a criterion for identifying similarities in structure of this particular gene. A3GALT2 is a Protein Coding gene (a region of DNA) located in chromosome 1, position 33,306,766, consisting of 14,333 bases [16].
Let Ω = {S 1 , S 2 , S 3 , S 4 }, where S 1 = A, S 2 = C, S 3 = G, S 4 = T, and S = {1, 2, 3, 4} be a corresponding state space. We form a stochastic matrix (7) as follows. For example we would like to know how probable is that after adenine (A) occurs cytosine (C). We count all occurrences of a pair AC in gene A3GALT2 and divide it by number of all occurring pairs which start from A. Number of all such pairs is equal to number of occurrences of A provided that A is not the last nucleotid base in gene A3GALT2.
Note that because the last nucleotide base in gene A3GALT2 is T, in the denominator in last row we have 3370 instead of 3371.

Remark 12.
One can pose a question of biological interpretation of matrix Π. Does the occurrences of nucleotid bases can be used to identify a specific gene or, in general case, to identify an individual?
Because all entries of matrix Π are positive, in the case of gene A3GALT2 a suitable Markov chain is irreducible (see Definition 10) and each d i = 1 for i ∈ S thus our chain is aperiodic (see Definition 14). From Theorem 5 for any i, j ∈ S there exists a limit lim m→∞ p (m) ij = π j > 0, suitable Markov chain is recurrent, π = (π j ) j∈S is a unique stationary probability vector and π j = 1 µ j where µ j is an average number of steps needed to reappear in state j.  All of the above considerations can be repeated for pairs of bases, see Table 2.

Pairs of Bases
Occurence in A3GALT2 Probability

Pairs of Bases
Occurence in A3GALT2

The Markov Process Model of Nucleotide Substitution
We assume that nucleotide substitution is follow a homogeneous Markov process. We take Ω = {S 1 , S 2 , S 3 , S 4 }, where S 1 = A, S 2 = C, S 3 = G, S 4 = T. Let S = {1, 2, 3, 4} be a corresponding state space. Let P(t) = {P µν (t)} is a matrix of transition probabilities in time t. We assume that P (t) = QP(t) where Q = {Q µν } is the rate matrix of the process. Figures 2-5 show four situations in which the sequence starts with adenine, cytosine, guanine, and thymine, respectively. In each case, the probabilities of occurence of a given base stabilize. Finally, they converge to the probabilities forming a stationary probability vector. Note that, as predicted by Corollary 1, in each case stabilization is achieved after about four steps.

Application in DNA Sequencing, Redesigned DNA
The meaning of DNA sequencing is here deemed to cover all methods used to determine the order of nucleotides along a DNA strand. The objective of this section is to present an example of applying Markov chains to complete short fragments of a DNA strand. Markov chains have been applied as mathematical models of real-life processes. Such real-life dynamical systems, examined with Markov-chain method, include queues of passengers arriving at an airport, -currency exchange rates or animal population dynamics.
Markov chains are also applied to build algorithms calculating the PageRank value for a website (see [17]). The website PageRank value reflects the probability that a random internet surfer will land on this webpage upon clicking a link. Markov processes of various orders are used to model DNA sequencing (see also [6]). A Markov process of order m is one for which the probability of any event depends exclusively on the m preceding events. Statistically, DNA does not have the features of a first-order Markov chain. Higherorder models have been proposed for analyzing interrelations within a DNA sequence, see [6]. Statistical tests used in [6] properly determined the order of the Markov chain being tested for sequences of length 2 9 base pairs or higher. Nevertheless, the authors consider it important to present the method described below for local problems, that is for very short DNA sequences (54 base pairs in our example). We believe it is worth examining the results of local DNA completion based on the properties of first-order Markov chains, when the length of a DNA sequence is less than 2 9 base pairs. The method presented below is a simple, local tool for completing short DNA segments. The method has also educational value. Moreover, instead of analyzing a DNA sequence of the five unit nucleotides, we may use first-order Markov chains to analyze codons or, more precisely, amino acids encoded with those codons. It is worth recalling here some basic facts and conventions: - Codon is a sequence of three nucleotides (a triplet) occurring in mRNA, a unit encoding a specific amino acid during protein synthesis; -Proteins are built of 20 different amino acids; - The sequence of amino acids in a protein exactly follows the sequence of the relevant codons in mRNA; -Most amino acids are encoded in several ways (with different codons, which, however, differ from one another usually on the third place in the triplet only); owing to this, certain changes in the genetic information (mutations) do not affect the amino acid sequence; -There are 61 codons encoding amino acids and 3 non-encoding codons (they are STOP codons: UAG, UAA, UGA); all in all: 4 3 various triplets; - The AUG codon, read as the first one in mRNA by a ribosome during protein synthesis is known as the initiation or start codon; -Since a mutation of a single nucleotide changes a single amino acid, the genetic code has to be read as non-overlapping, i.e., any given codon may be followed by any other codon; -To get the form typical of DNA, each U in an mRNA codon should be replaced by T; for instance, TAA is the DNA equivalent of the mRNA codon UAA; - In the case of a sequence of amino acids, understood as resulting from first-order Markov process action, the transition matrix is a square matrix of degree at most 21.
One state is reserved for the three STOP codons, which do not encode amino acids.

Example 2.
Let us consider the human SATB1 gene, which, as research has revealed, is a major growth factor for breast cancer, see [18]. Let us generate a DNA sequence based on SATB1 (this gene is on chromosome 3, locus p23, on the minus strand). The table below sets forth a 54-base-pair-long segment of SATB1, position 18,389,139. Data is sourced from website [19], accessed upon entering human gene SATB1. The relevant state space comprises four nucleotides: Ω 1 = {A, C, G, T}. The corresponding amino acid sequence is: Val Lys Arg Leu Ser Asp Lys Asn Lys Ser Ser Leu STOP Gln Leu Cys Cys STOP.
We give another sequence in Table 3, as a variation of the method consists in examining the sequence of amino acids and not the DNA sequence of base pairs. For the sequence of the DNA segment under this analysis, the state space is: In the application of the method described below, it is important that in both tables the last element occurs at least twice. Assume that in the sequence in Table 4, the TCC codon (corresponding to the amino acid serine (Ser)) is missing. We want to properly complete the following sequence including three adjacent gaps: GTCAAAAGACTCTCCGACAAAAACAAA AGTCTC TAGCAGTTATGTTGTTAG (8) In the other approach, using representation with amino acids (see Table 3), we want to complete the following corresponding sequence including a single gap: Val Lys Arg Leu Ser Asp Lys Asn Lys Ser Leu STOP Gln Leu Cys Cys STOP.
The (extensive) transition matrix corresponding to sequence (8) is: Let us observe that the number of occurrences of G in sequence (8) is 8, as we do not count the last occurrence, because it is not paired. The number of occurrences of A in sequence (8) is 18, as we do not count the last occurrence of A, before the lacking fragment, because this occurrence is not paired. Analogously, the transition matrix corresponding to sequence (9) is: Note that, after rounding, the stationary probability vector of matrix Π 1 is equal to: π 1 ≈ (0.367347, 0.204082, 0.163265, 0.265306).
For the sake of comparison, we below give rounded relative frequencies of nucleotide occurrences, as disclosed in Table 4:  Table 4 are not identical to the respective coordinates of the vector π 1 , given that: the sequence of 54 nucleotides is short, 2.
Using the transition matrix Π 1 , we will run an experiment, described below, which will allow us to complete sequence (8). One can proceed analogously using the matrix Π 2 , which we will not do, given the symmetry of the method. The description of the experiment makes it possible to repeat it with no IT tools.
Prepare four boxes, labeled A, C, G and T. In each box, there are assorted balls labeled A, C, G and T. The numbers of balls of individual colors in box A are in proportion to respective entries of the first row of the matrix Π 1 . Thus there are 9 balls labeled A, 3 balls labeled C, 5 balls labeled G, and 1 ball labeled T, a total of 18 balls. We fill the other boxes (C, G and T) analogously. Now the experiment begins. In sequence (8), there is a gap after A. Therefore, we draw one ball from box A at random. Assume we have drawn T, which can be done with probability 1 18 . In the next step we draw from the box labeled the same way as the most recently drawn ball; in our experiment it is box T. Assume we have drawn C from box T, which can be done with probability 5 13 . Proceeding this way, we now draw a ball from box C. Assume we have again drawn C, which can be done with probability 1 10 . Thus we have generated three consecutive elements of the sequence GTCAAAAGACTCTCCGACAAAAACAAA, namely TCC, and we fill the gap in sequence (8) with this result. We have recovered the original sequence presented in Table 4. The algorithm works as shown in Figure 6.
For the reader's convenience, we present in Figure 7 below a computer program written in the Python 3 language, the source code is also available in GitHub [20].

Conclusions
The method just presented is primarily of educational value. The procedure is described suggestively and, the authors believe, explanatory, which makes it possible for the method to be used in a more general context, also by non-mathematicians. The authors do not imply that the probability of achieving the proper completion of a DNA genome is satisfactory; they only present a tool which may be used for such completion and with which they would like to familiarize the reader. The authors are aware that the contemporary efforts in the area of DNA genome completion are focusing on deep learning rather than on Markov chains even of higher orders, see [8]. This paper proposes an alternative tool that can be explained suggestively and deeply. It is now clear that the deep learning methods lead to more exact completions than the Markov chains methods. Yet, our method allows for understanding of what happens behind the process of proper completion and sequencing. It should therefore be treated as of explanatory and educational value, with a potential for future research. It is worth asking, if the algorithms presented in Example 2 might be used to easily generate test data for more advanced deep learning algorithms (see [21]).

Conflicts of Interest:
The authors declare no conflict of interest.