On State Occupancies, First Passage Times and Duration in Non-Homogeneous Semi-Markov Chains

Abstract: Semi-Markov processes generalize the Markov chain framework by allowing general sojourn time distributions. They are widely known for offering enhanced accuracy in modeling stochastic phenomena. The aim of this paper is to provide closed analytic forms for three types of probabilities which describe attributes of considerable research interest in semi-Markov modeling: (a) the number of transitions to a state through time (Occupancy), (b) the number of transitions or the amount of time required to observe the first passage to a state (First passage time) and (c) the number of transitions or the amount of time required after a state is entered before the first real transition is made to another state (Duration). The time non-homogeneous recursive relations of the above probabilities are developed and a description of the corresponding geometric transforms is produced. By applying appropriate properties, the closed analytic forms of the above probabilities are provided. Finally, data from human DNA sequences are used to illustrate the theoretical results of the paper.


Introduction
Human populations can be divided into categories (states or classes) according to some of their basic characteristics, such as place of residence, social class or rank in a hierarchy system. People usually move from one category to another in a probabilistic manner, and a person's history consists of a sequence of sojourn times in the various categories and a set of transitions that have taken place. These are the basic parameters that construct a semi-Markov chain (SMC), according to which a mathematical model can be developed for the study of those systems [1,2]. These systems do not necessarily have to include humans; instead, they can describe any system characterized by historical observations, such as sojourn times in states and transitions from one category to another. If, for the study of a population system, we rely on a Markov chain, we assume that the probability of transition from one category to another does not depend on the length of stay. Nonetheless, it is in some cases desirable to include this time dependence in the process, since it provides additional useful information. In this case, the transitions of such a system are not adequately described by a typical Markov chain, and semi-Markov models are introduced as the stochastic tools that provide a more rigorous framework accommodating a greater variety of applied probability models [3][4][5]. Applications of semi-Markov processes include manpower planning, credit risk, word sequencing and DNA analysis [6][7][8][9][10][11][12][13][14].
In addition to semi-Markov processes, the non-homogeneous semi-Markov system (NHSMS) was defined, introducing a broader class of stochastic models [15,16] that provide a more general framework to describe the complex semantics of the system involved. Semi-Markov systems, which deploy a number of Markov chains evolving in parallel, are mostly applied in manpower planning, where the most important issues pertain to the evolution, control and asymptotic behavior [17][18][19]. In the last two decades, there has been an extended body of literature regarding the theory and results about NHMS [20][21][22][23][24][25][26][27][28][29]. The dynamic characteristics of semi-Markov systems influence the number of times the chain occupies a state, how long it takes to leave a state and the probability of first passage to a state. Therefore, in order to accompany the basic parameters of the semi-Markov chain and to enhance the modeling framework, additional attributes of critical interest are the occupancy, first passage time and duration probabilities, which are described as follows.

1. Occupancy probabilities. These probabilities describe the distribution of the random variables that define the number of times the SMC has visited a specific state during an arbitrary time interval.

2. First passage time probabilities. These are the probabilities that describe the transition from a state to a different state for the first time. The properties of the first passage time probabilities have been investigated for Markov processes and some specific types of semi-Markov processes [30][31][32][33][34][35]. Details for the first passage time probabilities have also been presented for various stochastic processes [36].

3. Duration probabilities. These probabilities describe the distribution of random variables that define the time needed for the SMC to transfer to a different state.
DNA sequences are usually studied using probabilistic models, as nucleotide appearances are inter-correlated, and attempts to model them with Markov models have been reported [10,37]. One of the earliest studies applied a Markov model on the nucleotide alphabet {A, C, G, T} to estimate the transition probability matrix and the number of doublets and triplets [38]. Several statistics have been proposed to test the dependency order of the sequence, i.e., the Markov order, such as the phi-divergent statistics and conditional mutual information [39][40][41]. More recent advances include hidden Markov models that are able to model different regions of DNA sequences [42]. Word occurrences are also of interest in DNA analysis [43]. Previous studies have examined the distribution, moments and properties of successive word occurrences [44,45]. Papadopoulou has provided some examples of semi-Markov models for biological sequences [46]. Furthermore, algorithmic applications for estimating the first passage time probabilities in genomic sequences have been reported [47].
The aim of this study is to provide insight on the actual mechanism of the recursive relations of the probabilities mentioned above. Section 2 presents the basic parameters of a SMC, the interval transition probabilities and the entrance probabilities. Section 3 presents the main results of the paper, that is, the closed analytic solutions for the occupancy, duration and first passage time probabilities. The final section applies these theoretical results to human genome DNA strands. For the first illustration, the aim is to find the corresponding probabilities between nucleotide words and their symmetric complements by using the analytic form of the first passage time probabilities. Finally, for the second illustration, the frequency of the dinucleotide GC is examined for two distinct DNA sequences, using the occupancy probabilities.

Basic Framework
We consider the semi-Markov chain $\{X_t\}_{t \ge 1}$ with state space $S = \{1, 2, \ldots, N\}$ as a discrete stochastic process in which the successive states are governed by the transition probability matrix and the sojourn time in each state is described by a random variable conditioned on the current state and the next state to be entered. Thus, at the transition times, the process behaves as a Markov process. We call this Markovian process the embedded process. Let $p_{ij}(t)$ denote the probability that the SMC, having entered state $i$ at its last transition at time $t$, moves to state $j$ at its next transition. The transition probabilities satisfy the same conditions as those of a Markov process, that is, $p_{ij}(t) \ge 0$, $\forall i, j \in S$, and $\sum_{j=1}^{N} p_{ij}(t) = 1$, $\forall i \in S$. When the process enters state $i$ at time $t$, we assume that this state determines the next state $j$, which is selected according to the transition probabilities. However, after the next state $j$ is selected and before the transition from state $i$ to state $j$ is made, the chain holds in state $i$ for time $\tau_{ij}$. The sojourn time $\tau_{ij}$ is a positive random variable with probability mass function $h_{ij}(\cdot)$, which is called the sojourn time distribution for the transition from state $i$ to state $j$. Thus, $\mathrm{Prob}[\tau_{ij} = m] = h_{ij}(m)$, for $m = 1, 2, \ldots$ and $i, j \in S$. We assume that the mean values of the sojourn time distributions are finite and that $h_{ij}(0) = 0$. In matrix notation, the basic parameters of the semi-Markov chain are the sequence of transition matrices $\{P(t)\}_{t=0}^{\infty}$ and the sequence of sojourn time matrices $\{H(m)\}_{m=1}^{\infty}$. The probabilities of the waiting times $w_i(t, m)$ are defined as follows:
$$w_i(t, m) = \mathrm{Prob}[\tau_i = m \mid \text{the SMC entered state } i \text{ at time } t] = \sum_{j=1}^{N} p_{ij}(t)\, h_{ij}(m),$$
where $\tau_i$ is the holding time of the SMC in state $i$. The core matrix of the SMC connects the transition probabilities and the sojourn times and it is defined as follows:
$$C(t, m) = P(t) \bullet H(m), \qquad c_{ij}(t, m) = p_{ij}(t)\, h_{ij}(m).$$
The operator $\bullet$ denotes the element-wise product of matrices (Hadamard product).
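As a minimal sketch of these definitions, the snippet below builds the core matrix, the waiting time probabilities and their survival function for a hypothetical two-state chain. The matrices `P(t)` and `H`, the bound `M_MAX` and all function names are illustrative assumptions rather than quantities from the paper, and the waiting time is taken to be $w_i(t,m) = \sum_j p_{ij}(t) h_{ij}(m)$.

```python
def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

M_MAX = 3  # assumption: sojourn times take values in {1, 2, 3}

def P(t):
    """Illustrative time-dependent transition matrix; each row sums to 1."""
    a = 0.5 + 0.3 / (t + 1)
    return [[1.0 - a, a], [0.4, 0.6]]

# Illustrative sojourn time matrices: H[m][i][j] = h_ij(m), summing to 1 over m.
H = {1: [[0.5, 0.2], [0.6, 0.1]],
     2: [[0.3, 0.3], [0.3, 0.4]],
     3: [[0.2, 0.5], [0.1, 0.5]]}

def core(t, m):
    """Core matrix C(t, m) as the Hadamard product of P(t) and H(m)."""
    if m not in H:  # outside the assumed support of the sojourn times
        return [[0.0, 0.0], [0.0, 0.0]]
    return hadamard(P(t), H[m])

def waiting(t, m):
    """w_i(t, m) = sum_j p_ij(t) h_ij(m), one entry per state i."""
    return [sum(row) for row in core(t, m)]

def survival(t, n):
    """Survival function: sum of w_i(t, k) over k > n, one entry per state i."""
    tails = [waiting(t, k) for k in range(n + 1, M_MAX + 1)]
    return [sum(col) for col in zip(*tails)] if tails else [0.0, 0.0]

print(core(0, 1))
print(waiting(0, 1))
print(survival(0, 1))
```

Because every row of `P(t)` and every sojourn distribution sums to one, `survival(t, 0)` returns a vector of ones up to rounding, which is a quick sanity check on hand-built inputs.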
Using the core matrix, we define $q_{ij}(k|t, n)$, which is the joint probability that the SMC is in state $j$ at time $t + n$ and that it has made $k$ transitions during the time interval $(t, t + n]$, given that the process entered state $i$ at time $t$. In order to calculate the probability $q_{ij}(k|t, n)$, we distinguish two cases. First, suppose that the number of transitions during the interval $(t, t + n]$ is zero. Then the process can be in state $j$ at time $t + n$ only if the states $i$ and $j$ are the same. Second, suppose that the SMC makes its first transition to state $r$ at time $t + m$, $0 < m \le n$. Then, in the time interval $(t, t + m]$, we have one transition to state $r$ and, in the remaining time interval $(t + m, t + n]$, we have the remaining $k - 1$ transitions, with a final transition to state $j$. Thus, the resulting formula is as follows:
$$q_{ij}(k|t, n) = \delta_{ij}\,\delta(k)\,{}^{>}w_i(t, n) + \sum_{r=1}^{N} \sum_{m=1}^{n} c_{ir}(t, m)\, q_{rj}(k - 1|t + m, n - m),$$
where $c_{ir}(t, m) = p_{ir}(t)\, h_{ir}(m)$ is the $(i, r)$ element of the core matrix, ${}^{>}w_i(t, n) = \sum_{k=n+1}^{\infty} w_i(t, k)$ is the survival function of the waiting time, $\delta_{ij}$ is the Kronecker delta, and $\delta(k) = 1$ if $k$ is zero and $\delta(k) = 0$ otherwise. If we are not interested in counting the number of transitions up to the final state $j$, we can deduce the following recursive relationship.
$$q_{ij}(t, n) = \delta_{ij}\,{}^{>}w_i(t, n) + \sum_{r=1}^{N} \sum_{m=1}^{n} c_{ir}(t, m)\, q_{rj}(t + m, n - m).$$
We also define the quantity $e_{ij}(k|t, n)$, which is the probability that the SMC enters state $j$ at time $t + n$ and that the total number of transitions in the time interval $(t, t + n]$ is $k$, given that the SMC entered state $i$ at time $t$. Here, we again distinguish two cases. First, assume that the number of transitions in the time interval $(t, t + n]$ is zero. Then the SMC can enter state $j$ at time $t + n$ only if $n = 0$ and the states $i$ and $j$ are the same, since state $i$ was entered at the initial time. For the second case, suppose that the SMC makes its first transition to state $r$ at time $t + m$, $0 < m \le n$. Then, in the time interval $(t, t + m]$, we have a transition to state $r$ and, in the time interval $(t + m, t + n]$, we have the remaining $k - 1$ transitions, with the final transition to state $j$. These facts result in the following recursive relationship.
$$e_{ij}(k|t, n) = \delta_{ij}\,\delta(k)\,\delta(n) + \sum_{r=1}^{N} \sum_{m=1}^{n} c_{ir}(t, m)\, e_{rj}(k - 1|t + m, n - m),$$
where $c_{ir}(t, m) = p_{ir}(t)\, h_{ir}(m)$ is the $(i, r)$ element of the core matrix. If we are not interested in the number of transitions up to the final state $j$, we can reduce the recursive relationship to the quantities $e_{ij}(t, n)$, which are the probabilities that the SMC enters state $j$ at time $t + n$, given that the SMC entered state $i$ at time $t$. The equation for calculating the probabilities $e_{ij}(t, n)$ is given by the following.
$$e_{ij}(t, n) = \delta_{ij}\,\delta(n) + \sum_{r=1}^{N} \sum_{m=1}^{n} c_{ir}(t, m)\, e_{rj}(t + m, n - m).$$
The interval transition probabilities and entrance probabilities are connected by the following relationship.
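The two recursions and their link can be explored numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: the chain (`p`, `H`, `N`, `M_MAX`) is hypothetical, and the connecting relationship is assumed to be $q_{ij}(t,n) = \sum_{m=0}^{n} e_{ij}(t,m)\,{}^{>}w_j(t+m, n-m)$, i.e. the chain enters $j$ at some time $t+m$ and then holds there for more than $n-m$ time units.

```python
from functools import lru_cache

N, M_MAX = 2, 3  # assumption: two states, sojourn times supported on 1..3

def p(t):
    """Illustrative time-dependent transition matrix P(t); rows sum to 1."""
    a = 0.5 + 0.3 / (t + 1)
    return [[1.0 - a, a], [0.4, 0.6]]

# Illustrative sojourn time matrices: H[m][i][j] = h_ij(m).
H = {1: [[0.5, 0.2], [0.6, 0.1]],
     2: [[0.3, 0.3], [0.3, 0.4]],
     3: [[0.2, 0.5], [0.1, 0.5]]}

def c(i, r, t, m):
    """Core matrix element c_ir(t, m) = p_ir(t) * h_ir(m)."""
    return p(t)[i][r] * H[m][i][r] if 1 <= m <= M_MAX else 0.0

def w_surv(i, t, n):
    """Survival function of the waiting time in state i: sum over m > n."""
    return sum(c(i, r, t, m) for r in range(N) for m in range(n + 1, M_MAX + 1))

@lru_cache(maxsize=None)
def q(i, j, t, n):
    """Interval transition probability q_ij(t, n) from its recursion."""
    stay = w_surv(i, t, n) if i == j else 0.0
    move = sum(c(i, r, t, m) * q(r, j, t + m, n - m)
               for r in range(N) for m in range(1, min(n, M_MAX) + 1))
    return stay + move

@lru_cache(maxsize=None)
def e(i, j, t, n):
    """Entrance probability e_ij(t, n); e_ij(t, 0) is 1 only when i == j."""
    start = 1.0 if (i == j and n == 0) else 0.0
    move = sum(c(i, r, t, m) * e(r, j, t + m, n - m)
               for r in range(N) for m in range(1, min(n, M_MAX) + 1))
    return start + move

def q_from_e(i, j, t, n):
    """Assumed link between entrance and interval transition probabilities."""
    return sum(e(i, j, t, m) * w_surv(j, t + m, n - m) for m in range(n + 1))

for i in range(N):
    print([round(q(i, j, 0, 6), 6) for j in range(N)],
          [round(q_from_e(i, j, 0, 6), 6) for j in range(N)])
```

Memoizing on `(i, j, t, n)` keeps the cost polynomial in the horizon, which matters in the non-homogeneous case because the starting time `t` is part of the state of the recursion.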

First Passage Time
The first passage times provide a measure of how long it takes to reach a given state from another. We can think of first passage times in terms of transitions, of time, or of both. Thus, let $f_{ij}(k|t, n)$ be the probability that $k$ transitions and time $n$ will be required for the first passage from state $i$ to state $j$, given that the SMC entered state $i$ at time $t$. Applying a probabilistic argument, we can provide the following recursive formula.
$$f_{ij}(k|t, n) = \sum_{r \ne j} \sum_{m=1}^{n-1} c_{ir}(t, m)\, f_{rj}(k - 1|t + m, n - m) + \delta(k - 1)\, c_{ij}(t, n), \tag{1}$$
where $c_{ir}(t, m) = p_{ir}(t)\, h_{ir}(m)$ is the $(i, r)$ element of the core matrix. The first term of Equation (1) corresponds to the case where $k > 1$ and the SMC makes a transition to some state $r$ different from $j$ at time $t + m$ and then makes a first passage from $r$ to $j$ in $k - 1$ transitions during the interval $(t + m, t + n]$. The term is summed over all states and holding times that could describe the first transition. The second term corresponds to the case where $k = 1$ and the process moves directly to state $j$ at time $t + n$. If we are not interested in counting the transitions, then the recursive formula of the probabilities $f_{ij}(t, n)$ is provided by the following.
$$f_{ij}(t, n) = \sum_{r \ne j} \sum_{m=1}^{n-1} c_{ir}(t, m)\, f_{rj}(t + m, n - m) + c_{ij}(t, n). \tag{2}$$
Theorem 1. For each non-homogeneous SMC with discrete state space $S = \{1, 2, \ldots, N\}$, a sequence of transition probability matrices $\{P(t)\}_{t=0}^{\infty}$ and a sequence of sojourn time matrices $\{H(m)\}_{m=1}^{\infty}$, the probability matrices of first passage times $F(k|t, n) = \{f_{ij}(k|t, n)\}_{i,j \in S}$ are given by the following relationships:
Proof. Appendix A.1.
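The probabilistic argument behind the first passage recursions can be turned into a short numerical sketch. The chain below (`p`, `H`, `N`, `M_MAX`) is hypothetical, and the code evaluates the recursions directly by memoized recursion; it is not the closed form of Theorem 1.

```python
from functools import lru_cache

N, M_MAX = 2, 3  # assumption: two states, sojourn times supported on 1..3

def p(t):
    """Illustrative time-dependent transition matrix P(t); rows sum to 1."""
    a = 0.5 + 0.3 / (t + 1)
    return [[1.0 - a, a], [0.4, 0.6]]

# Illustrative sojourn time matrices: H[m][i][j] = h_ij(m).
H = {1: [[0.5, 0.2], [0.6, 0.1]],
     2: [[0.3, 0.3], [0.3, 0.4]],
     3: [[0.2, 0.5], [0.1, 0.5]]}

def c(i, r, t, m):
    """Core matrix element c_ir(t, m) = p_ir(t) * h_ir(m)."""
    return p(t)[i][r] * H[m][i][r] if 1 <= m <= M_MAX else 0.0

@lru_cache(maxsize=None)
def f(i, j, k, t, n):
    """First passage from i to j in exactly k transitions and n time units."""
    if k < 1 or k > n:
        return 0.0
    direct = c(i, j, t, n) if k == 1 else 0.0  # move straight to j at t + n
    via = sum(c(i, r, t, m) * f(r, j, k - 1, t + m, n - m)
              for r in range(N) if r != j      # first move avoids j
              for m in range(1, min(n - 1, M_MAX) + 1))
    return direct + via

@lru_cache(maxsize=None)
def f_total(i, j, t, n):
    """First passage from i to j in n time units, transitions not counted."""
    if n < 1:
        return 0.0
    via = sum(c(i, r, t, m) * f_total(r, j, t + m, n - m)
              for r in range(N) if r != j
              for m in range(1, min(n - 1, M_MAX) + 1))
    return c(i, j, t, n) + via

dist = [f_total(0, 1, 0, n) for n in range(1, 11)]
print([round(x, 5) for x in dist])
print(max(range(1, 11), key=lambda n: dist[n - 1]))  # most probable passage time
```

Summing `f(i, j, k, t, n)` over `k` reproduces `f_total(i, j, t, n)`, which gives a convenient consistency check between the two recursions.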

Duration
Transitions of a SMC can be divided into two categories: virtual and real. The first category refers to transitions from a state to itself, while the second category refers to transitions from one state to a different state. Based on these two categories, one can define the duration as the number of transitions or the time required for the SMC to leave the initial state and move to a different state, i.e., for a real rather than a virtual transition to take place for the first time. Therefore, it is of interest to study the duration probability $d_i(k|t, n)$, defined as the probability that the SMC moves for the first time to a state different from the initial one after $n$ time units and $k$ transitions during the interval $(t, t + n]$, given that the process entered state $i$ at time $t$. We note that, out of the total $k$ transitions in the above case, $k - 1$ transitions are virtual and one transition is real. The duration probabilities for $1 \le k \le n$ are provided by the following.
$$d_i(k|t, n) = \sum_{m=1}^{n-1} c_{ii}(t, m)\, d_i(k - 1|t + m, n - m) + \delta(k - 1) \sum_{j \ne i} c_{ij}(t, n),$$
where $c_{ij}(t, m) = p_{ij}(t)\, h_{ij}(m)$ is the $(i, j)$ element of the core matrix. If $k > n$ or $k = 0$, then $d_i(k|t, n) = 0$. The rationale of this relationship can be split into two parts. In the first part, we assume that the SMC makes at least one virtual intermediate transition: it starts from state $i$ at time $t$, holds in state $i$ for $m$ time units and then transfers to state $i$ again. At this point, the associated probability is $d_i(k - 1|t + m, n - m)$. In the second part, we assume that the SMC makes no transition before time $t + n$. Therefore, the chain holds in state $i$ for exactly $n$ time units and then moves to a state $j$ different from $i$. Thus, the duration defined here measures how long it takes to leave a given state.
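The two-part rationale above can be sketched numerically. The chain below (`p`, `H`, `N`, `M_MAX`) is hypothetical and the function `d` is an illustrative direct evaluation of the recursion as described in the text (virtual loops followed by one real transition), not the closed form of Theorem 2.

```python
from functools import lru_cache

N, M_MAX = 2, 3  # assumption: two states, sojourn times supported on 1..3

def p(t):
    """Illustrative time-dependent transition matrix P(t); rows sum to 1."""
    a = 0.5 + 0.3 / (t + 1)
    return [[1.0 - a, a], [0.4, 0.6]]

# Illustrative sojourn time matrices: H[m][i][j] = h_ij(m).
H = {1: [[0.5, 0.2], [0.6, 0.1]],
     2: [[0.3, 0.3], [0.3, 0.4]],
     3: [[0.2, 0.5], [0.1, 0.5]]}

def c(i, r, t, m):
    """Core matrix element c_ir(t, m) = p_ir(t) * h_ir(m)."""
    return p(t)[i][r] * H[m][i][r] if 1 <= m <= M_MAX else 0.0

@lru_cache(maxsize=None)
def d(i, k, t, n):
    """Probability that state i is left after k transitions and n time units."""
    if k < 1 or k > n:
        return 0.0
    # Real transition: hold in i for exactly n units, then move to some j != i.
    leave = sum(c(i, j, t, n) for j in range(N) if j != i) if k == 1 else 0.0
    # Virtual transition i -> i after m units, then k - 1 further transitions.
    loop = sum(c(i, i, t, m) * d(i, k - 1, t + m, n - m)
               for m in range(1, min(n - 1, M_MAX) + 1))
    return leave + loop

# Distribution of the time needed to leave state 0, summed over k.
leave_time = [sum(d(0, k, 0, n) for k in range(1, n + 1)) for n in range(1, 9)]
print([round(x, 5) for x in leave_time])
```

Summing `d(i, k, t, n)` over both `k` and `n` approaches one whenever a real transition out of state `i` is certain, which is a quick check that no probability mass has been lost.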

Theorem 2. For each non-homogeneous SMC with discrete state space $S = \{1, 2, \ldots, N\}$, a sequence of transition probability matrices $\{P(t)\}_{t=0}^{\infty}$ and a sequence of sojourn time matrices $\{H(m)\}_{m=1}^{\infty}$, the duration probability matrices $D(k|t, n) = \mathrm{diag}\{d_i(k|t, n)\}_{i \in S}$ are provided by the following relationships:

Occupancy
We define $v_{ij}(t, n)$ to be the number of times the SMC makes transitions to state $j$ in a time interval of length $n$, given that the SMC entered state $i$ at the initial time $t$. If the initial state is the same as $j$, that is, when $i = j$, then the initial entrance is not counted in $v_{ij}(t, n)$. We call $v_{ij}(t, n)$ the occupancy measure of state $j$ at time $t + n$, given that the SMC entered state $i$ at time $t$. Clearly, $v_{ij}(t, n)$ is a discrete random variable. We denote by $\omega_{ij}(\cdot|t, n)$ its probability mass function, that is, $\omega_{ij}(x|t, n) = \mathrm{Prob}[v_{ij}(t, n) = x]$. The recursive relationship of the occupancy probabilities is given by the following: where $i, j \in S$, $n = 0, 1, \ldots$, and $x = 0, 1, \ldots$.
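One way to make the occupancy recursion concrete is the following sketch. It assumes a recursion of the standard renewal type: either no transition occurs in the interval (contributing only to $x = 0$), or the first transition goes to some state $r \ne j$ and the count is unchanged, or it goes to $j$ and the count increases by one. This assumed form, the hypothetical chain (`p`, `H`, `N`, `M_MAX`) and the function names are illustrative and are not taken from the paper.

```python
from functools import lru_cache

N, M_MAX = 2, 3  # assumption: two states, sojourn times supported on 1..3

def p(t):
    """Illustrative time-dependent transition matrix P(t); rows sum to 1."""
    a = 0.5 + 0.3 / (t + 1)
    return [[1.0 - a, a], [0.4, 0.6]]

# Illustrative sojourn time matrices: H[m][i][j] = h_ij(m).
H = {1: [[0.5, 0.2], [0.6, 0.1]],
     2: [[0.3, 0.3], [0.3, 0.4]],
     3: [[0.2, 0.5], [0.1, 0.5]]}

def c(i, r, t, m):
    """Core matrix element c_ir(t, m) = p_ir(t) * h_ir(m)."""
    return p(t)[i][r] * H[m][i][r] if 1 <= m <= M_MAX else 0.0

def w_surv(i, t, n):
    """Survival function of the waiting time in state i: sum over m > n."""
    return sum(c(i, r, t, m) for r in range(N) for m in range(n + 1, M_MAX + 1))

@lru_cache(maxsize=None)
def omega(i, j, x, t, n):
    """Assumed recursion for Prob[v_ij(t, n) = x]."""
    if x < 0:
        return 0.0
    ms = range(1, min(n, M_MAX) + 1)
    stay = w_surv(i, t, n) if x == 0 else 0.0          # no transition at all
    other = sum(c(i, r, t, m) * omega(r, j, x, t + m, n - m)
                for r in range(N) if r != j for m in ms)  # first move avoids j
    into_j = sum(c(i, j, t, m) * omega(j, j, x - 1, t + m, n - m)
                 for m in ms)                              # first move enters j
    return stay + other + into_j

n = 8
dist = [omega(0, 1, x, 0, n) for x in range(n + 1)]
print([round(v, 5) for v in dist])
```

Since at most `n` entrances can occur in `n` time units, the values for `x = 0, ..., n` form a complete distribution and should sum to one.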

Assumption 1.
In what follows, we assume that the embedded Markov chain is homogeneous, i.e., $P(t) = P$ for each $t$.
Considering the above assumption, one can use the double geometric transform of the occupancy probabilities as follows.
Moreover, from the Equation (4), we can write the double geometric transform of the occupancy probabilities as follows.
In matrix notation, we can use the previous results to obtain the following [3]: The occupancy probabilities are connected with the corresponding homogeneous first passage time probabilities through the following relationship.
Using the double geometric transform, we can present the occupancy probabilities in matrix form according to the geometric transforms of the first passage time probabilities: which could be further simplified by using ${}^{>}f^{g}_{ij}(z) = \dfrac{1 - f^{g}_{ij}(z)}{1 - z}$ (Appendix B.1), resulting in the matrix form given in Appendix B.2.
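The identity for the transform of the survival function can be checked directly. The short derivation below is a sketch under stated assumptions: the geometric transform is taken to be $f^{g}(z) = \sum_{n} f(n) z^{n}$ with $f(0) = 0$ and $|z| < 1$, the survival function is ${}^{>}f(n) = \sum_{k > n} f(k)$, first passage is assumed certain so that $\sum_{n} f(n) = 1$, and the subscripts $ij$ are suppressed.

```latex
\begin{align*}
{}^{>}f^{g}(z)
  &= \sum_{n=0}^{\infty} z^{n} \sum_{k=n+1}^{\infty} f(k)
   = \sum_{k=1}^{\infty} f(k) \sum_{n=0}^{k-1} z^{n} \\
  &= \sum_{k=1}^{\infty} f(k)\, \frac{1 - z^{k}}{1 - z}
   = \frac{1 - f^{g}(z)}{1 - z}.
\end{align*}
```

If first passage were not certain, the numerator would contain $\sum_{k} f(k)$ in place of 1.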
We now provide Theorem 3 and Lemma 1 that will be used to prove the main Theorem 4 of the occupancy probabilities with respect to the core matrix.
Theorem 3. For a SMC with core matrix $C(\cdot)$, we have the following: $\forall i, j \in S$ and $n = 0, 1, 2, \ldots$ Please note that the $(j, r)$ element of $S_i(k, m_k)$ is the probability of moving from state $j$ to state $r$ after $i - 1$ time units and $k$ intermediate transitions during the interval $(t, t + i - 1]$ for every $t$, due to the time-homogeneity assumption.
Proof. Appendix A.3.

Lemma 1. The product $\Omega^{g}(z|n) \bullet I$ is equal to the following:

We now provide Theorem 4, which describes the analytic solutions of the occupancy probabilities. In order to facilitate the presentation and proof of Theorem 4, we begin with some aggregate notation. Let the following be the case:

Theorem 4. For a SMC with core matrix $C(\cdot)$, by adopting the above notations, we have the following: and
Proof. Appendix A.5.

Illustration
In this section, we accompany the theoretical results of the paper with two applications related to DNA sequences. A DNA strand consists of a sequence of the four nucleotides adenine (A), guanine (G), cytosine (C) and thymine (T). We assume that a DNA sequence can be described by a homogeneous discrete SMC $\{X_t\}_{t=0}^{\infty}$ with state space $S = \{w_1, w_2, \ldots, w_N\}$, where each $w_i$, $i = 1, 2, \ldots, N$, is a specific word of length $l$ formed from the letters of the DNA alphabet $\{A, C, G, T\}$, and $t$ denotes the position of the word inside the sequence.

Inverted Repeats
The main focus of the following approach is the appearance of specific words formed from the alphabet A, C, G, T and their symmetric complements (inverted repeats). Inverted repeats are commonly found in eukaryotic genomes [48]. The presence of inverted repeats could form DNA cruciforms that have been shown to play an important role in the regulation of natural processes involving DNA. The cruciform structures are important for various biological processes, including replication, regulation of gene expression and nucleosome structure. They have also been implicated in the development of diseases including cancer, Werner's syndrome and others [49].
For each DNA word $w$, there exists a reversed complement word $w'$. For example, the word $w$ = ACG has the word $w'$ = CGT as an inverted repeat. The main question that we attempt to address by applying the analytic relationships derived earlier is the following: given that the SMC entered the word $w$ at the initial position, we want to estimate the probability that the reversed complement word $w'$ appears for the first time after a certain number of letters $n$. We define the distance, $d$, between two words as the number of letters between the first letter of the initial word that has appeared and the first letter of the following word that subsequently appears. For the sake of simplicity, we consider only the scenario where $d > l$. The DNA sequence used for this illustration is the first chromosome of the human genome, consisting of 248,956,422 base pairs, which is publicly available from the website of the National Center for Biotechnology Information (NCBI) [50].
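The reversed complement of a word is obtained by complementing each nucleotide (A with T, C with G) and reversing the result. A minimal sketch, with the function name `reverse_complement` chosen here only for illustration:

```python
# Translation table for the standard base pairing A<->T and C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reversed complement of a DNA word over {A, C, G, T}."""
    return word.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ACG"))
```

Applying the function twice returns the original word, so a word and its reversed complement always form a symmetric pair.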
For the first illustration, three words of length $l = 7$ were chosen that have been previously shown to exhibit different distances between them and their inverted complements [51]. The words were $w_1$ = GGCTCAC, $w_2$ = ATATATG and $w_3$ = CCACAAT. For each word, the state space of the SMC consisted of the word and its reversed complement, i.e., $S = \{w_i, w_i'\}$. First, the basic parameters of the SMC were estimated, namely the transition probability matrix and the sequence of sojourn time matrices. The sojourn time was defined as the distance, i.e., the number of nucleotides that occur between each word and its inverted repeat. The transition matrix and the distribution of the sojourn times were estimated using the empirical estimators. The sequence of the core matrices was calculated as the Hadamard product of the transition matrix with the sequence of the sojourn time matrices. For each word $w \in S$, the first passage time probability between the word $w$ and its reversed complement $w'$ was calculated according to the proposed analytic relationship (Theorem 1). For a maximum distance of $n = 1000$, the highest first passage time probabilities of the three words and their inverted repeats, along with the corresponding distances, are illustrated in Figure 1. Concretely, the first passage time probabilities were calculated for human Chromosome 1, aiming to estimate the most probable distances between words and their symmetrical complements. As presented in Figure 1, we obtained $\arg\max_n f_{w_1 w_1'}(n) = 210$, $\arg\max_n f_{w_2 w_2'}(n) = 10$ and $\arg\max_n f_{w_3 w_3'}(n) = 132$, which approximate the values 210, 15 and 133 reported for the three words, respectively, in a previous study [51]. This highlights the fact that specific DNA words exhibit different behaviors and that the distance between them and their inverted repeats demonstrates variability.
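The estimation step can be illustrated on a toy sequence. The sketch below is an assumption-laden illustration, not the authors' pipeline: it treats every occurrence of a word (overlapping or not) as an entrance into the corresponding state, measures the sojourn time as the difference between successive start positions, does not apply the $d > l$ restriction, and uses plain relative frequencies as the empirical estimators. The names `find_all` and `estimate_smc` and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict

def find_all(seq, word):
    """Start positions of every (possibly overlapping) occurrence of word."""
    positions, start = [], seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions

def estimate_smc(seq, words):
    """Empirical transition matrix and sojourn time distributions."""
    # Each event is (start position, state index); sort events along the strand.
    events = sorted((pos, s) for s, w in enumerate(words)
                    for pos in find_all(seq, w))
    counts = Counter()              # counts[i, j]: transitions from i to j
    sojourn = defaultdict(Counter)  # sojourn[i, j][m]: those with distance m
    for (p0, i), (p1, j) in zip(events, events[1:]):
        counts[i, j] += 1
        sojourn[i, j][p1 - p0] += 1
    n = len(words)
    row = [sum(counts[i, j] for j in range(n)) for i in range(n)]
    P = [[counts[i, j] / row[i] if row[i] else 0.0 for j in range(n)]
         for i in range(n)]
    H = {ij: {m: k / counts[ij] for m, k in ms.items()}
         for ij, ms in sojourn.items()}
    return P, H

# Toy example with the word ACG and its reversed complement CGT.
P, H = estimate_smc("ACGTTCGTAACGCGT", ("ACG", "CGT"))
print(P)
print(H)
```

The core matrix entries then follow as `P[i][j] * H[(i, j)].get(m, 0.0)`, matching the Hadamard product construction described above.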

CpG Islands
Usually, in vertebrate DNA sequences, the dinucleotide CG occurs less frequently than expected [52]. For the second illustration, we considered CpG islands, which are genomic regions that contain an elevated number of the dinucleotide CG. The human genome contains approximately 30 thousand CpG islands. The APRT gene is an example of a CpG region and it was used for this analysis [53]. This gene provides instructions for making an enzyme called adenine phosphoribosyltransferase (APRT). APRT contains approximately 2500 nucleotides and it has been shown to include an elevated amount of the dinucleotide GC [54]. We modeled the sequence of this DNA region as a homogeneous SMC with a state space containing all the two-letter words from the DNA alphabet. The transition probability matrix and the sojourn times were estimated using the empirical estimators. The occupancy distribution $\omega_{GC,GC}(x|n)$ for a fixed length of $n = 100$ was calculated using the analytic relationship from Theorem 4 in order to estimate the occupancy distribution of specific words up to a specified sequence length. For comparison, we also applied the model to an intron sequence of the human phosphodiesterase gene (PDEA) [55]. The two sequences are publicly available from the NCBI. The occupancy probabilities are presented in Figure 2 up to length $n = 50$. As expected, the occupancy probabilities for the two sequences indicate that occurrences of the dinucleotide GC are more frequent in the CpG island than in the intron sequence.

Concluding Remarks
In this article, three classes of important probabilities of a semi-Markov process, namely the first passage time, the duration and the occupancy probabilities, were defined and their closed analytic forms were proved by using the basic parameters of the process. The first class, the first passage time probabilities, provides information regarding the distribution of the time elapsed to reach a state from another for the first time, either in terms of transitions or time. The second class, the duration probabilities, provides information about the distribution of the number of virtual transitions taking place before a real transition to a different state occurs. Finally, the third class, the occupancy probabilities, provides insight into the distribution of the number of times the SMC makes transitions to some state in a time interval of a given length. We provided analytic forms for the recursive relations of the aforementioned probabilities and stated these results as propositions and theorems.
The analytical results were accompanied with two illustrations on human genome DNA strands which are often studied using probabilistic modeling and, specifically, Markovian models. Although, in the relevant literature, there exist several algorithmic approaches analyzing the occupancy and appearance of words in DNA sequences, the results of the illustration section strongly suggest that the proposed modeling framework could also be used for the investigation of the structure of genome sequences.
The present study has limitations that motivate further research. For example, additional research effort could aim towards high-order dependencies, since DNA sequences often show long-range correlations. This could result in a more coherent modeling approach. Furthermore, additional parameters could be included in the model, for example the length of the sequence or specific mutations, resulting in more realistic representations of the different structures of the complex genomes of humans and other organisms. Finally, the proposed model could be applied in completely different contexts, such as natural language processing, linguistics, text similarity and anomaly detection, i.e., areas of machine learning that have been among the most popular in data science and stochastic modeling over the last decade.

Acknowledgments:
The authors gratefully acknowledge the comments and suggestions of the three anonymous referees, which improved the content and the presentation of the current paper.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A. Proofs
Appendix A.1. Proof of Theorem 1
The results for (1) and (2) are obvious. For the third part, we used the matrix notation of the first passage time probabilities: with $F(k|t, n) = 0$ if $k > n$ or $k = 0$. For $k = 1$ and $m = m_i$ we have shown the results; the case where $k > 1$ can be proved by induction. Thus, we assume that this result holds for $k - 1$ and we will show that it also holds for each $k \le n$. Here we note that the recursive relationship of the first passage time probabilities could be reformulated as follows.
Using matrix notation, we can express the previous relationship as the following.
The initial conditions are F(k|t, n) = 0 for k > n or k = 0 and F(1|t, n) = C(t, n). By using the following notation: we obtain the following. The results for (1) and (2) are obvious. For the third part, we used induction. By using matrix notation on the recursive relationship, it holds that, for k = 2, we have the following.
Now assume that the relationship holds for $k - 1$, which is the following.
Therefore, the following obtains.
(A2)
Equation (A2) in matrix notation is the following. By applying the geometric transform to the above, we obtain the following: with initial condition $\Omega^{g}(z|0) = I$. Following the methodology of Vassiliou and Papadopoulou (1992) [15], we derive the result of Theorem 3.

Appendix A.4. Proof of Lemma 1
By using the Hadamard product on Theorem 3, we have the following.
By using the following property: we obtain the following: which completes the proof.
Appendix A.5. Proof of Theorem 4
An early version of the proof of Theorem 4 can be found in [56]. We present here all necessary steps of the proof analytically. Using the equations provided by the results of Theorem 3 and substituting $\Omega^{g}(z|n) \bullet I$ with the result found in Lemma 1, we can obtain the analytic relation for the geometric transform $\Omega^{g}(z|n)$, which is as follows:
$$\Omega^{g}(z|n) = (z - 1) \sum_{j=1}^{n-1} A_j \left[ z\,G_{1,n,j} + z \sum_{u=2}^{n-j} \left( (z - 1) M_u + \sum_{k=1}^{u-2} (z - 1)^{k+1} R_u(k, m_k) \right) G_{u,n,j} + Q_{1,n,j} + \sum_{u=2}^{n-j} \left( (z - 1) M_u + \sum_{k=1}^{u-2} (z - 1)^{k+1} R_u(k, m_k) \right) Q_{u,n,j} \right] + \cdots$$
Then, by applying properties of the inverse geometric transform, namely the equation
$$\Omega(x|n) = \frac{1}{x!} \left. \frac{d^{x}}{dz^{x}}\, \Omega^{g}(z|n) \right|_{z=0},$$
and by repeatedly taking the derivatives of $\Omega^{g}(z|n)$ with respect to $z$, we obtain the result of Theorem 4 for $x \ge 1$.
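The inversion step amounts to reading off the coefficient of $z^{x}$ in the transform. The scalar sketch below illustrates this for a polynomial given by its coefficient list; the function names and the example coefficients are illustrative assumptions, and in the proof the same operation is applied entry-wise to the matrix transform.

```python
from math import factorial

def derivative(coeffs):
    """Coefficients of the derivative of sum_k coeffs[k] * z**k."""
    return [k * a for k, a in enumerate(coeffs)][1:]

def inverse_transform(coeffs, x):
    """(1 / x!) times the x-th derivative of the polynomial at z = 0."""
    for _ in range(x):
        coeffs = derivative(coeffs)
    value_at_zero = coeffs[0] if coeffs else 0.0  # constant term after x steps
    return value_at_zero / factorial(x)

# Hypothetical transform g(z) = 0.5 + 0.3 z + 0.2 z^2 of a distribution on {0, 1, 2}.
g = [0.5, 0.3, 0.2]
print([inverse_transform(g, x) for x in range(4)])
```

For a polynomial the result is simply `coeffs[x]`, so the derivative formula and direct coefficient extraction agree.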
Finally, for the special case where x = 0, by substituting z = 0 in expression (A3), we obtain the following: where the following results.