TEMPORAL AND SPATIALLY HETEROGENEOUS FINITE LENGTH RUNS ANALYSIS

Abstract – Various problems in physics and engineering require the probability distribution function (pdf) of run-lengths in a finite time series. In this study, the properties of runs in infinite sequences are reviewed first, and the run-length pdf of a finite series is then derived on the basis of simple set theory. The paper presents the derivation of the exact run-length pdf in dependent series of finite length. Two different approaches to runs are considered in the derivation: the integration method for infinite series and combinatorial analysis for finite time series. The analytical derivations are evaluated numerically, and the results are presented as the cumulative pdf, expectation, variance and higher order moments as functions of the sample length. Homogeneous run properties based on Bernoulli trials have been used in physical and engineering applications for many decades, but a heterogeneous regional Bernoulli trial probability distribution model has not so far been available for applications and numerical calculations. Herein, a mathematical derivation of the heterogeneous case is presented, which reduces to the classical homogeneous Bernoulli trial case. The paper provides regional probabilistic modeling of the areal coverage of success and failure periods, which is useful for temporal and spatial pattern recognition in spatial risk predictions and parameter assessments. The basis of the methodology is the heterogeneous probabilities of mutually exclusive and independent sub-areal (site) success and failure occurrences.


INTRODUCTION
From the statistical point of view, a run is defined by Mood (1940) as a succession of elements of a similar kind preceded and succeeded by elements of a different kind. The number of elements in such a run is referred to as its length. The theory of runs has been successfully applied to recurrent events by Feller (1957).
Suppose that $x_1, x_2, x_3, \ldots, x_n$ is a sequence of independent random variables and $x_0$ is a preselected value. It is then possible to truncate the given sequence at the $x_0$ level. Such a truncation procedure leads to a new sequence of two basic elements, namely, a success when $x_i - x_0 > 0$ and a failure when $x_i - x_0 \le 0$. These two basic elements are, in fact, Bernoulli trials. In the classical statistics literature, a success run of length s is defined as an uninterrupted sequence of either exactly s or at least s successes. Therefore, a run-length has an integer value. Various statistical properties of runs or their functions can be employed in many engineering problems. For instance, in water resources engineering $x_0$ might represent the demand necessary for water supply to an urban area or the agricultural need for water. In addition, the length of a success run corresponds to a wet spell, whereas a failure run-length is representative of a dry spell. In the subsequent sections of this paper, Mood's basic definition of runs is adopted.
The main purpose of this paper is to present a rigorous analytical methodology for modeling the spatially or temporally heterogeneous probabilities in a given region or during a certain time duration. The formulation in this study reduces exactly to the well-known Binomial pdf, and it enables meteorologists and hydrologists alike to model areal or temporal drought, flood, and wet or dry spell occurrences even for the most complicated sets of heterogeneous occurrence probabilities.

RUNS OF INFINITE SEQUENCES
So far, two different approaches have been followed in studies of run-lengths within the theory of runs, namely, the integration and the combinatorial approaches. The integration approach refers to runs of an infinite population, as employed by [2]. If the sequence of random variables is stationary and ergodic, then any run-length is synonymous with the first run. The relation between the probability mass, $P(s = j)$, of a run-length $s = j$ and the probability distribution function, $P(s \ge j)$, is given as [2]

$$P(s = j) = P(s \ge j) - P(s \ge j + 1) \qquad (1)$$

The r-th order moment about the origin is then

$$E(s^{r}) = \sum_{j=1}^{\infty} j^{r} P(s = j) \qquad (2)$$

In general, for stationary and ergodic stochastic processes, equation (3) expresses the run-length probability in terms of $P_j$, the probability of j successive successes, and $P_{k,j}$, the joint probability of the simultaneous occurrence of j successes followed by k successive failures. The computation of $P_{k,j}$ can be achieved through the multiple integration of the joint probability distribution function (PDF), $f(x_1, x_2, \ldots, x_{k+j})$, of the random variables $x_1, x_2, \ldots, x_{k+j}$, which can be written as

$$P_{k,j} = \underbrace{\int_{x_0}^{\infty} \cdots \int_{x_0}^{\infty}}_{j} \; \underbrace{\int_{-\infty}^{x_0} \cdots \int_{-\infty}^{x_0}}_{k} f(x_1, \ldots, x_{k+j}) \, dx_1 \cdots dx_{k+j} \qquad (4)$$

In the case of dependent random variables, equation (4) has been evaluated in terms of a tetrachoric series expansion [3]. However, for independent and identically distributed random variables with common pdf $f(x)$, equation (4) factorizes into k+j factors, which renders the multiple integration into its simplest form as

$$P_{k,j} = \left[ \int_{x_0}^{\infty} f(x)\,dx \right]^{j} \left[ \int_{-\infty}^{x_0} f(x)\,dx \right]^{k} \qquad (5)$$

or in terms of the success and failure probabilities, which are expressed as

$$p = \int_{x_0}^{\infty} f(x)\,dx, \qquad q = \int_{-\infty}^{x_0} f(x)\,dx = 1 - p$$

respectively. With these two last expressions, equation (5) takes the form

$$P_{k,j} = p^{j} q^{k} \qquad (6)$$

On the other hand, for k = 0,

$$P_{j} = p^{j} \qquad (7)$$

Furthermore, the substitution of equations (6) and (7) into equation (3) leads to

$$P(s \ge j) = p^{j-1} \qquad (8)$$

and, from equation (1), to the probability mass

$$P(s = j) = q\,p^{j-1}, \qquad j = 1, 2, \ldots \qquad (9)$$

By making use of this probability mass, the expected value and the variance of success runs in an infinite sequence can be obtained as [2]

$$E(s) = \frac{1}{q} \qquad (10)$$

$$\mathrm{Var}(s) = \frac{p}{q^{2}} \qquad (11)$$

respectively. It is also possible to extend the above procedure to cover dependent random variables. Various properties of run-lengths for the Markov process have already been derived [4].
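As a rough numerical check of the infinite-sequence results, the sketch below assumes the standard geometric form $P(s = j) = q\,p^{j-1}$ for success runs of independent Bernoulli trials and approximates the mean and variance by a long partial sum. The function names and the cut-off `j_max` are illustrative choices, not part of the original derivation.

```python
def run_length_pmf(j, p):
    """Probability that a success run has length j (j >= 1), assuming
    independent Bernoulli trials with success probability p."""
    q = 1.0 - p
    return q * p ** (j - 1)


def run_length_moments(p, j_max=500):
    """Approximate mean and variance by a partial sum up to j_max
    (assumed large enough that the neglected tail is negligible)."""
    mean = sum(j * run_length_pmf(j, p) for j in range(1, j_max + 1))
    second = sum(j * j * run_length_pmf(j, p) for j in range(1, j_max + 1))
    return mean, second - mean ** 2


mean, var = run_length_moments(0.6)
# For p = 0.6 the closed forms 1/q and p/q**2 give 2.5 and 3.75.
print(mean, var)
```

The partial sum is used instead of the closed forms on purpose, so that the two can be compared independently.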

RUNS OF FINITE SEQUENCES
In the case of finite sequences, runs have been examined by means of combinatorial analysis [5]. Through this analysis it is possible to find the total number of either success or failure runs regardless of their lengths, and the longest run-length of any given type. However, combinatorial analysis fails to yield any statistical moments, such as the expected value of, say, the success run-length in a finite sequence. The following methodology, which is based on the integration approach, has been developed in order to evaluate the probability mass of a run-length in finite sequences.
Let $S_i = \{s = i\}$ be the event denoting the occurrence of a success run of length i in an infinite sequence of Bernoulli trials. It is then possible to verify from equation (9) that the union of all these events is

$$\bigcup_{i=1}^{\infty} S_i = S \qquad (12)$$

where S is the universal set with probability equal to one. Furthermore, all these events are mutually exclusive, that is

$$S_i \cap S_j = \emptyset, \qquad i \ne j \qquad (13)$$

in which $\emptyset$ denotes the empty set with zero probability. However, for a finite sequence of length n, no run-length greater than n can occur. Hence, the sample space is restricted by n. Therefore,

$$\bigcup_{i=1}^{n} S_i = \Omega \subset S \qquad (14)$$

where $\Omega$ represents a subsample space contained in the universal set S. Since the elementary events $S_i$ are mutually exclusive, the following statement can be written for the probability of the subsample, $P(\Omega)$:

$$P(\Omega) = \sum_{i=1}^{n} P(S_i) < 1 \qquad (15)$$

In order to render the subsample space a proper one, over which the probability masses of the elementary events sum to one, it is necessary to reappraise the probabilities of the $S_i$'s as $P(S_i \mid \Omega)$, which transforms equation (15) into an equality by defining a new set denoted as $\Omega'$:

$$P(\Omega') = \sum_{i=1}^{n} P(S_i \mid \Omega) = 1 \qquad (16)$$

where

$$P(S_i \mid \Omega) = \frac{P(S_i \cap \Omega)}{P(\Omega)} \qquad (17)$$

Since $S_i$ is an element of the set $\Omega$, it is known from set theory that $P(S_i \cap \Omega) = P(S_i)$, the substitution of which into equation (17) gives

$$P(S_i \mid \Omega) = \frac{P(S_i)}{P(\Omega)} \qquad (18)$$

which is fundamental in evaluating the probability mass of run-lengths in finite sequences.
In the case of independently and identically distributed random variables, equation (15) can be evaluated by considering equation (9), which yields

$$P(\Omega) = \sum_{i=1}^{n} q\,p^{i-1} = 1 - p^{n} \qquad (19)$$

Furthermore, equation (18) can be written more explicitly as

$$P(s = i \mid n) = \frac{q\,p^{i-1}}{1 - p^{n}}, \qquad i = 1, 2, \ldots, n \qquad (20)$$

where n has been employed instead of $\Omega$, since it is the only parameter that characterizes $\Omega$.
It is convenient to interpret equation (20) as the truncated form of the probability mass function given by equation (9), at a truncation level equal to n. The expected value of the success run-length in a finite sequence with n elements can be found by substituting equation (20) into equation (2) with r = 1, which after some algebra yields

$$E(s \mid n) = \frac{1}{q} - \frac{n\,p^{n}}{1 - p^{n}} \qquad (21)$$

where the second term on the right hand side becomes zero when n goes to infinity; hence equation (21) reduces to equation (10). In a similar manner, the variance of s can be evaluated by finding the second order moment about the origin,

$$E(s^{2} \mid n) = \sum_{i=1}^{n} i^{2}\, \frac{q\,p^{i-1}}{1 - p^{n}} \qquad (22)$$

from which the variance follows as

$$\mathrm{Var}(s \mid n) = E(s^{2} \mid n) - \left[ E(s \mid n) \right]^{2} \qquad (23)$$

which for $n \to \infty$ reduces to equation (11). The calculation of moments of order higher than two is possible but requires very tedious algebraic work. It is perhaps better to find these moments numerically on a digital computer. For instance, the m-th order moment can be evaluated by

$$E(s^{m} \mid n) = \sum_{i=1}^{n} i^{m}\, \frac{q\,p^{i-1}}{1 - p^{n}}$$

All the above formulae are applicable to failure runs provided that p and q in the formulations are interchanged.
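The numerical evaluation of finite-sequence moments mentioned above can be sketched as follows. The code assumes that the finite-sequence probability mass is the geometric mass $q\,p^{i-1}$ renormalized over run-lengths 1 to n, and compares the numerical mean with the closed form $1/q - n p^{n}/(1 - p^{n})$. All function and variable names are hypothetical.

```python
def finite_run_pmf(n, p):
    """P(s = i | n) for i = 1..n: geometric masses q * p**(i-1)
    renormalized so that they sum to one over the finite range."""
    q = 1.0 - p
    weights = [q * p ** (i - 1) for i in range(1, n + 1)]
    total = sum(weights)  # analytically equal to 1 - p**n
    return [w / total for w in weights]


def finite_run_moment(n, p, m):
    """m-th order moment about the origin of the run-length, given n."""
    pmf = finite_run_pmf(n, p)
    return sum((i ** m) * prob for i, prob in enumerate(pmf, start=1))


n, p = 10, 0.6
mean_n = finite_run_moment(n, p, 1)
var_n = finite_run_moment(n, p, 2) - mean_n ** 2
closed_mean = 1.0 / (1.0 - p) - n * p ** n / (1.0 - p ** n)
print(mean_n, closed_mean, var_n)
```

Higher order moments follow by changing `m`; for failure runs, `p` would be replaced by the failure probability.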

RUNS OF DEPENDENT VARIABLES IN A FINITE SEQUENCE
So far, two different approaches have been followed for run-lengths in applications of the theory of runs. These are the integration and the combinatorial approaches. The integration approach refers to runs of an infinite population. If the sequence of random variables is stationary and ergodic, then any run-length is synonymous with the first run. In general, the relation between the probability mass, $P(s = j)$, of a run-length $s = j$ and the probability distribution function (pdf), $P(s \ge j)$, is given as [2]

$$P(s = j) = P(s \ge j) - P(s \ge j + 1) \qquad (24)$$

This gives rise to the following relationship for the r-th order moment about the origin:

$$E(s^{r}) = \sum_{j=1}^{\infty} j^{r} P(s = j) \qquad (25)$$

In general, for stationary and ergodic stochastic processes, Eq. (26) expresses the run-length probability in terms of $P_j$, the probability of j successive successes, and $P_{k,j}$, the joint probability of the simultaneous occurrence of j successes followed by k successive failures. The computation of $P_{k,j}$ can be achieved through the multiple integration of the joint pdf, $f(x_1, x_2, \ldots, x_{k+j})$, of the random variables $x_1, x_2, \ldots, x_{k+j}$, which can be written as

$$P_{k,j} = \underbrace{\int_{x_0}^{\infty} \cdots \int_{x_0}^{\infty}}_{j} \; \underbrace{\int_{-\infty}^{x_0} \cdots \int_{-\infty}^{x_0}}_{k} f(x_1, \ldots, x_{k+j}) \, dx_1 \cdots dx_{k+j} \qquad (27)$$

In the case of dependent random variables, Eq. (27) has been evaluated in terms of a tetrachoric series expansion [3]. Factorization for dependent and identically distributed random variables simplifies Eq. (27) down to (k + j) conditional probability factors, Eq. (28) [4]. Furthermore, each conditional probability in this expression is simplified in terms of the success and failure probabilities, Eq. (29), where p and q are the success and failure probabilities, which are expressed as

$$p = \int_{x_0}^{\infty} f(x)\,dx, \qquad q = \int_{-\infty}^{x_0} f(x)\,dx = 1 - p$$

respectively. With these two last expressions, Eq. (28) takes the form of Eq. (30), and Eq. (31) gives the special case k = 0. Furthermore, the substitution of Eqs. (30) and (31) into Eq. (26) leads to Eq. (32) and, from Eq. (1), to the probability mass

$$P(s = j) = (1 - r)\,r^{j-1}, \qquad j = 1, 2, \ldots \qquad (33)$$

in which r denotes the conditional probability of a success given that the preceding element is also a success. By making use of this probability mass, the expected value and the variance of success runs in an infinite sequence can be obtained as [2]

$$E(s) = \frac{1}{1 - r} \qquad (34)$$

and

$$\mathrm{Var}(s) = \frac{r}{(1 - r)^{2}} \qquad (35)$$

respectively. The above procedure reduces to the case of independent random variables when p is substituted for r.
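The dependent case can be explored with a small simulation. The sketch below assumes first-order (two-state Markov) dependence, with a hypothetical parameter `r` for the probability of a success following a success and `p_after_failure` for a success following a failure; neither name comes from the original text. Under this assumption the success run-lengths should follow a geometric law with parameter r, so the sample mean should lie near 1/(1 - r).

```python
import random


def simulate_success_runs(r, p_after_failure, n_steps, seed=0):
    """Simulate a two-state Markov chain and return the lengths of all
    completed success runs (a trailing unfinished run is discarded)."""
    rng = random.Random(seed)
    runs = []
    current = 0
    prev_success = False  # the chain is started from a failure state
    for _ in range(n_steps):
        prob = r if prev_success else p_after_failure
        success = rng.random() < prob
        if success:
            current += 1
        elif current > 0:
            runs.append(current)  # a failure closes the current run
            current = 0
        prev_success = success
    return runs


runs = simulate_success_runs(r=0.7, p_after_failure=0.4, n_steps=200_000, seed=1)
sample_mean = sum(runs) / len(runs)
share_length_one = runs.count(1) / len(runs)
print(sample_mean, share_length_one)  # compare with 1/(1 - r) and (1 - r)
```

Setting `p_after_failure` equal to `r` recovers independent Bernoulli trials, which mirrors the substitution of p for r.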

COMBINATORIAL METHOD AND FINITE SEQUENCE
In the case of finite sequences, runs are examined by combinatorial analysis [5]. Through this analysis it is possible to find the total number of either success or failure runs regardless of their lengths, and the longest run-length of any given type [6]. However, combinatorial analysis fails to yield any statistical moments, such as the expected value of, say, the success run-length in a finite sequence. The following methodology, which is based on the integration approach, is developed in order to evaluate the probability mass of a run-length in finite sequences, which helps to calculate the statistical moments.
Let $S_i = \{s = i\}$ be the event denoting the occurrence of a success run of length i in an infinite sequence of Bernoulli trials. It is then possible to verify from Eq. (33) that the union of all these events is

$$\bigcup_{i=1}^{\infty} S_i = S \qquad (36)$$

where S is the universal set with probability equal to one. Furthermore, all these events are mutually exclusive, that is

$$S_i \cap S_j = \emptyset, \qquad i \ne j \qquad (37)$$

in which $\emptyset$ denotes the empty set with zero probability. However, for a finite sequence of length n, run-lengths greater than n cannot occur, because the sample space is restricted by n. Therefore,

$$\bigcup_{i=1}^{n} S_i = \Omega \subset S \qquad (38)$$

where $\Omega$ represents a sub-sample space in the universal set S. Since the elementary events $S_i$ are mutually exclusive, the following statement can be written for the probability of the sub-sample, $P(\Omega)$:

$$P(\Omega) = \sum_{i=1}^{n} P(S_i) < 1 \qquad (39)$$

In order to render the sub-sample space a proper one, over which the probabilities of the elementary events sum to one, it is necessary to reappraise the probabilities of the $S_i$'s as $P(S_i \mid \Omega)$, which transforms Eq. (39) into an equality by defining a new set $\Omega'$:

$$P(\Omega') = \sum_{i=1}^{n} P(S_i \mid \Omega) = 1 \qquad (40)$$

where

$$P(S_i \mid \Omega) = \frac{P(S_i \cap \Omega)}{P(\Omega)} \qquad (41)$$

Since $S_i$ is an element of the set $\Omega$, it is known from set theory that $P(S_i \cap \Omega) = P(S_i)$, the substitution of which into Eq. (41) gives

$$P(S_i \mid \Omega) = \frac{P(S_i)}{P(\Omega)} \qquad (42)$$

This is fundamental in evaluating the probability mass of run-lengths in finite sequences.
In the case of identically distributed random variables, Eq. (39) can be evaluated by considering Eq. (33), which yields

$$P(\Omega) = \sum_{i=1}^{n} (1 - r)\,r^{i-1} = 1 - r^{n} \qquad (43)$$

Furthermore, Eq. (42) can be written more explicitly as

$$P(s = i \mid n) = \frac{(1 - r)\,r^{i-1}}{1 - r^{n}}, \qquad i = 1, 2, \ldots, n \qquad (44)$$

where n has been employed instead of $\Omega$, since it is the only parameter that characterizes $\Omega$. It is convenient to interpret Eq. (44) as the truncated form of the probability mass function given by Eq. (33) for a finite sample size n. The graphical representation of Eq. (44) for different sample sizes and r values [4] is given in Figure 1.
Figure 1. Run-length probability variations by sample size

FINITE LENGTH RUN-LENGTH STATISTICAL
The expected value of the success run-length in a finite sequence with n elements can be found by substituting Eq. (44) into Eq. (25) with the moment order set to one, which after some algebra yields

$$E(s \mid n) = \frac{1}{1 - r} - \frac{n\,r^{n}}{1 - r^{n}} \qquad (45)$$

where the second term on the right hand side becomes zero when n goes to infinity; hence Eq. (45) reduces to Eq. (34). The variation of the expected value with the sample size is presented in Figure 2. In a similar manner, the variance of s can be evaluated by finding the second order moment about the origin,

$$E(s^{2} \mid n) = \sum_{i=1}^{n} i^{2}\, P(s = i \mid n) \qquad (46)$$

from which the variance can be estimated as

$$\mathrm{Var}(s \mid n) = E(s^{2} \mid n) - \left[ E(s \mid n) \right]^{2} \qquad (47)$$

which for $n \to \infty$ reduces to Eq. (35). The variation of the variance with the sample size is presented in Figure 3. The calculation of moments of order higher than two is possible but requires very tedious algebraic work. It is perhaps better to find these moments numerically on a digital computer. For instance, the m-th order moment can be evaluated by

$$E(s^{m} \mid n) = \sum_{i=1}^{n} i^{m}\, P(s = i \mid n) \qquad (48)$$

The third and fourth order expansions of this series, with the substitution of Eq. (44) into this last expression, are given in the Appendix. All the above formulae are applicable to failure runs provided that p and q in the formulations are interchanged.

SPATIALLY HETEROGENEOUS RUNS
Natural phenomena evolve temporally as well as spatially, and they are recorded through measuring instruments at a set of points over time. Such a space-time distribution is referred to as a field quantity. Since natural phenomena cannot be predicted with certainty, they are assumed to be random, and hence it is necessary to study a new class of fields, i.e., random fields. Quantitatively, a random field is characterized as $\xi(s, t)$ and called a random function at a point, s, and time, t. The random field concept is a generalization of a stochastic process, for which time is the only scale variable. The expression "the random function of the coordinate (s, t)" must be understood in the sense that at each point (s, t) of four-dimensional space-time, the value $\xi(s, t)$ is a random variable and consequently cannot be predicted exactly. In addition, the values $\xi(s, t)$ are subject to a certain law of probability. A complete description of a random field can be achieved by constructing all the finite dimensional pdf's of the field at different points in space. A random field $\xi(s)$ at a fixed time instant with given finite dimensional pdf's is called homogeneous if these functions are invariant with respect to a shift of the system of points. The same random field is called statistically homogeneous and isotropic if it is homogeneous in the sense indicated above and the pdf's are also invariant with respect to an arbitrary rotation of the system of points, such as that of a solid body, and to a mirror reflection of this system with respect to an arbitrary plane passing through the origin of the coordinate system. In other words, the statistical moments depend upon the configuration of the system of points for which they are formed, but not upon the position of the system in space [7].
In this section the precipitation phenomenon is modeled as a random field, where the timewise probabilities at a fixed site are referred to as the probability of precipitation (PoP) and the spatial probabilities for a fixed time instant are the coverage probabilities (CP). The areal probability is the fraction of the area hit by rainfall. It does not provide a means of depicting which of the subareas is going to be hit by a precipitation event; it simply represents the estimate of what the fractional coverage area will be. By definition, the PoP at any desired threshold value, $x_0$, such as the standard 0.01 inches, is equivalent to the exceedance probability of this value. If the pdf of precipitation at site i is denoted by $f_i(X)$, then the PoP, $p_i$, at this site is

$$p_i = \int_{x_0}^{\infty} f_i(X)\,dX \qquad (49)$$

Even in the heterogeneous point probability case one can simply calculate the average CP, $\bar{p}$, for a given number, n, of sites as

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i \qquad (50)$$

In the special case where the point probability is homogeneous over the forecast area, $\bar{p}$ is equal to the common value of the point probability. However, in the heterogeneous case it does not have any physical correspondence.
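As a small illustration of these two quantities, the sketch below estimates the PoP at each site empirically, as the fraction of recorded amounts exceeding the threshold, and then averages the site values. The site names and precipitation amounts are invented purely for illustration, and the empirical fraction stands in for the integral of the (unknown) site pdf.

```python
def pop_from_record(amounts, threshold=0.01):
    """Empirical PoP: fraction of recorded amounts exceeding the threshold."""
    return sum(1 for x in amounts if x > threshold) / len(amounts)


def average_coverage_probability(pops):
    """Arithmetic average of the point probabilities over the n sites."""
    return sum(pops) / len(pops)


# Hypothetical daily precipitation records (inches) at three sites.
records = {
    "site_A": [0.0, 0.12, 0.0, 0.30, 0.0, 0.05, 0.0, 0.0],
    "site_B": [0.0, 0.0, 0.02, 0.0, 0.0, 0.0, 0.40, 0.0],
    "site_C": [0.15, 0.20, 0.0, 0.08, 0.0, 0.0, 0.11, 0.03],
}

pops = [pop_from_record(amounts) for amounts in records.values()]
p_bar = average_coverage_probability(pops)
print(pops, p_bar)
```

Because the three site values differ, the average here summarizes the region without corresponding to the PoP of any single site.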

7.1 Theoretical treatment
In the most general case, none of the sites have equal PoP's, which implies that the random field is heterogeneous. It is quite likely that probabilities vary from place to place even within a single area. In practice, in addition to the spatial correlation variations, the following three situations give rise to heterogeneous and anisotropic random fields: (i) identical pdf's of precipitation at different sites but non-uniform threshold levels, (ii) non-identical pdf's of precipitation at the sites but a uniform threshold level, (iii) non-identical pdf's at the sites and non-uniform threshold levels.
Let the PoP and its complementary probability at the j-th site within a region of n sites be given by $p_j$ and $q_j$ (j = 1, 2, ..., n), respectively. They are point-wise mutually exclusive and complementary, i.e., $p_j + q_j = 1$. The CP, $P(C = i \mid n)$, including i sites can be evaluated through an enumeration technique after the application of the summation and multiplication rules of probability theory. For the derivation of this expression the conceptual model can be visualized as in Figure 4, where there are n mutually exclusive sites represented by square boxes.

Figure 4. Probability derivation conceptual model
Each box has two mutually exclusive and exhaustive events, with probabilities $p_j$ and $q_j$ for site j. If, out of n sites, i sites are required to have precipitation and the remaining (n-i) sites to be without, then the whole set of possible cases (alternative sequences) can be categorized into two mutually exclusive groups. The first group includes the i sites, with n possibilities for each, and the second the (n-i) sites, again each with n possibilities. Since the PoP at each site is independent of the others, the probability of collective precipitation occurrence can be found from the multiplication rule of independent events as $p_1 p_2 p_3 \cdots p_i$ for one pattern. The i sites with precipitation have $i(i-1)(i-2) \cdots 1 = i!$ mutually exclusive alternatives, each with this probability of precipitation. Mutually exclusive events imply summation in probability theory, and therefore the successive summation terms express all the possible, collective and exhaustive joint probabilities of i sites. The second part on the right hand side in Figure 4 corresponds to the joint probability of the remaining (n-i) sites, each belonging to the respective pattern of joint PoP occurrence at the i preceding sites. Again, the independence principle of probability theory provides the multiplication of the probabilities of the remaining non-precipitation sites.
The explanation in the last paragraph can be translated into a mathematical formulation, Eq. (51), in which the i summation terms in the first horizontal big bracket include all the possible combinations of i precipitation occurrences at n sites, whereas the second horizontal big bracket is the multiplication term corresponding to the possible non-precipitation combinations. For homogeneous PoP's ($p_j = p$ and $q_j = q$ at all sites), the term in the first horizontal big bracket simplifies to $n(n-1) \cdots (n-i+1)\,p^{i}$ and the multiplication in the next horizontal big bracket yields $q^{n-i}$; hence Eq. (51) reduces to

$$P(C = i \mid n) = \binom{n}{i}\, p^{i} q^{n-i} \qquad (52)$$

This is the well-known Binomial distribution with two-state Bernoulli trials [2].
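The enumeration idea can be sketched in a few lines. The code below is not a transcription of Eq. (51); it assumes independent sites and simply sums, over every unordered choice of i wet sites, the product of the wet-site PoP's and the dry-site complementary probabilities. It then checks that equal PoP's give back the Binomial value. Function names are hypothetical.

```python
from itertools import combinations
from math import comb, prod


def coverage_probability(pops, i):
    """P(C = i | n): probability that exactly i of the n independent
    sites are wet, by enumerating every choice of i wet sites."""
    n = len(pops)
    total = 0.0
    for wet in combinations(range(n), i):
        wet_set = set(wet)
        # Multiplication rule: p_j for wet sites, q_j = 1 - p_j for dry ones.
        total += prod(
            pops[j] if j in wet_set else 1.0 - pops[j] for j in range(n)
        )
    return total


# Homogeneous check against the Binomial pdf for n = 5, p = 0.3, i = 2.
hetero_formula = coverage_probability([0.3] * 5, 2)
binomial_value = comb(5, 2) * 0.3 ** 2 * 0.7 ** 3
print(hetero_formula, binomial_value)
```

Enumerating unordered site sets counts each pattern once, so no division by i! is needed here; the cost grows combinatorially with n, which is acceptable only for small numbers of sites.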

Numerical applications
The first numerical application is the confirmation of the basic expression, Eq. (51), against the well-known Binomial pdf. For this purpose, a MATLAB program is developed for heterogeneous point probabilities of different sizes. In Figure 5, the results of Eqs. (51) and (52) are plotted versus each other at all probability levels. The scatter points for different sample sizes fall exactly on the 45° line, confirming the validity of the heterogeneous probability formulation in Eq. (51).
After the validation of Eq. (51), it is now possible to use this formulation for any set of heterogeneous probability sequences in time or groups within an area. Prior to real data application, it is helpful to look at various coverage probability variations with the number of sites and the number of wet spells among such sites. For instance, Figure 6 shows the results for the areal or temporal heterogeneous probability set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Although the probability values are in a systematic sequence in this set, they may occupy any site among the available locations. In this figure n = 12 sites are considered, and the coverage probabilities are calculated according to Eq. (51) for the cases where 1, 2, ..., 8 sites are subject to wet spells. The coverage probability indicates the probability of any 1, 2, ..., 8 sites being under the influence of wetness. These wet sites are randomly distributed in a given area of, say, 12 meteorology stations, or they are any 1, 2, ..., 8 wet periods among the available 12 time periods.

Figure 6. Some heterogeneous ACP variation

Another example is given in Figure 7 for n = 9 sites with the same probability set as in the previous example. The comparison of these two figures indicates that as the number of sites increases the CP also increases.
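For readers who wish to experiment with such probability sets, the sketch below computes the whole coverage distribution for n = 9 sites carrying the probabilities 0.1 to 0.9. It uses a site-by-site recursion (a dynamic programming shortcut) rather than the enumeration behind Eq. (51) or the MATLAB program mentioned above, and it assumes independent sites. The function name is hypothetical, and no attempt is made to reproduce Figures 6 or 7.

```python
def coverage_distribution(pops):
    """Return [P(C = 0 | n), ..., P(C = n | n)] for independent sites,
    adding one site at a time."""
    dist = [1.0]  # with no sites, zero wet sites is certain
    for p in pops:
        nxt = [0.0] * (len(dist) + 1)
        for i, prob in enumerate(dist):
            nxt[i] += prob * (1.0 - p)  # the new site stays dry
            nxt[i + 1] += prob * p      # the new site is wet
        dist = nxt
    return dist


pops = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
dist = coverage_distribution(pops)
expected_wet_sites = sum(i * prob for i, prob in enumerate(dist))
for i, prob in enumerate(dist):
    print(i, round(prob, 6))
```

The order in which the sites are processed does not affect the result, which agrees with the remark that the probabilities may occupy any location among the sites.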

CONCLUSIONS
The run-length probability distribution function (pdf) is derived for stationary, dependent stochastic processes of finite length. The necessary analytical derivations are achieved through the combinatorial methodology.
Analytical derivations of various point precipitation quantities, including the probability of precipitation (PoP) concept and the areal or temporal coverage probabilities conditioned on a certain number of wet sites (precipitation occurrence above a given threshold level), have been presented for heterogeneous and homogeneous precipitation evaluations over a region. In this paper, a general formulation for calculating heterogeneous areal or temporal coverage probabilities is developed on the basis of the random field concept, where the temporal and areal precipitation occurrences are assumed independent from one site to another. This general formula reduces exactly to the homogeneous case, where the well-known Binomial probability distribution is valid. The application of the suggested formulation is first shown through its confirmation against the Binomial pdf for homogeneous probability values, and it is seen that the general formula reduces to the classical Binomial pdf exactly. Its theoretical application is presented for a set of probabilities with 9 and 12 sites. The factual data application is shown for 6 meteorology stations in and around Istanbul City, Turkey. It is hoped that in the future the theoretical methodology developed herein can be extended to areally dependent precipitation events.

Figure 2. Run-length expectation variations by sample size