Asymptotic Analysis of the kth Subword Complexity

Ahmadi, Lida; Ward, Mark Daniel

doi:10.3390/e22020207


by Lida Ahmadi ¹,*,† and Mark Daniel Ward ²

¹ Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA

² Department of Statistics, Purdue University, West Lafayette, IN 47907, USA

* Author to whom correspondence should be addressed.

† Current address: Department of Mathematics, 5500 University Parkway, San Bernardino, CA 92407, USA.

Entropy 2020, 22(2), 207; https://doi.org/10.3390/e22020207

Submission received: 25 December 2019 / Revised: 28 January 2020 / Accepted: 4 February 2020 / Published: 12 February 2020

(This article belongs to the Special Issue Information Theory and Language)


Abstract

Patterns within strings enable us to extract vital information regarding a string’s randomness. Whether a string is random (showing little to no repetition in patterns) or periodic (showing repetitions in patterns) is described by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memory-less sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This enables us to have an exact expression for the two moments in terms of the patterns’ auto-correlation and correlation polynomials. We then investigate the asymptotic behavior for values of $k = Θ (\log n)$. In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology that we use involves complex analysis, analytical poissonization and depoissonization, the Mellin transform, and saddle point analysis.

Keywords:

subword complexity; asymptotics; generating functions; saddle point method; probability; the Mellin transform; moments

1. Introduction

Analyzing and understanding occurrences of patterns in a character string is helpful for extracting useful information regarding the nature of a string. We classify strings into low-complexity and high-complexity, according to their level of randomness. For instance, we take the binary string $X = 10101010 \ldots$, which is constructed by repetitions of the pattern $w = 10$. This string is periodic, and therefore has low randomness. Such periodic strings are classified as low-complexity strings, whereas strings that do not show periodicity are considered to have high complexity. An effective way of measuring a string’s randomness is to count all distinct patterns that appear as contiguous subwords in the string. This value is called the Subword Complexity. The name was given by Ehrenfeucht, Lee, and Rozenberg [1], and the notion was initially introduced by Morse and Hedlund in 1938 [2]. The higher the Subword Complexity, the more complex the string is considered to be.

Assessing information about the distribution of the Subword Complexity enables us to better characterize strings and to identify atypically random or periodic strings whose complexities are far from the average complexity [3]. This type of string classification has applications in fields such as data compression [4], genome analysis (see [5,6,7,8,9]), and plagiarism detection [10]. For example, in data compression, a data set is considered compressible if it has low complexity, as it consists of repeated subwords. In computational genomics, Subword Complexity (known as k-mers) is used in detection of repeated sequences and DNA barcoding [11,12]. k-mers are composed of A, T, G, and C nucleotides. For instance, the number of distinct 7-mers in the DNA sequence GTAGAGCTGT is four, meaning that there are four distinct substrings of length 7 in the given DNA sequence. Counting k-mers becomes challenging for longer DNA sequences. Our results can be easily extended to the alphabet $\{A, T, G, C\}$ and directly applied in theoretical analysis of the genomic k-mer distributions under the Bernoulli probabilistic model, particularly when the length n of the sequence approaches infinity.
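As an illustration of the definition, the following minimal Python sketch counts the distinct substrings of a fixed length k in a string. The function name `kth_subword_complexity` is a hypothetical helper chosen for this example and does not come from the paper; the approach is a plain set-based count, not an optimized k-mer counter.

```python
def kth_subword_complexity(text, k):
    """Number of distinct contiguous substrings of length k in text."""
    if k <= 0 or k > len(text):
        return 0
    # Slide a window of width k over the text and collect distinct windows.
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


# The DNA example from the text (7-mers) and the periodic binary string.
print(kth_subword_complexity("GTAGAGCTGT", 7))
print(kth_subword_complexity("10101010", 2))
```

The set-based count uses memory proportional to the number of distinct windows, which is one reason counting k-mers becomes costly for long sequences.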

There are two variations for the definition of the Subword Complexity: the one that counts all distinct subwords of a given string (also known as Complexity Index and Sequence Complexity [13]), and the one that only counts the subwords of the same length, say k, that appear in the string. In our work, we analyze the latter, and we call it the kth Subword Complexity to avoid any confusion.

Throughout this work, we consider the kth Subword Complexity of a random binary string of length n over a memory-less source, and we denote it by

X_{n, k}

. We analyze the first and second factorial moments of

X_{n, k}

(1) for the range

k = Θ (log n)

, as

n \to \infty

. More precisely, we will divide the analysis into three ranges as follows.

i.: $\frac{1}{\log q^{- 1}} \log n < k < \frac{2}{\log q^{- 1} + \log p^{- 1}} \log n$ ,
ii.: $\frac{2}{\log q^{- 1} + \log p^{- 1}} \log n < k < \frac{1}{q \log q^{- 1} + p \log p^{- 1}} \log n$ , and
iii.: $\frac{1}{q \log q^{- 1} + p \log p^{- 1}} \log n < k < \frac{1}{\log p^{- 1}} \log n$ .
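The three ranges can be made concrete with a short Python sketch that computes the four boundary constants for a given p and reports which range a pair (n, k) falls into. The names `range_boundaries` and `classify_k` are hypothetical, and the natural logarithm is an assumption: the ratio k / log n compared against these constants does not depend on the base as long as the same base is used throughout.

```python
import math


def range_boundaries(p):
    """Boundary constants for a = k / log n, assuming 1/2 < p < 1 and q = 1 - p."""
    q = 1.0 - p
    lq, lp = math.log(1.0 / q), math.log(1.0 / p)
    return (1.0 / lq, 2.0 / (lq + lp), 1.0 / (q * lq + p * lp), 1.0 / lp)


def classify_k(n, k, p):
    """Return 'i', 'ii' or 'iii' for the range containing k, or None if outside."""
    a = k / math.log(n)
    b0, b1, b2, b3 = range_boundaries(p)
    if b0 < a < b1:
        return "i"
    if b1 < a < b2:
        return "ii"
    if b2 < a < b3:
        return "iii"
    return None


print(range_boundaries(0.7))
print([classify_k(10**6, k, 0.7) for k in (5, 14, 20, 30)])
```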

Our approach involves two major steps. First, we choose a suitable model for the asymptotic analysis, and afterwards we provide proofs for the derivation of the asymptotic expansion of the first two factorial moments.

1.1. Part I

This part of the analysis is inspired by the earlier work of Jacquet and Szpankowski [14] on the analysis of suffix trees by comparing them to independent tries. A trie, first introduced by René de la Briandais in 1959 (see [15]), is a search tree that stores n strings according to their prefixes. A suffix tree, introduced by Weiner in 1973 (see [16]), is a trie where the strings are suffixes of a given string. Examples of these data structures are given in Figure 1.

A direct asymptotic analysis of the moments is a difficult task, as patterns in a string are not independent from each other. However, we note that each pattern in a string can be regarded as a prefix of a suffix of the string. Therefore, the number of distinct patterns of length k in a string is actually the number of nodes of the suffix tree at level k and lower. It is shown by I. Gheorghiciuc and M. D. Ward [17] that the expected value of the k-th Subword Complexity of a Bernoulli string of length n is asymptotically comparable to the expected value of the number of nodes at level k of a trie built over n independent strings generated by a memory-less source.

We extend this analysis to the desired range for k, and we prove that the result holds when k grows logarithmically with n. Additionally, we show that asymptotically, the second factorial moment of the k-th Subword Complexity can also be estimated by admitting the same independent model generated by a memory-less source. The proof of this theorem relies heavily on the characterization of the overlaps of the patterns with themselves and with one another. Autocorrelation and correlation polynomials explicitly describe these overlaps. The analytic properties of these polynomials are key to understanding repetitions of patterns in large Bernoulli strings. This, in conjunction with Cauchy’s integral formula (used to compare the generating functions in the two models) and the residue theorem, verifies that the second factorial moment of the Subword Complexity behaves the same as in the independent model.

To make this comparison, we derive the generating functions of the first two factorial moments in both settings. In a paper published by F. Bassino, J. Clément, and P. Nicodème in 2012 [18], the authors provide a multivariate probability generating function

f (z, x)

for the number of occurrences of patterns in a finite Bernoulli string. That is, given a pattern w, the coefficient of the term

z^{n} x^{m}

in

f (z, x)

is the probability in the Bernoulli model that a random string of size n has exactly m occurrences of the pattern w. Following their technique, we derive the exact expression for the generating functions of the first two factorial moments of the kth Subword Complexity. In the independent model, the generating functions are obtained by basic probability concepts.

1.2. Part II

This part of the proof is analogous to the analysis of the profile of tries [19]. To capture the asymptotic behavior, the expressions for the first two factorial moments in the independent trie are further improved by means of a Poisson process. The poissonized version yields generating functions in the form of harmonic sums for each of the moments. The Mellin transform and the inverse Mellin transform of these harmonic sums establish a connection between the asymptotic expansion and the singularities of the transformed function. This methodology is sufficient when the length k of the patterns is fixed. However, allowing k to grow with n makes the analysis more challenging. This is because for large k, the dominant term of the poissonized generating function may come from the term involving k, and singularities may not be significant compared to the growth of k. This issue is treated by combining the singularity analysis with a saddle point method [20]. The outcome of the analysis is a precise first-order asymptotic expansion of the moments in the poissonized model. Depoissonization theorems are then applied to obtain the desired result in the Bernoulli model.

2. Results

For a binary string

X = X_{1} X_{2} . . . X_{n}

, where

X_{i}

’s (

i = 1, . . ., n

) are independent and identically distributed random variables, we assume that

P (X_{i} = 1) = p

,

P (X_{i} = 0) = q = 1 - p

, and

p > q

. We define the kth Subword Complexity,

X_{n, k}

, to be the number of distinct substrings of length k that appear in a random string X with the above assumptions. In this work, we obtain the first order asymptotics for the average and the second factorial moment of

X_{n, k}

. The analysis is done in the range

k = Θ (log n)

. We rewrite this range as

k = a log n

, and by performing a saddle point analysis, we will show that

\begin{matrix} 1 / log q^{- 1} < a < 1 / log p^{- 1} \end{matrix}

(1)

In the first step, we compare the kth Subword Complexity to an independent model constructed in the following way: We store a set of n independently generated strings by a memory-less source in a trie. This means that each string is a sequence of independent and identically distributed Bernoulli random variables from the binary alphabet

A = {0, 1}

, with

P (1) = p

,

P (0) = q = 1 - p

. We denote the number of distinct prefixes of length k in the trie by

{\hat{X}}_{n, k}

, and we call it the kth Prefix Complexity. Before proceeding any further, we recall that the factorial moments of a random variable are defined as follows.

Definition 1.

The jth factorial moment of a random variable X is defined as

\begin{matrix} E [{(X)}_{j}] = E [(X) (X - 1) (X - 2) . . . (X - j + 1)], \end{matrix}

(2)

where j = 1, 2, …. We will show that the first and second factorial moments of

X_{n, k}

are asymptotically comparable to those of

{\hat{X}}_{n, k}

, when

k = Θ (log n)

. We have the following theorems.

Theorem 1.

For large values of n, and for

k = Θ (log n)

, there exists

M > 0

such that

E [X_{n, k}] - E [{\hat{X}}_{n, k}] = O (n^{- M}) .

We also prove a similar result for the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity:

Theorem 2.

For large values of n, and for

k = Θ (log n)

, there exists

ϵ > 0

such that

E [{(X_{n, k})}_{2}] - E [{({\hat{X}}_{n, k})}_{2}] = O (n^{- ϵ}) .

In the second part of our analysis, we derive the first order asymptotics of the kth Prefix Complexity. The methodology used here is analogous to the analysis of profile of tries [19]. The rate of the asymptotic growth depends on the location of the value a as seen in (1). For instance, for the average kth Subword Complexity,

E [X_{n, k}]

, we have the following observations.

i.: For the range $I_{1} : \frac{1}{log q^{- 1}} < a < \frac{2}{log q^{- 1} + log p^{- 1}}$ , the growth rate is of order $O (2^{k})$ ,
ii.: in the range $I_{2} : \frac{2}{log q^{- 1} + log p^{- 1}} < a < \frac{1}{q log q^{- 1} + p log p^{- 1}}$ , we observe some oscillations with n, and
iii.: in the range $I_{3} : \frac{1}{q log q^{- 1} + p log p^{- 1}} < a < \frac{1}{log p^{- 1}}$ , the average has a linear growth $O (n)$ .

The above observations will be discussed in depth in the proofs of the following theorems.

Theorem 3.

The average of the kth Prefix Complexity has the following asymptotic expansion

i.: For $a \in I_{1}$ ,

$\begin{matrix} E [{\hat{X}}_{n, k}] & = 2^{k} - Φ_{1} ((1 + log p) {log}_{p / q} n) \frac{n^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})), \end{matrix}$

(3)

where $ν = - r_{0} + a log (p^{- r_{0}} + q^{- r_{0}})$ , and

$\begin{matrix} Φ_{1} (x) & = \frac{{(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2}}{\sqrt[]{2 π} log p / q} \sum_{j \in Z} Γ (r_{0} + i t_{j}) e^{- 2 π i j x} \end{matrix}$

is a bounded periodic function.
ii.: For $a \in I_{2}$ ,

$\begin{matrix} E [{\hat{X}}_{n, k}] & = Φ_{1} ((1 + log p) {log}_{p / q} n) \frac{n^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})) . \end{matrix}$
iii.: For $a \in I_{3}$

$\begin{matrix} E [{\hat{X}}_{n, k}] & = n + O (n^{ν_{0}}), \end{matrix}$

for some $ν_{0} < 1$ .

Theorem 4.

The second factorial moment of the kth Prefix Complexity has the following asymptotic expansion.

i.: For $a \in I_{1}$ ,

$\begin{matrix} E [{({\hat{X}}_{n, k})}_{2}] = {(2^{k} - Φ_{1} ({log}_{p / q} n (1 + log p)) \frac{n^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})))}^{2} . \end{matrix}$
ii.: For $a \in I_{2}$ ,

$\begin{matrix} E [{({\hat{X}}_{n, k})}_{2}] & = Φ_{1}^{2} ({log}_{p / q} n (1 + log p)) \frac{n^{2 ν}}{log n} (1 + O (\frac{1}{log n})) . \end{matrix}$
iii.: For $a \in I_{3}$ ,

$\begin{matrix} E [{({\hat{X}}_{n, k})}_{2}] = n^{2} + O (n^{2 ν_{0}}) . \end{matrix}$

The periodic function

Φ_{1} (x)

in Theorems 3 and 4 is shown in Figure 2.

The results in Theorem 4 also hold for the second moment of the kth Subword Complexity, as the analysis can be easily extended from the second factorial moment to the second moment. The variance of the kth Prefix Complexity, however, as seen in Figure 3, does not show the same asymptotic behavior as the variance of the kth Subword Complexity.

3. Proofs and Methods

3.1. Groundwork

We first introduce some terminology and lemmas regarding overlaps of patterns and their number of occurrences in texts. Some of the notation we use in this work is borrowed from [18] and [21].

Definition 2.

For a binary word $w = w_{1} \ldots w_{k}$ of length k, the autocorrelation set $S_{w}$ of the word w is defined in the following way.

\begin{matrix} S_{w} = \{w_{i + 1} \ldots w_{k} \mid w_{1} \ldots w_{i} = w_{k - i + 1} \ldots w_{k}\} . \end{matrix}

(4)

The autocorrelation index set is

\begin{matrix} P (w) = \{i \mid w_{1} \ldots w_{i} = w_{k - i + 1} \ldots w_{k}\}, \end{matrix}

(5)

and the autocorrelation polynomial is

\begin{matrix} S_{w} (z) = \sum_{i \in P (w)} P (w_{i + 1} \ldots w_{k}) z^{k - i} . \end{matrix}

(6)
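To make Definition 2 concrete, here is a minimal Python sketch that computes the autocorrelation index set and the coefficients of the autocorrelation polynomial for a binary word. The helper names (`word_prob`, `autocorrelation_indices`, `autocorrelation_poly`) are hypothetical, and exact rational arithmetic is an illustrative choice. The coefficient list is indexed by degree.

```python
from fractions import Fraction


def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = Fraction(1)
    for c in word:
        prob *= p if c == "1" else 1 - p
    return prob


def autocorrelation_indices(w):
    """Indices i with w_1..w_i equal to w_{k-i+1}..w_k, as in (5)."""
    k = len(w)
    return [i for i in range(1, k + 1) if w[:i] == w[k - i:]]


def autocorrelation_poly(w, p):
    """Coefficients of S_w(z) from (6); entry d is the coefficient of z^d."""
    k = len(w)
    coeffs = [Fraction(0)] * k
    for i in autocorrelation_indices(w):
        coeffs[k - i] += word_prob(w[i:], p)
    return coeffs


p = Fraction(2, 3)
print(autocorrelation_indices("101"))
print(autocorrelation_poly("101", p))
```

For w = 101 the index set contains the trivial overlap i = 3 and the overlap i = 1, so the polynomial is 1 + pq z^2.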

Definition 3.

For the distinct binary words $w = w_{1} \ldots w_{k}$ and $w^{'} = w_{1}^{'} \ldots w_{k}^{'}$, the correlation set $S_{w, w^{'}}$ of the words w and $w^{'}$ is

\begin{matrix} S_{w, w^{'}} = \{w_{i + 1}^{'} \ldots w_{k}^{'} \mid w_{1}^{'} \ldots w_{i}^{'} = w_{k - i + 1} \ldots w_{k}\} . \end{matrix}

(7)

The correlation index set is

\begin{matrix} P (w, w^{'}) = \{i \mid w_{1}^{'} \ldots w_{i}^{'} = w_{k - i + 1} \ldots w_{k}\}, \end{matrix}

(8)

and the correlation polynomial is

\begin{matrix} S_{w, w^{'}} (z) = \sum_{i \in P (w, w^{'})} P (w_{i + 1}^{'} \ldots w_{k}^{'}) z^{k - i} . \end{matrix}

(9)
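Definition 3 can be illustrated with a similar Python sketch. It is self-contained and uses hypothetical helper names (`word_prob`, `correlation_indices`, `correlation_poly`). The correlation is not symmetric in its two arguments, which the example makes visible.

```python
from fractions import Fraction


def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = Fraction(1)
    for c in word:
        prob *= p if c == "1" else 1 - p
    return prob


def correlation_indices(w, w2):
    """Indices i where the length-i prefix of w2 equals the length-i suffix of w, as in (8)."""
    k = len(w)
    return [i for i in range(1, k + 1) if w2[:i] == w[k - i:]]


def correlation_poly(w, w2, p):
    """Coefficients of S_{w,w2}(z) from (9); entry d is the coefficient of z^d."""
    k = len(w)
    coeffs = [Fraction(0)] * k
    for i in correlation_indices(w, w2):
        coeffs[k - i] += word_prob(w2[i:], p)
    return coeffs


p = Fraction(2, 3)
print(correlation_poly("110", "101", p))  # suffix "10" of 110 matches prefix "10" of 101
print(correlation_poly("101", "110", p))  # suffix "1" of 101 matches prefix "1" of 110
```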

The following two lemmas present the probability generating functions for the number of occurrences of a single pattern and of a pair of distinct patterns, respectively, in a random text of length n. For a detailed discussion of how such generating functions are obtained, refer to [18].

Lemma 1.

The occurrence probability generating function (PGF) for a single pattern w in a binary text over a memoryless source is given by

F_{w} (z, x - 1)

, where

\begin{matrix} F_{w} (z, t) = \frac{1}{1 - A (z) - \frac{t P (w) z^{k}}{1 - t (S_{w} (z) - 1)}}, \end{matrix}

(10)

The coefficient

[z^{n} x^{m}] F_{w} (z, x - 1)

is the probability that a random binary string of length n has m occurrences of the pattern w.

Lemma 2.

The occurrence PGF for two distinct patterns of length k in a Bernoulli random text is given by

F_{w, w^{'}} (z, x_{1} - 1, x_{2} - 1)

where,

\begin{matrix} F_{w, w^{'}} (z, t_{1}, t_{2}) = \frac{1}{1 - A (z) - M (z, t_{1}, t_{2})}, \end{matrix}

(11)

and

M (z, t_{1}, t_{2}) = (\begin{matrix} P (w) z^{k} t_{1} & P (w^{'}) z^{k} t_{2} \end{matrix}) {(\begin{matrix} I - (\begin{matrix} (S_{w} (z) - 1) t_{1} & S_{w, w^{'}} (z) t_{2} \\ S_{w^{'}, w} (z) t_{1} & (S_{w^{'}} (z) - 1) t_{2} \end{matrix}) \end{matrix})}^{- 1} (\begin{matrix} 1 \\ 1 \end{matrix}) .

The coefficient

[z^{n} x_{1}^{m_{1}} x_{2}^{m_{2}}] F_{w, w^{'}} (z, x_{1} - 1, x_{2} - 1)

is the probability that there are

m_{1}

occurrences of w and

m_{2}

occurrences of

w^{'}

in a random string of length n.

The above results will be used to find the generating functions for the first two factorial moments of the kth Subword Complexity in the following section.

3.2. Derivation of Generating Functions

Lemma 3.

For generating functions

H_{k} (z) = \sum_{n \geq 0} E [X_{n, k}] z^{n}

and

G_{k} (z) = \sum_{n \geq 0} E [{(X_{n, k})}_{2}] z^{n}

, we have

i.: $\begin{matrix} H_{k} (z) = \sum_{w \in A^{k}} (\frac{1}{1 - z} - \frac{S_{w} (z)}{D_{w} (z)}), \end{matrix}$

(12)

where $D_{w} (z) = P (w) z^{k} + (1 - z) S_{w} (z)$ , and
ii.: $\begin{matrix} G_{k} (z) = \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - \frac{S_{w} (z)}{D_{w} (z)} - \frac{S_{w^{^{'}}} (z)}{D_{w^{'}} (z)} + \frac{S_{w} (z) S_{w^{^{'}}} (z) - S_{w, w^{'}} (z) S_{w^{^{'}}, w} (z)}{D_{w, w^{'}} (z)}), \end{matrix}$

(13)

where

$\begin{matrix} D_{w, w^{'}} (z) = (1 - z) (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)) \\ + z^{k} (P (w) (S_{w^{'}} (z) - S_{w, w^{'}} (z)) + P (w^{'}) (S_{w} (z) - S_{w^{'}, w} (z))) . \end{matrix}$

(14)

Proof.

i .

We define

X_{n, k}^{(w)} = \{\begin{matrix} 1 & if w appears at least once in string X \\ 0 & otherwise . \end{matrix}

This yields

\begin{matrix} E [X_{n, k}^{(w)}] & = P (X_{n, k}^{(w)} = 1) \\ & = 1 - P (X_{n, k}^{(w)} = 0) \\ & = 1 - [z^{n} x^{0}] F_{w} (z, x - 1) . \end{matrix}

(15)

We observe that $[z^{n} x^{0}] F_{w} (z, x - 1) = [z^{n}] F_{w} (z, - 1)$. By defining $f_{w} (z) = F_{w} (z, - 1)$ and using (10) with $A (z) = z$ for the binary memoryless source, we obtain

\begin{matrix} f_{w} (z) & = \frac{S_{w} (z)}{P (w) z^{k} + (1 - z) S_{w} (z)} . \end{matrix}

(16)
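Formula (16) can be checked numerically on small cases. The sketch below expands $S_{w}(z) / D_{w}(z)$ as a power series with exact rational arithmetic and compares each coefficient with the probability, obtained by exhaustive enumeration, that w does not occur in a random string of length n. All function names are hypothetical, and the word 1011 with p = 2/3 is an arbitrary test case.

```python
from fractions import Fraction
from itertools import product


def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = Fraction(1)
    for c in word:
        prob *= p if c == "1" else 1 - p
    return prob


def autocorrelation_poly(w, p):
    """Coefficients of S_w(z); entry d is the coefficient of z^d."""
    k = len(w)
    coeffs = [Fraction(0)] * k
    for i in range(1, k + 1):
        if w[:i] == w[k - i:]:
            coeffs[k - i] += word_prob(w[i:], p)
    return coeffs


def no_occurrence_series(w, p, n_max):
    """Coefficients [z^n] of S_w(z) / D_w(z) for n = 0..n_max."""
    k = len(w)
    s = autocorrelation_poly(w, p)
    d = [Fraction(0)] * (k + 1)      # D_w(z) = P(w) z^k + (1 - z) S_w(z)
    for j, c in enumerate(s):
        d[j] += c                    # contribution of S_w(z)
        d[j + 1] -= c                # contribution of -z S_w(z)
    d[k] += word_prob(w, p)
    f = []
    for n in range(n_max + 1):       # power series division f * D = S
        acc = s[n] if n < k else Fraction(0)
        for j in range(1, min(n, k) + 1):
            acc -= d[j] * f[n - j]
        f.append(acc / d[0])
    return f


def no_occurrence_brute(w, p, n):
    """P(w does not occur in a random string of length n), by enumeration."""
    total = Fraction(0)
    for letters in product("01", repeat=n):
        text = "".join(letters)
        if w not in text:
            total += word_prob(text, p)
    return total


p = Fraction(2, 3)
series = no_occurrence_series("1011", p, 9)
print(all(series[n] == no_occurrence_brute("1011", p, n) for n in range(10)))
```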

Having the above function, we derive the following result.

\begin{matrix} H_{k} (z) & = \sum_{n \geq 0} E [X_{n, k}] z^{n} \\ & = \sum_{n \geq 0} \sum_{w \in A^{k}} (1 - [z^{n}] f_{w} (z)) z^{n} \\ & = \sum_{w \in A^{k}} (\frac{1}{1 - z} - f_{w} (z)) \\ & = \sum_{w \in A^{k}} (\frac{1}{1 - z} - \frac{S_{w} (z)}{D_{w} (z)}) . \end{matrix}

(17)

i i .

For this part, we first note that

\begin{matrix} E [{(X_{n, k})}_{2}] & = E [X_{n, k}^{2}] - E [X_{n, k}] \\ = E [{(X_{n, k}^{(w)} + . . . + X_{n, k}^{(w^{(r)})})}^{2}] - E [X_{n, k}^{(w)} + . . . + X_{n, k}^{(w^{(r)})}] \\ = \sum_{w \in A^{k}} E [{(X_{n, k}^{(w)})}^{2}] + \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} E [X_{n, k}^{(w)} X_{n, k}^{(w^{'})}] - \sum_{w \in A^{k}} E [X_{n, k}^{(w)}] \\ = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} E [X_{n, k}^{(w)} X_{n, k}^{(w^{'})}] . \end{matrix}

(18)

Due to properties of indicator random variables, we observe that the expected value of the second factorial moment has only one term:

\begin{matrix} E [{(X_{n, k})}_{2}] = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} E [X_{n, k}^{(w)} X_{n, k}^{(w^{'})}] . \end{matrix}

(19)
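The identity in (19) can be verified by exhaustive enumeration for small n and k. The following sketch computes the second factorial moment directly as $E[X(X-1)]$ and, separately, as the sum over ordered pairs of distinct words of the probability that both appear. The function names and the parameter values are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import permutations, product


def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = Fraction(1)
    for c in word:
        prob *= p if c == "1" else 1 - p
    return prob


def second_factorial_moment_direct(n, k, p):
    """E[X (X - 1)] where X is the number of distinct length-k substrings."""
    total = Fraction(0)
    for letters in product("01", repeat=n):
        text = "".join(letters)
        x = len({text[i:i + k] for i in range(n - k + 1)})
        total += word_prob(text, p) * x * (x - 1)
    return total


def second_factorial_moment_pairs(n, k, p):
    """Sum over ordered pairs w != w' of P(both w and w' appear)."""
    joint = {}
    for letters in product("01", repeat=n):
        text = "".join(letters)
        subwords = {text[i:i + k] for i in range(n - k + 1)}
        for pair in permutations(sorted(subwords), 2):
            joint[pair] = joint.get(pair, Fraction(0)) + word_prob(text, p)
    return sum(joint.values(), Fraction(0))


p = Fraction(2, 3)
print(second_factorial_moment_direct(6, 2, p) == second_factorial_moment_pairs(6, 2, p))
```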

We proceed by defining a second indicator variable as follows.

X_{n, k}^{(w)} X_{n, k}^{(w^{'})} = \{\begin{matrix} 1 & if X_{n, k}^{(w)} = X_{n, k}^{(w^{'})} = 1 \\ 0 & otherwise . \end{matrix}

This gives

\begin{matrix} E [X_{n, k}^{(w)} X_{n, k}^{(w^{'})}] & = P (X_{n, k}^{(w)} = 1, X_{n, k}^{(w^{'})} = 1) \\ = 1 - P (X_{n, k}^{(w)} = 0 \cup X_{n, k}^{(w^{'})} = 0) \\ = 1 - P (X_{n, k}^{(w)} = 0) - P (X_{n, k}^{(w^{'})} = 0) + P (X_{n, k}^{(w)} = 0, X_{n, k}^{(w^{'})} = 0) . \end{matrix}

Finally, we are able to express $E [{(X_{n, k})}_{2}]$ in the following form:

\begin{matrix} E [{(X_{n, k})}_{2}] & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - [z^{n}] f_{w} (z) - [z^{n}] f_{w^{'}} (z) + [z^{n}] f_{w, w^{'}} (z)), \end{matrix}

(20)

where $f_{w, w^{'}} (z) = F_{w, w^{'}} (z, - 1, - 1)$ and $[z^{n}] F_{w, w^{'}} (z, - 1, - 1) = [z^{n} x_{1}^{0} x_{2}^{0}] F_{w, w^{'}} (z, x_{1} - 1, x_{2} - 1)$. By (11) with $A (z) = z$ we have

\begin{matrix} f_{w, w^{'}} (z) & = \frac{S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)}{D_{w, w^{'}} (z)} \end{matrix}

(21)

Having the above expression, we finally obtain

\begin{matrix} G_{k} (z) & = \sum_{n \geq 0} E [{(X_{n, k})}_{2}] z^{n} \\ = \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{n \geq 0} (1 - [z^{n}] f_{w} (z) - [z^{n}] f_{w^{^{'}}} (z) + [z^{n}] f_{w, w^{^{'}}} (z)) z^{n} \\ = \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - f_{w} (z) - f_{w^{^{'}}} (z) + f_{w, w^{^{'}}} (z)) \\ = \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - \frac{S_{w} (z)}{D_{w} (z)} - \frac{S_{w^{^{'}}} (z)}{D_{w^{'}} (z)} + \frac{S_{w} (z) S_{w^{^{'}}} (z) - S_{w, w^{'}} (z) S_{w^{^{'}}, w} (z)}{D_{w, w^{'}} (z)}) . \end{matrix}

(22)

□

In the following lemma, we present the generating functions for the first two factorial moments for the kth Prefix Complexity in the independent model.

Lemma 4.

For

{\hat{H}}_{k} (z) = \sum_{n \geq 0} E [{\hat{X}}_{n, k}] z^{n}

and

{\hat{G}}_{k} (z) = \sum_{n \geq 0} E [{({\hat{X}}_{n, k})}_{2}] z^{n}

, which are the generating functions for

E [{\hat{X}}_{n, k}]

and

E [{({\hat{X}}_{n, k})}_{2}]

respectively, we have

i.: $\begin{matrix} {\hat{H}}_{k} (z) & = \sum_{w \in A^{k}} (\frac{1}{1 - z} - \frac{1}{1 - (1 - P (w)) z}) . \end{matrix}$

(23)
ii.: $\begin{matrix} {\hat{G}}_{k} (z) & = \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - \frac{1}{1 - (1 - P (w)) z} - \frac{1}{1 - (1 - P (w^{'})) z}) \\ + \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} \frac{1}{1 - (1 - P (w) - P (w^{'})) z} . \end{matrix}$

(24)

Proof.

i .

We define the indicator variable

{\hat{X}}_{n, k}^{(w)}

as follows.

{\hat{X}}_{n, k}^{(w)} = \{\begin{matrix} 1 & if w is a prefix of at least one string in P, the set of n independent strings \\ 0 & otherwise . \end{matrix}

For each

{\hat{X}}_{n, k}^{(w)}

, we have

\begin{matrix} E [{\hat{X}}_{n, k}^{(w)}] & = P ({\hat{X}}_{n, k}^{(w)} = 1) \\ = 1 - P ({\hat{X}}_{n, k}^{(w)} = 0) \\ = 1 - {(1 - P (w))}^{n} . \end{matrix}

(25)

Summing over all words w of length k determines the generating function ${\hat{H}}_{k} (z)$:

\begin{matrix} {\hat{H}}_{k} (z) & = \sum_{n \geq 0} E [{\hat{X}}_{n, k}] z^{n} \\ & = \sum_{w \in A^{k}} (\frac{1}{1 - z} - \frac{1}{1 - (1 - P (w)) z}) . \end{matrix}

(26)
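Since (25) depends on w only through P(w), the expected kth Prefix Complexity can be evaluated by grouping words according to their number of ones. The sketch below does this in floating point. The function name `expected_prefix_complexity` is hypothetical, and the grouping by binomial coefficients is a computational convenience rather than a step taken in the paper.

```python
import math


def expected_prefix_complexity(n, k, p):
    """E[X-hat_{n,k}] = sum over w in A^k of 1 - (1 - P(w))^n.

    A word with j ones has probability p^j q^(k-j), and there are C(k, j) such words.
    """
    q = 1.0 - p
    return sum(
        math.comb(k, j) * (1.0 - (1.0 - p**j * q**(k - j)) ** n)
        for j in range(k + 1)
    )


# With a single string exactly one prefix is present; with many strings and
# small k nearly all 2^k prefixes are present.
print(expected_prefix_complexity(1, 5, 0.7))
print(expected_prefix_complexity(10**6, 3, 0.7))
```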

i i .

Similarly to (18) and (20), we obtain

\begin{matrix} E [{({\hat{X}}_{n, k})}_{2}] & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} E [{\hat{X}}_{n, k}^{(w)} {\hat{X}}_{n, k}^{(w^{'})}] \\ = \sum_{\begin{matrix} w, w^{^{'}} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - {(1 - P (w))}^{n} - {(1 - P (w^{'}))}^{n} + {(1 - P (w) - P (w^{'}))}^{n}) . \end{matrix}

(27)
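Expression (27) is an inclusion-exclusion over the events that w or w' is missing among the n independent prefixes. The following sketch compares it with an exhaustive computation over all n-tuples of length-k prefixes for small n and k. The function names are hypothetical, and the brute-force model assumes, as in the independent model, that the n prefixes are independent with distribution P(w).

```python
from fractions import Fraction
from itertools import product


def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = Fraction(1)
    for c in word:
        prob *= p if c == "1" else 1 - p
    return prob


def prefix_second_factorial_formula(n, k, p):
    """Right-hand side of (27), summed over ordered pairs of distinct words."""
    words = ["".join(t) for t in product("01", repeat=k)]
    total = Fraction(0)
    for w in words:
        for w2 in words:
            if w != w2:
                a, b = word_prob(w, p), word_prob(w2, p)
                total += 1 - (1 - a) ** n - (1 - b) ** n + (1 - a - b) ** n
    return total


def prefix_second_factorial_brute(n, k, p):
    """E[X-hat (X-hat - 1)] by enumerating all n-tuples of length-k prefixes."""
    words = ["".join(t) for t in product("01", repeat=k)]
    total = Fraction(0)
    for prefixes in product(words, repeat=n):
        prob = Fraction(1)
        for w in prefixes:
            prob *= word_prob(w, p)
        x = len(set(prefixes))
        total += prob * x * (x - 1)
    return total


p = Fraction(2, 3)
print(prefix_second_factorial_formula(3, 2, p) == prefix_second_factorial_brute(3, 2, p))
```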

Subsequently, we obtain the generating function below.

\begin{matrix} {\hat{G}}_{k} (z) & = \sum_{n \geq 0} E [{({\hat{X}}_{n, k})}_{2}] z^{n} \\ & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{n \geq 0} (1 - {(1 - P (w))}^{n} - {(1 - P (w^{'}))}^{n} + {(1 - P (w) - P (w^{'}))}^{n}) z^{n} \\ & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - \frac{1}{1 - (1 - P (w)) z} - \frac{1}{1 - (1 - P (w^{'})) z}) \\ & + \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \frac{1}{1 - (1 - P (w) - P (w^{'})) z} . \end{matrix}

(28)

□

Our first goal is to compare the coefficients of the generating functions in the two models. The coefficients are expected to be asymptotically equivalent in the desired range for k. To compare the coefficients, we need more information on the analytic properties of these generating functions. This will be discussed in Section 3.3.

3.3. Analytic Properties of the Generating Functions

Here, we turn our attention to the smallest singularities of the two generating functions given in Lemma 3. It has been shown by Jacquet and Szpankowski [21] that

D_{w} (z)

has exactly one root in the disk

| z | \leq ρ

. Following the notations in [21], we denote the root within the disk

| z | \leq ρ

of

D_{w} (z)

by

A_{w}

, and by bootstrapping we obtain

\begin{matrix} A_{w} = 1 + \frac{1}{S_{w} (1)} P (w) + O (P {(w)}^{2}) . \end{matrix}

(29)
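The first-order approximation in (29) can be compared with a numerically computed root. The sketch below builds $D_{w}(z)$ from the autocorrelation polynomial, runs Newton's method starting at z = 1, and compares the result with $1 + P(w) / S_{w}(1)$. Newton's method is used here only as a convenient numerical stand-in for the bootstrapping argument, the helper names are hypothetical, and the code assumes that the root found near 1 is the root $A_{w}$ in the disk.

```python
def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = 1.0
    for c in word:
        prob *= p if c == "1" else 1.0 - p
    return prob


def autocorrelation_poly(w, p):
    """Coefficients of S_w(z); entry d is the coefficient of z^d."""
    k = len(w)
    coeffs = [0.0] * k
    for i in range(1, k + 1):
        if w[:i] == w[k - i:]:
            coeffs[k - i] += word_prob(w[i:], p)
    return coeffs


def d_w_coeffs(w, p):
    """Coefficients of D_w(z) = P(w) z^k + (1 - z) S_w(z)."""
    k = len(w)
    d = [0.0] * (k + 1)
    for j, c in enumerate(autocorrelation_poly(w, p)):
        d[j] += c
        d[j + 1] -= c
    d[k] += word_prob(w, p)
    return d


def poly_eval(coeffs, z):
    """Horner evaluation of a polynomial given by its coefficient list."""
    value = 0.0
    for c in reversed(coeffs):
        value = value * z + c
    return value


def root_near_one(w, p, steps=50):
    """Newton iteration on D_w started at z = 1."""
    d = d_w_coeffs(w, p)
    dd = [j * c for j, c in enumerate(d)][1:]   # derivative coefficients
    z = 1.0
    for _ in range(steps):
        z -= poly_eval(d, z) / poly_eval(dd, z)
    return z


def first_order_root(w, p):
    """The approximation 1 + P(w) / S_w(1) from (29)."""
    return 1.0 + word_prob(w, p) / sum(autocorrelation_poly(w, p))


w, p = "10101010", 0.6
print(root_near_one(w, p), first_order_root(w, p))
```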

We also denote the derivative of

D_{w} (z)

at the root

A_{w}

, by

B_{w}

, and we obtain

\begin{matrix} B_{w} = - S_{w} (1) + (k - \frac{2 S_{w}^{'} (1)}{S_{w} (1)}) P (w) + O (P {(w)}^{2}) . \end{matrix}

(30)

In this paper, we will prove a similar result for the polynomial

D_{w, w^{'}} (z)

through the following work.

Lemma 5.

If w and

w^{'}

are two distinct binary words of length k and

δ = \sqrt{p}

, there exists

ρ > 1

, such that

ρ δ < 1

and

\begin{matrix} \sum_{w \in A^{k}} [[| S_{w, w^{'}} (ρ) | \leq {(ρ δ)}^{k} θ]] P (w) \geq 1 - θ δ^{k} . \end{matrix}

(31)

Proof.

If the minimal degree of $S_{w, w^{'}} (z)$ is greater than $⌊ k / 2 ⌋$, then

\begin{matrix} | S_{w, w^{'}} (ρ) | \leq {(ρ δ)}^{k} θ \end{matrix}

(32)

for $θ = {(1 - p)}^{- 1}$

. For a fixed

w^{'}

, we have

\begin{matrix} & \sum_{w \in A^{k}} [[S_{w, w^{'}} (z) \text{ has minimal degree} \leq ⌊ k / 2 ⌋]] P (w) \\ & = \sum_{i = 1}^{⌊ k / 2 ⌋} \sum_{w \in A^{k}} [[S_{w, w^{'}} (z) \text{ has minimal degree} = i]] P (w) \\ & = \sum_{i = 1}^{⌊ k / 2 ⌋} \sum_{w_{1} \ldots w_{i} \in A^{i}} P (w_{1} \ldots w_{i}) \sum_{w_{i + 1} \ldots w_{k} \in A^{k - i}} [[S_{w, w^{'}} (z) \text{ has minimal degree} = i]] P (w_{i + 1} \ldots w_{k}) \\ & \leq \sum_{i = 1}^{⌊ k / 2 ⌋} \sum_{w_{1} \ldots w_{i} \in A^{i}} P (w_{1} \ldots w_{i}) p^{k - i} \\ & = \sum_{i = 1}^{⌊ k / 2 ⌋} p^{k - i} \sum_{w_{1} \ldots w_{i} \in A^{i}} P (w_{1} \ldots w_{i}) \\ & = \sum_{i = 1}^{⌊ k / 2 ⌋} p^{k - i} \leq \frac{p^{k - ⌊ k / 2 ⌋}}{1 - p} . \end{matrix}

(33)

This leads to the following

\begin{matrix} \sum_{w \in A^{k}} [[every term of S_{w, w^{'}} (z) is of degree > ⌊ k / 2 ⌋]] P (w) \\ = 1 - \sum_{w \in A^{k}} [[S_{w, w^{'}} (z) has a term of degree \leq ⌊ k / 2 ⌋]] P (w) \\ \geq 1 - \frac{p^{⌈ k / 2 ⌉}}{1 - p} \geq 1 - θ δ^{k} . \end{matrix}

(34)

□
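The bound in (34) can be checked by brute force for a small word length. The sketch below fixes a word w', sums P(w) over all words w for which every term of $S_{w, w^{'}}(z)$ has degree greater than $⌊ k / 2 ⌋$, and compares the total with $1 - p^{⌈ k / 2 ⌉} / (1 - p)$. The helper names, the choice p = 0.6, and the fixed word are assumptions. For simplicity the sum runs over all w, including w = w', which can only lower the left-hand side.

```python
import math
from itertools import product


def word_prob(word, p):
    """Probability of a binary word when P(1) = p and P(0) = 1 - p."""
    prob = 1.0
    for c in word:
        prob *= p if c == "1" else 1.0 - p
    return prob


def has_low_degree_term(w, w2):
    """True if S_{w,w2}(z) has a term z^(k-i) with k - i <= floor(k/2)."""
    k = len(w)
    return any(
        w2[:i] == w[k - i:] and k - i <= k // 2
        for i in range(1, k + 1)
    )


def mass_without_low_terms(w2, p):
    """Total probability of words w whose correlation with w2 has only high-degree terms."""
    k = len(w2)
    total = 0.0
    for letters in product("01", repeat=k):
        w = "".join(letters)
        if not has_low_degree_term(w, w2):
            total += word_prob(w, p)
    return total


p, w2 = 0.6, "10110100"
k = len(w2)
lhs = mass_without_low_terms(w2, p)
rhs = 1.0 - p ** math.ceil(k / 2) / (1.0 - p)
print(lhs, rhs)
```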

Lemma 6.

There exist

K^{'} > 0

, and

ρ > 1

such that

p ρ < 1

, and such that, for every pair of distinct words w, and

w^{'}

of length

k \geq K^{'}

, and for

| z | \leq ρ

, we have

\begin{matrix} | S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z) | > 0 . \end{matrix}

(35)

In other words,

S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)

does not have any roots in

| z | \leq ρ

.

Proof.

There are three cases to consider:

Case

i .

When either

S_{w} (z) = 1

or

S_{w^{'}} (z) = 1

, then every term of

S_{w, w^{'}} (z) S_{w^{'}, w} (z)

has degree k or larger, and therefore

\begin{matrix} | S_{w, w^{'}} (z) S_{w^{'}, w} (z) | \leq k \frac{{(p ρ)}^{k}}{1 - p ρ} . \end{matrix}

(36)

Since ${lim}_{k \to \infty} k \frac{{(p ρ)}^{k}}{1 - p ρ} = 0$, there exists $K_{1} > 0$ such that $k \frac{{(p ρ)}^{k}}{1 - p ρ} < 1$ for $k > K_{1}$. This yields

\begin{matrix} | S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z) | & \geq | S_{w} (z) S_{w^{'}} (z) | - | S_{w, w^{'}} (z) S_{w^{'}, w} (z) | \\ \geq 1 - k \frac{{(p ρ)}^{k}}{1 - p ρ} > 0 . \end{matrix}

(37)

Case

i i .

If the minimal degree for

S_{w} (z) - 1

or

S_{w^{'}} (z) - 1

is greater than

⌊ k / 2 ⌋

, then every term of

S_{w, w^{'}} (z) S_{w^{'}, w} (z)

has degree at least

k / 2

. We also note that, by Lemma 9,

| S_{w} (z) S_{w^{'}} (z) | > 0

. Therefore, there exists

K_{2} > 0

, such that

\begin{matrix} | S_{w} (z) S_{w^{'}} (z) - S_{w^{'}, w} (z) S_{w, w^{'}} (z) | & \geq | S_{w} (z) S_{w^{'}} (z) | - | S_{w^{'}, w} (z) S_{w, w^{'}} (z) | \\ > 0 for k > K_{2} . \end{matrix}

(38)

Case

i i i .

The only remaining case is where the minimal degree for

S_{w} (z) - 1

and

S_{w^{'}} (z) - 1

are both less than or equal to

⌊ k / 2 ⌋

. If

w = w_{1} . . . w_{k}

, then

w^{'} = u w_{1} . . . w_{k - m}

, where u is a word of length

m \geq 1

. Then we have

\begin{matrix} S_{w^{'}, w} (z) = P (w_{k - m + 1} . . . w_{k}) z^{m} (S_{w} (z) - O ({(p z)}^{k - m})) . \end{matrix}

(39)

There exists

K_{3} > 0

, such that

\begin{matrix} | S_{w^{'}, w} (z) | & \leq {(p ρ)}^{m} (| S_{w} (z) | + O ({(p ρ)}^{k - m})) \\ & = {(p ρ)}^{m} | S_{w} (z) | + O ({(p ρ)}^{k}) \\ & < | S_{w} (z) | \text{ for } k > K_{3} . \end{matrix}

(40)

Similarly, we can show that there exists

K_{3}^{'}

, such that

| S_{w, w^{'}} (z) | < | S_{w^{'}} (z) |

. Therefore, for

k > K_{3}^{'}

we have

\begin{matrix} | S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z) | & \geq | S_{w} (z) | | S_{w^{'}} (z) | - | S_{w, w^{'}} (z) | | S_{w^{'}, w} (z) | \\ > | S_{w} (z) | | S_{w^{'}} (z) | - | S_{w} (z) | | S_{w^{'}} (z) | = 0 . \end{matrix}

(41)

We complete the proof by setting

K^{'} = max {K_{1}, K_{2}, K_{3}, K_{3}^{'}}

. □

Lemma 7.

There exist

K_{w, w^{'}} > 0

and

ρ > 1

such that

p ρ < 1

, and for every word w and

w^{'}

of length

k \geq K_{w, w^{'}}

, the polynomial

\begin{matrix} D_{w, w^{'}} (z) = (1 - z) (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)) \\ + z^{k} (P (w) (S_{w^{'}} (z) - S_{w, w^{'}} (z)) + P (w^{'}) (S_{w} (z) - S_{w^{'}, w} (z))), \end{matrix}

(42)

has exactly one root in the disk

| z | \leq ρ

.

Proof.

First note that

\begin{matrix} | S_{w} (z) - S_{w^{'}, w} (z) | & \leq | S_{w} (z) | + | S_{w^{'}, w} (z) | \\ \leq \frac{1}{1 - p ρ} + \frac{p ρ}{1 - p ρ} = \frac{1 + p ρ}{1 - p ρ} . \end{matrix}

(43)

This yields

\begin{matrix} |z^{k} (P (w) (S_{w^{'}} (z) - S_{w, w^{'}} (z)) + P (w^{'}) (S_{w} (z) - S_{w^{'}, w} (z)))| \\ \leq {(p ρ)}^{k} (| S_{w} (z) - S_{w^{'}, w} (z) | + | S_{w^{'}} (z) - S_{w, w^{'}} (z) |) \\ \leq {(p ρ)}^{k} (\frac{2 (1 + p ρ)}{1 - p ρ}) . \end{matrix}

(44)

There exist

K^{'}

,

K^{″}

large enough, such that, for

k > K^{'}

, we have

| (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)) | \geq β > 0,

and for

k > K^{″}

,

{(p ρ)}^{k} (\frac{2 (1 + p ρ)}{1 - p ρ}) < (ρ - 1) β .

If we define

K_{w, w^{'}} = max {K^{'}, K^{″}}

, then we have, for

k \geq K_{w, w^{'}}

,

\begin{matrix} {(p ρ)}^{k} (\frac{2 (1 + p ρ)}{1 - p ρ}) & < (ρ - 1) β \\ < | (1 - z) (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)) | . \end{matrix}

(45)

By Rouché’s theorem, since $(1 - z) (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z))$ has exactly one root in $| z | \leq ρ$ (namely $z = 1$, because the second factor has no roots there by Lemma 6), $D_{w, w^{'}} (z)$ also has exactly one root in $| z | \leq ρ$. □

We denote the root within the disk

| z | \leq ρ

of

D_{w, w^{'}} (z)

by

α_{w, w^{'}}

, and by bootstrapping we obtain

\begin{matrix} α_{w, w^{'}} = 1 + \frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w) \\ + \frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w^{'}) + O (p^{2 k}) . \end{matrix}

(46)

We also denote the derivative of

D_{w, w^{'}} (z)

at the root

α_{w, w^{'}}

, by

β_{w, w^{'}}

, and we obtain

\begin{matrix} β_{w, w^{'}} = S_{w, w^{'}} (1) S_{w^{'}, w} (1) - S_{w} (1) S_{w^{'}} (1) + O (k p^{k}) . \end{matrix}

(47)

We will refer to these expressions in the residue analysis that we present in the next section.

3.4. Asymptotic Difference

We begin this section with the following lemmas on the autocorrelation polynomials.

Lemma 8

(Jacquet and Szpankowski, 1994). For most words w, the autocorrelation polynomial

S_{w} (z)

is very close to 1, with high probability. More precisely, if w is a binary word of length k and

δ = \sqrt{p}

, there exists

ρ > 1

, such that

ρ δ < 1

and

\begin{matrix} \sum_{w \in A^{k}} [[| S_{w} (ρ) - 1 | \leq {(ρ δ)}^{k} θ]] P (w) \geq 1 - θ δ^{k}, \end{matrix}

(48)

where

θ = {(1 - p)}^{- 1}

. We use Iverson notation

[[A]] = \{\begin{matrix} 1 & \text{if } A \text{ holds}, \\ 0 & \text{otherwise} . \end{matrix}
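To get a feel for Lemma 8, the following minimal Python sketch enumerates all binary words of a small length k and measures how much probability mass sits on words whose autocorrelation polynomial is close to 1 at z = ρ. It assumes the usual convention that a shift i contributes P(last i symbols of w) z^i whenever the suffix and the prefix of w of length k - i agree, that the symbol 1 has probability p, and that p ≥ 1/2. The function names and parameter values are illustrative only, and the lemma itself only guarantees the inequality for a suitable ρ.

```python
from itertools import product
from math import sqrt


def autocorrelation_value(w, p, z):
    """Evaluate S_w(z) under the assumed convention described above."""
    k = len(w)
    q = 1.0 - p
    total = 0.0
    for i in range(k):
        # Shift i is a period of w when the suffix w[i:] equals the prefix w[:k-i].
        if w[i:] == w[:k - i]:
            suffix = w[k - i:]          # last i symbols (empty when i == 0)
            ones = suffix.count("1")
            total += (p ** ones) * (q ** (i - ones)) * z ** i
    return total


def lemma8_mass(k, p, rho):
    """Return (mass of words with |S_w(rho) - 1| <= (rho*delta)^k * theta,
    the lower bound 1 - theta * delta^k appearing in (48))."""
    q = 1.0 - p
    delta, theta = sqrt(p), 1.0 / (1.0 - p)
    threshold = (rho * delta) ** k * theta
    mass = 0.0
    for bits in product("01", repeat=k):
        w = "".join(bits)
        ones = w.count("1")
        if abs(autocorrelation_value(w, p, rho) - 1.0) <= threshold:
            mass += (p ** ones) * (q ** (k - ones))
    return mass, 1.0 - theta * delta ** k


# Illustrative parameters (hypothetical): p = 0.6, rho = 1.1, so rho * sqrt(p) < 1.
for k in (6, 10, 14):
    print(k, lemma8_mass(k, 0.6, 1.1))
```

Words with a short period, such as 11...1, are the ones excluded by the indicator; their total probability shrinks as k grows.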

Lemma 9

(Jacquet and Szpankowski, 1994). There exist

K > 0

and

ρ > 1

, such that

p ρ < 1

, and for every binary word w with length

k \geq K

and

| z | \leq ρ

, we have

\begin{matrix} | S_{w} (z) | > 0 . \end{matrix}

(49)

In other words,

S_{w} (z)

does not have any roots in

| z | \leq ρ

.

Lemma 10.

With high probability, for most distinct pairs

{w, w^{'}}

, the correlation polynomial

S_{w, w^{'}} (z)

is very close to 0. More precisely, if w and

w^{'}

are two distinct binary words of length k and

δ = \sqrt{p}

, there exists

ρ > 1

, such that

ρ δ < 1

and

\begin{matrix} \sum_{w \in A^{k}} [[| S_{w, w^{'}} (ρ) | \leq {(ρ δ)}^{k} θ]] P (w) \geq 1 - θ δ^{k} . \end{matrix}

(50)

We will use the above results to prove that the expected values in the Bernoulli model and the model built over a trie are asymptotically equivalent. We now prove Theorem 1 below.

Proof of Theorem 1.

From Lemmas 3 and 4, we have

\begin{matrix} H (z) = \sum_{w \in A^{k}} (\frac{1}{1 - z} - \frac{S_{w} (z)}{D_{w} (z)}), \end{matrix}

and

\begin{matrix} \hat{H} (z) = \sum_{w \in A^{k}} (\frac{1}{1 - z} - \frac{1}{1 - (1 - P (w)) z}) . \end{matrix}

Subtracting the two generating functions, we obtain

\begin{matrix} H (z) - \hat{H} (z) = \sum_{w \in A^{k}} (\frac{1}{1 - (1 - P (w)) z} - \frac{S_{w} (z)}{D_{w} (z)}) . \end{matrix}

(51)

We define

\begin{matrix} Δ_{w} (z) = \frac{1}{1 - (1 - P (w)) z} - \frac{S_{w} (z)}{D_{w} (z)} . \end{matrix}

(52)

Therefore, by the Cauchy integral formula (see [20]), we have

\begin{matrix} [z^{n}] Δ_{w} (z) & = \frac{1}{2 π i} \oint Δ_{w} (z) \frac{d z}{z^{n + 1}} = {Res}_{z = 0} \frac{Δ_{w} (z)}{z^{n + 1}}, \end{matrix}

(53)

where the path of integration is a circle about zero with counterclockwise orientation. We note that the above integrand has poles at

z = 0

,

z = \frac{1}{1 - P (w)}

, and

z = A_{w}

(refer to expression (29)). Therefore, we define

\begin{matrix} I_{n}^{w} (ρ) & : = \frac{1}{2 π i} \int_{| z | = ρ} Δ_{w} (z) \frac{d z}{z^{n + 1}}, \end{matrix}

(54)

where the circle of radius

ρ

contains all of the above poles. By the residue theorem, we have

\begin{matrix} I_{n}^{w} (ρ) & = {Res}_{z = 0} \frac{Δ_{w} (z)}{z^{n + 1}} + {Res}_{z = A_{w}} \frac{Δ_{w} (z)}{z^{n + 1}} + {Res}_{z = 1 / (1 - P (w))} \frac{Δ_{w} (z)}{z^{n + 1}} \\ = [z^{n}] Δ_{w} (z) - {Res}_{z = A_{w}} \frac{H_{w} (z)}{z^{n + 1}} + {Res}_{z = 1 / (1 - P (w))} \frac{{\hat{H}}_{w} (z)}{z^{n + 1}} . \end{matrix}

(55)

We observe that

\begin{matrix} {Res}_{z = A_{w}} \frac{Δ_{w} (z)}{z^{n + 1}} = \frac{S_{w} (A_{w})}{B_{w} A_{w}^{n + 1}}, \text{ where } B_{w} \text{ is as in (30)}, \\ {Res}_{z = 1 / (1 - P (w))} \frac{{\hat{H}}_{w} (z)}{z^{n + 1}} = - {(1 - P (w))}^{n + 1} . \end{matrix}

Then we obtain

\begin{matrix} [z^{n}] Δ_{w} (z) & = I_{n}^{w} (ρ) - \frac{S_{w} (A_{w})}{B_{w} A_{w}^{n + 1}} - {(1 - P (w))}^{n + 1}, \end{matrix}

(56)

and finally, we have

\begin{matrix} [z^{n}] (H (z) - \hat{H} (z)) & = \sum_{w \in A^{k}} [z^{n}] Δ_{w} \\ = \sum_{w \in A^{k}} I_{n}^{w} (ρ) - \sum_{w \in A^{k}} (\frac{S_{w} (A_{w})}{B_{w} A_{w}^{n + 1}} + {(1 - P (w))}^{n + 1}) . \end{matrix}

(57)

First, we show that, for sufficiently large n, the sum

\sum_{w \in A^{k}} (\frac{S_{w} (A_{w})}{B_{w} A_{w}^{n + 1}} + {(1 - P (w))}^{n + 1})

approaches zero. □
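The generating function H(z) is built from the ratios S_w(z)/D_w(z), with D_w(z) = (1 - z) S_w(z) + P(w) z^k as in (72). As a sanity check, the sketch below expands S_w(z)/D_w(z) as a power series and compares its nth coefficient with the brute-force probability that w does not occur in a random binary string of length n. The reading of the coefficients as non-occurrence probabilities, the autocorrelation convention (shift i contributes P(last i symbols of w) z^i when w overlaps itself at shift i), and the choice that symbol 1 has probability p are assumptions of this sketch; all function names are hypothetical.

```python
from itertools import product


def word_probability(u, p):
    """Probability of the word u when '1' has probability p and '0' has 1 - p."""
    ones = u.count("1")
    return (p ** ones) * ((1.0 - p) ** (len(u) - ones))


def autocorrelation_coeffs(w, p):
    """Coefficients s_0, ..., s_{k-1} of S_w(z) under the assumed convention."""
    k = len(w)
    return [word_probability(w[k - i:], p) if w[i:] == w[:k - i] else 0.0
            for i in range(k)]


def nonoccurrence_series(w, p, n_max):
    """Power series coefficients of S_w(z) / D_w(z) up to z^n_max."""
    k = len(w)
    s = autocorrelation_coeffs(w, p)
    # D_w(z) = (1 - z) S_w(z) + P(w) z^k has degree k.
    d = [(s[i] if i < k else 0.0) - (s[i - 1] if i >= 1 else 0.0)
         for i in range(k + 1)]
    d[k] += word_probability(w, p)
    c = []
    for n in range(n_max + 1):
        acc = s[n] if n < k else 0.0
        for j in range(1, min(n, k) + 1):
            acc -= d[j] * c[n - j]
        c.append(acc / d[0])            # d[0] = s_0 = 1
    return c


def brute_force_nonoccurrence(w, p, n):
    """Probability that w is not a substring of a random string of length n."""
    total = 0.0
    for bits in product("01", repeat=n):
        text = "".join(bits)
        if w not in text:
            total += word_probability(text, p)
    return total


series = nonoccurrence_series("101", 0.7, 10)
print(series[10], brute_force_nonoccurrence("101", 0.7, 10))
```

Subtracting such a coefficient from 1 gives the probability that w occurs, which is the quantity summed over w in the coefficient of z^n in H(z).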

Lemma 11.

For large enough n, and for

k = Θ (log n)

, there exists

M > 0

such that

\begin{matrix} \sum_{w \in A^{k}} (\frac{S_{w} (A_{w})}{B_{w} A_{w}^{n + 1}} + {(1 - P (w))}^{n + 1}) = O (n^{- M}) . \end{matrix}

(58)

Proof.

We let

\begin{matrix} r_{w} (z) & = {(1 - P (w))}^{z} + \frac{S_{w} (A_{w})}{B_{w} A_{w}^{z}} . \end{matrix}

(59)

The Mellin transform of the above function is

\begin{matrix} r_{w}^{*} (s) = Γ (s) {log}^{- s} (\frac{1}{1 - P (w)}) - \frac{S_{w} (A_{w})}{B_{w}} Γ (s) {log}^{- s} (A_{w}) . \end{matrix}

(60)

We define

\begin{matrix} C_{w} = \frac{S_{w} (A_{w})}{B_{w}} & = \frac{S_{w} (A_{w})}{- S_{w} (1) + O (k P (w))}, \end{matrix}

(61)

which is negative and uniformly bounded for all w. Also, for a fixed s, we have

\begin{matrix} {ln}^{- s} (\frac{1}{1 - P (w)}) & = {ln}^{- s} (1 + P (w) + O (P {(w)}^{2})) \\ = {(P (w) + O (P {(w)}^{2}))}^{- s} \\ = P {(w)}^{- s} {(1 + O (P (w)))}^{- s} \\ = P {(w)}^{- s} (1 + O (P (w))), \end{matrix}

(62)

\begin{matrix} {ln}^{- s} (A_{w}) & = {ln}^{- s} (1 - (- \frac{P (w)}{S_{w} (1)} + O (P {(w)}^{2}))) \\ = {(\frac{P (w)}{S_{w} (1)} + O (P {(w)}^{2}))}^{- s} \\ = {(\frac{P (w)}{S_{w} (1)})}^{- s} {(1 + O (P (w)))}^{- s} \\ = {(\frac{P (w)}{S_{w} (1)})}^{- s} (1 + O (P (w))), \end{matrix}

(63)

and therefore, we obtain

\begin{matrix} r_{w}^{*} (s) & = Γ (s) P {(w)}^{- s} (1 - \frac{1}{S_{w} {(1)}^{- s}}) O (1) . \end{matrix}

(64)

From this expression, and noticing that the function has a removable singularity at

s = 0

, we can see that the Mellin transform

r_{w}^{*} (s)

exists on the strip where

ℜ (s) > - 1

. We still need to investigate the Mellin strip for the sum

\sum_{w \in A^{k}} r_{w}^{*} (s)

. In other words, we need to examine whether summing

r_{w}^{*} (s)

over all words of length k (where k grows with n) has any effect on the analyticity of the function. We observe that

\begin{matrix} \sum_{w \in A^{k}} | r_{w}^{*} (s) | & = \sum_{w \in A^{k}} | Γ (s) P {(w)}^{- s} (1 - \frac{1}{S_{w} {(1)}^{- s}}) O (1) | \\ \leq | Γ (s) | \sum_{w \in A^{k}} P {(w)}^{- ℜ (s)} (1 - \frac{1}{S_{w} {(1)}^{- ℜ (s)}}) O (1) \\ = {(q^{k})}^{- ℜ (s) - 1} | Γ (s) | \sum_{w \in A^{k}} P (w) (1 - S_{w} {(1)}^{ℜ (s)}) O (1) . \end{matrix}

Lemma 8 allows us to split the above sum between the words for which

S_{w} (1) \leq 1 + O (δ^{k})

and words that have

S_{w} (1) > 1 + O (δ^{k})

.

Such a split yields the following

\begin{matrix} \sum_{w \in A^{k}} | r_{w}^{*} (s) | & = {(q^{k})}^{- ℜ (s) - 1} | Γ (s) | O (δ^{k}) . \end{matrix}

(65)

This shows that

\sum_{w \in A^{k}} r_{w}^{*} (s)

is bounded above for

ℜ (s) > - 1

and, therefore, it is analytic. This argument holds for

k = Θ (log n)

as well, as

{(q^{k})}^{- ℜ (s) - 1}

would still be bounded above by a constant

M_{s, k}

that depends on s and k.

We would like to approximate

\sum_{w \in A^{k}} r_{w}^{*} (s)

when

z \to \infty

. By the inverse Mellin transform, we have

\begin{matrix} \sum_{w \in A^{k}} r_{w} (z) & = \frac{1}{2 π i} \int_{c - i \infty}^{c + i \infty} (\sum_{w \in A^{k}} r_{w}^{*} (s)) z^{- s} d s . \end{matrix}

(66)

We choose

c \in (- 1, M)

for a fixed

M > 0

. Then by the direct mapping theorem [22], we obtain

\begin{matrix} \sum_{w \in A^{k}} r_{w} (z) = O (z^{- M}) . \end{matrix}

(67)

and subsequently, we get

\begin{matrix} \sum_{w \in A^{k}} (\frac{S_{w} (A_{w})}{B_{w} A_{w}^{n + 1}} + {(1 - P (w))}^{n + 1}) = O (n^{- M}) . \end{matrix}

(68)

□

We next prove the asymptotic smallness of

I_{n}^{w} (ρ)

in (54).

Lemma 12.

Let

\begin{matrix} I_{n}^{w} (ρ) & = \frac{1}{2 π i} \int_{| z | = ρ} (\frac{1}{1 - (1 - P (w)) z} - \frac{S_{w} (z)}{D_{w} (z)}) \frac{d z}{z^{n + 1}} . \end{matrix}

(69)

For large n and

k = Θ (log n)

, we have

\begin{matrix} \sum_{w \in A^{k}} I_{n}^{w} (ρ) = O (ρ^{- n} {(ρ δ)}^{k}) . \end{matrix}

(70)

Proof.

We observe that

\begin{matrix} | I_{n}^{w} (ρ) | \leq \frac{1}{2 π} \int_{| z | = ρ} |\frac{P (w) z (z^{k - 1} - S_{w} (z))}{D_{w} (z) (1 - (1 - P (w)) z)} \frac{1}{z^{n + 1}}| d z . \end{matrix}

(71)

For

| z | = ρ

, we show that the denominator in (71) is bounded away from zero.

\begin{matrix} | D_{w} (z) | & = | (1 - z) S_{w} (z) + P (w) z^{k} | \\ \geq | 1 - z | | S_{w} (z) | - P (w) | z^{k} | \\ \geq (ρ - 1) α - {(p ρ)}^{k}, \text{ where } α > 0 \text{ by Lemma 9}, \\ > 0, \text{ assuming } k \text{ is large enough that } {(p ρ)}^{k} < α (ρ - 1) . \end{matrix}

(72)

To find a lower bound for

| 1 - (1 - P (w)) z |

, we can choose

K_{w}

large enough such that

\begin{matrix} | 1 - (1 - P (w)) z | & \geq |1 - (1 - P (w)) | z || \\ \geq | 1 - ρ (1 - p^{K_{w}}) | \\ > 0 . \end{matrix}

(73)

We now move on to finding an upper bound for the numerator in (71), for

| z | = ρ

.

\begin{matrix} | z^{k - 1} - S_{w} (z) | & \leq | S_{w} (z) - 1 | + | 1 - z^{k - 1} | \\ \leq (S_{w} (ρ) - 1) + (1 + ρ^{k - 1}) \\ = (S_{w} (ρ) - 1) + O (ρ^{k}) . \end{matrix}

(74)

Therefore, there exists a constant

μ > 0

such that

\begin{matrix} | I_{n}^{w} | & \leq μ ρ P (w) ((S_{w} (ρ) - 1) + O (ρ^{k})) \frac{1}{ρ^{n + 1}} \\ = O (ρ^{- n}) (P (w) (S_{w} (ρ) - 1) + P (w) O (ρ^{k})) . \end{matrix}

(75)

Summing over all patterns w, and applying Lemma 8, we obtain

\begin{matrix} \sum_{w \in A^{k}} | I_{n}^{w} (ρ) | & = O (ρ^{- n}) \sum_{w \in A^{k}} P (w) (S_{w} (ρ) - 1) + O (ρ^{- n + k}) \sum_{w \in A^{k}} P (w) \\ = O (ρ^{- n}) (θ {(ρ δ)}^{k} + \frac{p ρ}{1 - p ρ} θ δ^{k}) + O (ρ^{- n + k}) \\ = O (ρ^{- n} {(ρ δ)}^{k}), \end{matrix}

(76)

which approaches zero as

n \to \infty

and

k = Θ (log n)

. This completes the proof of Theorem 1. □
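Theorem 1 compares the expected kth Subword Complexity with the expected kth Prefix Complexity. For very small parameters both expectations can be computed exactly, which gives a rough numerical illustration. The sketch below is only that: it enumerates all 2^n binary strings for the subword side and uses the closed form E[X̂_{n,k}] = Σ_w (1 - (1 - P(w))^n) for the trie side. The function names are hypothetical, and we assume that the symbol 1 has probability p and that a string of length n contributes its n - k + 1 windows of length k.

```python
from itertools import product


def expected_subword_complexity(n, k, p):
    """Exact E[number of distinct length-k substrings] of a random string of length n."""
    q = 1.0 - p
    total = 0.0
    for bits in product("01", repeat=n):
        text = "".join(bits)
        ones = text.count("1")
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        total += distinct * (p ** ones) * (q ** (n - ones))
    return total


def expected_prefix_complexity(n, k, p):
    """Sum over all words w of length k of 1 - (1 - P(w))^n."""
    q = 1.0 - p
    total = 0.0
    for bits in product("01", repeat=k):
        ones = bits.count("1")
        pw = (p ** ones) * (q ** (k - ones))
        total += 1.0 - (1.0 - pw) ** n
    return total


# Illustrative comparison for small n (the theorem itself is asymptotic in n).
for n in (8, 10, 12):
    print(n, expected_subword_complexity(n, 2, 0.7), expected_prefix_complexity(n, 2, 0.7))
```

For k = 1 the two quantities coincide exactly, since the n letters of a memoryless string are themselves independent; for larger k the overlap between windows is what the correlation polynomials account for.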

Similar to Theorem 1, we provide a proof to show that the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity have the same first-order asymptotic behavior. We are now ready to state the proof of Theorem 2.

Proof of Theorem 2.

As discussed in Lemmas 3 and 4, the generating functions representing

E [{(X_{n, k})}_{2}]

and

E [{({\hat{X}}_{n, k})}_{2}]

respectively, are

\begin{matrix} G (z) = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - \frac{S_{w} (z)}{D_{w} (z)} - \frac{S_{w^{'}} (z)}{D_{w^{'}} (z)} + \frac{S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)}{D_{w, w^{'}} (z)}), \end{matrix}

and

\begin{matrix} \hat{G} (z) & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - z} - \frac{1}{1 - (1 - P (w)) z} - \frac{1}{1 - (1 - P (w^{'})) z}) \\ + \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \frac{1}{1 - (1 - P (w) - P (w^{'})) z} . \end{matrix}

Note that

\begin{matrix} G (z) - \hat{G} (z) & = \sum_{\begin{matrix} w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w \in A^{k}} (\frac{1}{1 - (1 - P (w)) z} - \frac{S_{w} (z)}{D_{w} (z)}) \end{matrix}

(77)

\begin{matrix} + \sum_{\begin{matrix} w \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w^{'} \in A^{k}} (\frac{1}{1 - (1 - P (w^{'})) z} - \frac{S_{w^{'}} (z)}{D_{w^{'}} (z)}) \end{matrix}

(78)

\begin{matrix} + \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{1}{1 - (1 - P (w) - P (w^{'})) z} - \frac{S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)}{D_{w, w^{'}} (z)}) \end{matrix}

(79)

In Theorem 1, we proved that for every

M > 0

(which does not depend on n or k), we have

\begin{matrix} H (z) - \hat{H} (z) & = \sum_{w \in A^{k}} (\frac{1}{1 - (1 - P (w)) z} - \frac{S_{w} (z)}{D_{w} (z)}) = O (n^{- M}) . \end{matrix}

Therefore, both (77) and (78) are of order

(2^{k} - 1) O (n^{- M}) = O (n^{- M + a log 2})

for

k = a log n

. Thus, to show the asymptotic smallness, it is enough to choose

M = a log 2 + ϵ

, where

ϵ

is a small positive value. Now, it only remains to show (79) is asymptotically negligible as well. We define

\begin{matrix} Δ_{w, w^{'}} (z) = \frac{1}{1 - (1 - P (w) - P (w^{'})) z} - \frac{S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)}{D_{w, w^{'}} (z)} . \end{matrix}

(80)

Next, we extract the coefficient of

z^{n}

\begin{matrix} [z^{n}] Δ_{w, w^{'}} (z) & = \frac{1}{2 π i} \oint Δ_{w, w^{'}} (z) \frac{d z}{z^{n + 1}}, \end{matrix}

(81)

where the path of integration is a circle about the origin with counterclockwise orientation. We define

\begin{matrix} I_{n}^{w, w^{'}} (ρ) = \frac{1}{2 π i} \int_{| z | = ρ} Δ_{w, w^{'}} (z) \frac{d z}{z^{n + 1}}, \end{matrix}

(82)

The above integrand has poles at

z = 0

,

z = α_{w, w^{'}}

(as in (46)), and

z = \frac{1}{1 - P (w) - P (w^{'})}

. We have chosen

ρ

such that the poles are all inside the circle

| z | = ρ

. It follows that

\begin{matrix} I_{n}^{w, w^{'}} (ρ) & = {Res}_{z = 0} \frac{Δ_{w, w^{'}} (z)}{z^{n + 1}} + {Res}_{z = α_{w, w^{'}}} \frac{Δ_{w, w^{'}} (z)}{z^{n + 1}} + {Res}_{z = \frac{1}{1 - P (w) - P (w^{'})}} \frac{Δ_{w, w^{'}} (z)}{z^{n + 1}}, \end{matrix}

(83)

and the residues give us the following.

\begin{matrix} {Res}_{z = \frac{1}{1 - P (w) - P (w^{'})}} \frac{1}{(1 - (1 - P (w) - P (w^{'})) z) z^{n + 1}} = - {(1 - P (w) - P (w^{'}))}^{n + 1}, \end{matrix}

and

\begin{matrix} {Res}_{z = α_{w, w^{'}}} \frac{S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)}{D_{w, w^{'}} (z) z^{n + 1}} = \\ \frac{S_{w} (α_{w, w^{'}}) S_{w^{'}} (α_{w, w^{'}}) - S_{w, w^{'}} (α_{w, w^{'}}) S_{w^{'}, w} (α_{w, w^{'}})}{β_{w, w^{'}} α_{w, w^{'}}^{n + 1}}, \end{matrix}

where

β_{w, w^{'}}

is as in (47). Therefore, we get

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} [z^{n}] Δ_{w, w^{'}} (z) & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} I_{n}^{w, w^{'}} (ρ) \\ - \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{S_{w} (α_{w, w^{'}}) S_{w^{'}} (α_{w, w^{'}}) - S_{w, w^{'}} (α_{w, w^{'}}) S_{w^{'}, w} (α_{w, w^{'}})}{β_{w, w^{'}} α_{w, w^{'}}^{n + 1}} \\ + {(1 - P (w) - P (w^{'}))}^{n + 1}) . \end{matrix}

(84)

We now show that the above two terms are asymptotically small. □

Lemma 13.

There exists

ϵ > 0

where the sum

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (\frac{S_{w} (α_{w, w^{'}}) S_{w^{'}} (α_{w, w^{'}}) - S_{w, w^{'}} (α_{w, w^{'}}) S_{w^{'}, w} (α_{w, w^{'}})}{β_{w, w^{'}} α_{w, w^{'}}^{n + 1}} + {(1 - P (w) - P (w^{'}))}^{n + 1}) \end{matrix}

is of order O(

n^{- ϵ}

).

Proof.

We define

\begin{matrix} r_{w, w^{'}} (z) & = \frac{S_{w} (α_{w, w^{'}}) S_{w^{'}} (α_{w, w^{'}}) - S_{w, w^{'}} (α_{w, w^{'}}) S_{w^{'}, w} (α_{w, w^{'}})}{β_{w, w^{'}} α_{w, w^{'}}^{z}} + {(1 - P (w) - P (w^{'}))}^{z} . \end{matrix}

The Mellin transform of the above function is

\begin{matrix} r_{w, w^{'}}^{*} (s) = Γ (s) {log}^{- s} (\frac{1}{1 - P (w) - P (w^{'})}) + C_{w, w^{'}} Γ (s) {log}^{- s} (α_{w, w^{'}}), \end{matrix}

(85)

where

C_{w, w^{'}} = \frac{S_{w} (α_{w, w^{'}}) S_{w^{'}} (α_{w, w^{'}}) - S_{w, w^{'}} (α_{w, w^{'}}) S_{w^{'}, w} (α_{w, w^{'}})}{β_{w, w^{'}}}

. We note that

C_{w, w^{'}}

is negative and uniformly bounded from above for all

w, w^{'} \in A^{k}

. For a fixed s, we also have

\begin{matrix} {ln}^{- s} (\frac{1}{1 - P (w) - P (w^{'})}) & = {ln}^{- s} (1 + P (w) + P (w^{'}) + O (p^{2 k})) \\ = {(P (w) + P (w^{'}) + O (p^{2 k}))}^{- s} \\ = {(P (w) + P (w^{'}))}^{- s} {(1 + O (p^{k}))}^{- s} \\ = {(P (w) + P (w^{'}))}^{- s} (1 + O (p^{k})), \end{matrix}

(86)

and

\begin{matrix} {ln}^{- s} (α_{w, w^{'}}) & = (\frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w) \\ + \frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w^{'}) + O (p^{2 k}))^{- s} \\ = (\frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w) \\ + \frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w^{'}))^{- s} (1 + O (p^{k})) . \end{matrix}

(87)

Therefore, we have

\begin{matrix} r_{w, w^{'}}^{*} (s) & = Γ (s) {(P (w) + P (w^{'}))}^{- s} (1 + O (p^{k})) \\ - Γ (s) (\frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w) \\ + \frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w^{'}))^{- s} (1 + O (p^{k})) O (1) . \end{matrix}

(88)

To find the Mellin strip for the sum

\sum_{w \in A^{k}} r_{w, w^{'}}^{*} (s)

, we first note that

\begin{matrix} {(x + y)}^{a} & \leq x^{a} + y^{a}, \text{ for any real } x, y > 0 \text{ and } a \leq 1 . \end{matrix}

Since

- ℜ (s) < 1

, we have

\begin{matrix} {(P (w) + P (w^{'}))}^{- ℜ (s)} \leq P {(w)}^{- ℜ (s)} + P {(w^{'})}^{- ℜ (s)}, \end{matrix}

(89)

and

\begin{matrix} {(\frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w) + \frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w^{'}))}^{- ℜ (s)} \\ \leq {(\frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w))}^{- ℜ (s)} \\ + {(\frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)} P (w^{'}))}^{- ℜ (s)} . \end{matrix}

(90)
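The elementary inequality (x + y)^a ≤ x^a + y^a for x, y > 0 and a ≤ 1, which drives (89) and (90), is easy to probe numerically. The following short sketch evaluates the gap x^a + y^a - (x + y)^a on a small grid; the grid values and the function name are arbitrary choices made for illustration, and the check is of course no substitute for the one-line proof (concavity for 0 ≤ a ≤ 1, monotonicity for a < 0).

```python
def subadditivity_gap(x, y, a):
    """Return x^a + y^a - (x + y)^a, which should be non-negative for a <= 1."""
    return x ** a + y ** a - (x + y) ** a


values = [0.01, 0.1, 0.5, 1.0, 3.0]
exponents = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0]

# Smallest gap over the grid; equality (up to rounding) is attained at a = 1.
worst = min(subadditivity_gap(x, y, a)
            for x in values for y in values for a in exponents)
print(worst)
```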

Therefore, we get

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} | r_{w, w^{'}}^{*} (s) | & \leq | Γ (s) | O (1) \\ (\sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} P {(w)}^{- ℜ (s)} (1 - {(\frac{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)}{S_{w^{'}} (1) - S_{w, w^{'}} (1)})}^{ℜ (s)}) \\ + \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} P {(w^{'})}^{- ℜ (s)} (1 - {(\frac{S_{w} (1) S_{w^{'}} (1) - S_{w, w^{'}} (1) S_{w^{'}, w} (1)}{S_{w} (1) - S_{w^{'}, w} (1)})}^{ℜ (s)})) \\ \leq {(q^{k})}^{- ℜ (s) - 1} | Γ (s) | O (1) \end{matrix}

\begin{matrix} (\sum_{\begin{matrix} w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w \in A^{k}} P (w) (1 - {(S_{w} (1))}^{ℜ (s)} {(1 - \frac{S_{w, w^{'}} (1)}{S_{w^{'}} (1)})}^{- ℜ (s)}) \end{matrix}

(91)

\begin{matrix} + \sum_{\begin{matrix} w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w \in A^{k}} P (w) S_{w, w^{'}} {(1)}^{ℜ (s)} {(\frac{S_{w^{'}} (1) - S_{w, w^{'}} (1)}{S_{w^{'}, w} (1)})}^{- ℜ (s)} \end{matrix}

(92)

\begin{matrix} + \sum_{\begin{matrix} w \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w^{'} \in A^{k}} P (w^{'}) (1 - {(S_{w^{'}} (1))}^{ℜ (s)} {(1 - \frac{S_{w^{'}, w} (1)}{S_{w} (1)})}^{- ℜ (s)}) \end{matrix}

(93)

\begin{matrix} + \sum_{\begin{matrix} w \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w^{'} \in A^{k}} P (w^{'}) S_{w^{'}, w} {(1)}^{ℜ (s)} {(\frac{S_{w} (1) - S_{w^{'}, w} (1)}{S_{w, w^{'}} (1)})}^{- ℜ (s)}) . \end{matrix}

(94)

By Lemma 10, with high probability, a randomly selected w has the property

S_{w, w^{'}} (1) = O (δ^{k})

, and thus

{(1 - \frac{S_{w, w^{'}} (1)}{S_{w^{'}} (1)})}^{- ℜ (s)} = 1 + O (δ^{k}) .

With that and by Lemma 8, for most words w,

1 - S_{w} {(1)}^{ℜ (s)} (1 + O (δ^{k})) = O (δ^{k}) .

Therefore, both sums (91) and (93) are of the form

(2^{k} - 1) O (δ^{k})

. The sums (92) and (94) are also of order

(2^{k} - 1) O (δ^{k})

by Lemma 10. Combining all these terms we will obtain

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} | r_{w, w^{'}}^{*} (s) | & \leq (2^{k} - 1) {(q^{k})}^{- ℜ (s) - 1} | Γ (s) | O (δ^{k}) O (1) . \end{matrix}

(95)

By the inverse Mellin transform, for

k = a log n

,

M = a log 2 + ϵ

and

c \in (- 1, M)

, we have

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} r_{w, w^{'}} (z) & = \frac{1}{2 π i} \int_{c - i \infty}^{c + i \infty} (\sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} r_{w, w^{'}}^{*} (s)) z^{- s} d s = O (z^{- M}) O (2^{k}) \\ = O (z^{- ϵ}) . \end{matrix}

(96)

□

In the following lemma we show that the first term in (84) is asymptotically small.

Lemma 14.

Recall that

\begin{matrix} I_{n}^{w, w^{'}} (ρ) & = \frac{1}{2 π i} \int_{| z | = ρ} Δ_{w, w^{'}} (z) \frac{d z}{z^{n + 1}} . \end{matrix}

We have

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} I_{n}^{w, w^{'}} (ρ) & = O (ρ^{- n + 2 k} δ^{k}) . \end{matrix}

(97)

Proof.

First note that

\begin{matrix} Δ_{w, w^{'}} (z) & = \frac{1}{1 - (1 - P (w) - P (w^{'})) z} - \frac{S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)}{D_{w, w^{'}} (z)} \\ = \frac{z P (w) (S_{w, w^{'}} (z) S_{w^{'}, w} (z) - S_{w} (z) S_{w^{'}} (z) + z^{k - 1} S_{w^{'}} (z) - z^{k - 1} S_{w, w^{'}} (z))}{(1 - (1 - P (w) - P (w^{'})) z) D_{w, w^{'}} (z)} \\ + \frac{z P (w^{'}) (S_{w^{'}, w} (z) S_{w, w^{'}} (z) - S_{w^{'}} (z) S_{w} (z) + z^{k - 1} S_{w} (z) - z^{k - 1} S_{w^{'}, w} (z))}{(1 - (1 - P (w) - P (w^{'})) z) D_{w, w^{'}} (z)} . \end{matrix}

(98)

We saw in (73) that

| 1 - (1 - P (w^{'})) z | \geq c_{2}

, and therefore, it follows that

\begin{matrix} | 1 - (1 - P (w) - P (w^{'})) z | \geq c_{1} \end{matrix}

(99)

For

| z | = ρ

,

| D_{w, w^{'}} (z) |

is also bounded below, as follows

\begin{matrix} | D_{w, w^{'}} (z) | & = | (1 - z) (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)) \\ + z^{k} (P (w) (S_{w^{'}} (z) - S_{w, w^{'}} (z)) + P (w^{'}) (S_{w} (z) - S_{w^{'}, w} (z))) | \\ \geq | (1 - z) (S_{w} (z) S_{w^{'}} (z) - S_{w, w^{'}} (z) S_{w^{'}, w} (z)) | \\ - |z^{k}| |(P (w) (S_{w^{'}} (z) - S_{w, w^{'}} (z)) + P (w^{'}) (S_{w} (z) - S_{w^{'}, w} (z)))| \\ \geq (ρ - 1) β - {(p ρ)}^{k} (\frac{2 (1 + p ρ)}{1 - p ρ}), \end{matrix}

(100)

which is bounded away from zero by the assumption of Lemma 7. Additionally, we show that the numerator in (98) is bounded above, as follows

\begin{matrix} | S_{w, w^{'}} (z) S_{w^{'}, w} (z) - S_{w} (z) S_{w^{'}} (z) + z^{k - 1} S_{w^{'}} (z) - z^{k - 1} S_{w, w^{'}} (z) | \leq \\ | S_{w^{'}} (z) (z^{k - 1} - S_{w} (z)) | + | S_{w, w^{'}} (z) (S_{w^{'}, w} (z) - z^{k - 1}) | \\ \leq S_{w^{'}} (ρ) ((S_{w} (ρ) - 1) + O (ρ^{k})) + S_{w, w^{'}} (ρ) (S_{w^{'}, w} (ρ) + O (ρ^{k})) . \end{matrix}

(101)

This yields

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} | I_{n}^{w, w^{'}} | & \leq O (ρ^{- n}) \sum_{\begin{matrix} w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} S_{w^{'}} (ρ) \sum_{w \in A^{k}} P (w) ((S_{w} (ρ) - 1) + O (ρ^{k})) \\ + O (ρ^{- n}) \sum_{\begin{matrix} w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} \sum_{w \in A^{k}} P (w) S_{w, w^{'}} (ρ) (S_{w^{'}, w} (ρ) + O (ρ^{k})) . \end{matrix}

(102)

By (75), the first term above is of order

(2^{k} - 1) O (ρ^{- n + k})

and by Lemma 10 and an analysis similar to (75), the second term yields

(2^{k} - 1) O (ρ^{- n + k})

as well. Finally, we have

\begin{matrix} \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} | I_{n}^{w, w^{'}} | & \leq O (ρ^{- n + 2 k} δ^{k}) . \end{matrix}

This bound goes to zero asymptotically for

k = Θ (log n)

. □

This lemma completes our proof of Theorem 2.

3.5. Asymptotic Analysis of the kth Prefix Complexity

We finally proceed to analyzing the asymptotic moments of the kth Prefix Complexity. The results obtained hold true for the moments of the kth Subword Complexity. Our methodology involves poissonization, saddle point analysis (the complex version of Laplace’s method [23]), and depoissonization.

Lemma 15

(Jacquet and Szpankowski, 1998). Let

\tilde{G} (z)

be the Poisson transform of a sequence

g_{n}

. If

\tilde{G} (z)

is analytic in a linear cone

S_{θ}

with

θ < π / 2

, and if the following two conditions hold:

(I) For

z \in S_{θ}

and real values B,

r > 0

, ν

\begin{matrix} | z | > r \to | \tilde{G} (z) | \leq B | z^{ν} | Ψ (| z |), \end{matrix}

(103)

where

Ψ (x)

is such that, for fixed t,

{lim}_{x \to \infty} \frac{Ψ (t x)}{Ψ (x)} = 1

;

(II) For

z \notin S_{θ}

and

A, α < 1

\begin{matrix} | z | > r \to | \tilde{G} (z) e^{z} | \leq A e^{α | z |} . \end{matrix}

(104)

Then, for every non-negative integer n, we have

g_{n} = \tilde{G} (n) + O (n^{ν - 1} Ψ (n)) .

On the Expected Value: To transform the sequence of interest,

{(E [{\hat{X}}_{n, k}])}_{n \geq 0}

, into a Poisson model, we recall that in (25) we found

\begin{matrix} E [{\hat{X}}_{n, k}] & = \sum_{w \in A^{k}} (1 - {(1 - P (w))}^{n}) . \end{matrix}

Thus, the Poisson transform is

\begin{matrix} {\tilde{E}}_{k} (z) & = \sum_{n = 0}^{\infty} E [{\hat{X}}_{n, k}] \frac{z^{n}}{n!} e^{- z} \\ = \sum_{n = 0}^{\infty} \sum_{w \in A^{k}} (1 - {(1 - P (w))}^{n}) \frac{z^{n}}{n!} e^{- z} \\ = \sum_{w \in A^{k}} (1 - e^{- z P (w)}) . \end{matrix}

(105)
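The closed form in (105) can be checked directly by summing the Poisson series numerically. The sketch below does this for small k, grouping the 2^k words by their number of ones so that P(w) = p^j q^{k-j} occurs with multiplicity C(k, j). The truncation length, the parameter values, and the function names are assumptions made for the illustration.

```python
from math import comb, exp


def poisson_transform_series(k, p, z, terms=400):
    """Truncated sum over n of E[X_hat_{n,k}] * z^n / n! * e^{-z}."""
    q = 1.0 - p
    total = 0.0
    weight = exp(-z)                    # z^n / n! * e^{-z} at n = 0
    for n in range(terms):
        e_n = sum(comb(k, j) * (1.0 - (1.0 - p ** j * q ** (k - j)) ** n)
                  for j in range(k + 1))
        total += e_n * weight
        weight *= z / (n + 1)           # advance to the next Poisson weight
    return total


def poisson_transform_closed(k, p, z):
    """Closed form: sum over words w of 1 - exp(-z P(w))."""
    q = 1.0 - p
    return sum(comb(k, j) * (1.0 - exp(-z * p ** j * q ** (k - j)))
               for j in range(k + 1))


print(poisson_transform_series(5, 0.7, 20.0), poisson_transform_closed(5, 0.7, 20.0))
```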

To asymptotically evaluate this harmonic sum, we turn our attention to the Mellin Transform once more. The Mellin transform of

{\tilde{E}}_{k} (z)

is

\begin{matrix} {\tilde{E}}_{k}^{*} (s) & = - Γ (s) \sum_{w \in A^{k}} P {(w)}^{- s} \\ = - Γ (s) {(p^{- s} + q^{- s})}^{k}, \end{matrix}

(106)

which has the fundamental strip

s \in 〈 - 1, 0 〉

. For

c \in (- 1, 0)

, the inverse Mellin integral is the following

\begin{matrix} {\tilde{E}}_{k} (z) & = \frac{1}{2 π i} \int_{c - i \infty}^{c + i \infty} {\tilde{E}}_{k}^{*} (s) \cdot z^{- s} d s \\ = \frac{- 1}{2 π i} \int_{c - i \infty}^{c + i \infty} z^{- s} Γ (s) {(p^{- s} + q^{- s})}^{k} d s \\ = \frac{- 1}{2 π i} \int_{c - i \infty}^{c + i \infty} Γ (s) e^{- k (s \frac{log z}{k} - log (p^{- s} + q^{- s}))} d s \\ = \frac{- 1}{2 π i} \int_{c - i \infty}^{c + i \infty} Γ (s) e^{- k h (s)} d s, \end{matrix}

(107)

where we define

h (s) = \frac{s}{a} - log (p^{- s} + q^{- s})

for

k = a log z

. We emphasize that the above integral involves k, and k grows with n. We evaluate the integral through the saddle point analysis. Therefore, we choose the line of integration to cross the saddle point

r_{0}

. To find the saddle point

r_{0}

, we let

h^{'} (r_{0}) = 0

, and we obtain

\begin{matrix} {(p / q)}^{- r_{0}} = \frac{a log q^{- 1} - 1}{1 - a log p^{- 1}}, \end{matrix}

(108)

and therefore,

\begin{matrix} r_{0} = \frac{- 1}{log p / q} log (\frac{a log q^{- 1} - 1}{1 - a log p^{- 1}}), \end{matrix}

(109)

where

\frac{1}{log q^{- 1}} < a < \frac{1}{log p^{- 1}}

.

By (108) and the fact that

{(p / q)}^{i t_{j}} = 1

for

t_{j} = \frac{2 π j}{log p / q}

and

j \in Z

, we can see that there are actually infinitely many saddle points

z_{j}

of the form

r_{0} + i t_{j}

on the line of integration.
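The saddle point formula (109) and the periodicity of the saddle points can be verified numerically. The sketch below evaluates h'(s) = 1/a + (p^{-s} log p + q^{-s} log q)/(p^{-s} + q^{-s}) at r_0 and at a few points r_0 + i t_j. It assumes natural logarithms and p > q, and the function names and the sample values p = 0.7, a = 1.5 are chosen only for illustration.

```python
from math import log, pi


def saddle_point(a, p):
    """Real saddle point r_0 from (109), for 1/log(1/q) < a < 1/log(1/p)."""
    q = 1.0 - p
    ratio = (a * log(1.0 / q) - 1.0) / (1.0 - a * log(1.0 / p))
    return -log(ratio) / log(p / q)


def h_prime(s, a, p):
    """Derivative of h(s) = s/a - log(p^{-s} + q^{-s}); s may be complex."""
    q = 1.0 - p
    ps, qs = p ** (-s), q ** (-s)
    return 1.0 / a + (ps * log(p) + qs * log(q)) / (ps + qs)


p, a = 0.7, 1.5
r0 = saddle_point(a, p)
print("r0 =", r0, " h'(r0) =", h_prime(r0, a, p))

# The points r0 + i*t_j with t_j = 2*pi*j / log(p/q) are saddle points as well.
for j in (-2, 1, 3):
    t_j = 2.0 * pi * j / log(p / (1.0 - p))
    print(j, abs(h_prime(complex(r0, t_j), a, p)))
```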

We remark that the location of

r_{0}

depends on the value of a. We have

r_{0} \to \infty

as

a \to \frac{1}{log q^{- 1}}

, and

r_{0} \to - \infty

as

a \to \frac{1}{log p^{- 1}}

. We divide the analysis into three parts, for the three ranges

r_{0} \in (0, \infty)

,

r_{0} \in (- 1, 0)

, and

r_{0} \in (- \infty, - 1)

.

In the first range, which corresponds to

\begin{matrix} \frac{1}{log q^{- 1}} < a < \frac{2}{log q^{- 1} + log p^{- 1}}, \end{matrix}

(110)

we perform a residue analysis, taking into account the dominant pole at

s = 0

. In the second range, we have

\begin{matrix} \frac{2}{log q^{- 1} + log p^{- 1}} < a < \frac{1}{q log q^{- 1} + p log p^{- 1}}, \end{matrix}

(111)

and we get the asymptotic result through the saddle point method. The last range corresponds to

\begin{matrix} \frac{1}{q log q^{- 1} + p log p^{- 1}} < a < \frac{1}{log p^{- 1}}, \end{matrix}

(112)

and we approach it with a combination of residue analysis at

s = - 1

, and the saddle point method. We now proceed by stating the proof of Theorem 3.
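For concrete values of n, k, and p, the three ranges (110) to (112) can be told apart by computing a = k / log n and comparing it with the four thresholds. The sketch below does exactly this; it assumes natural logarithms and p > q, and the function names and the returned labels are hypothetical.

```python
from math import log


def regime_thresholds(p):
    """Return the four boundaries of a appearing in (110)-(112), in increasing order."""
    q = 1.0 - p
    lp, lq = log(1.0 / p), log(1.0 / q)
    return (1.0 / lq,
            2.0 / (lq + lp),
            1.0 / (q * lq + p * lp),
            1.0 / lp)


def classify(n, k, p):
    """Label the range of a = k / log n; boundary values fall into the later range."""
    a = k / log(n)
    a0, a1, a2, a3 = regime_thresholds(p)
    if not a0 < a < a3:
        return "outside"
    if a < a1:
        return "first range (110)"
    if a < a2:
        return "second range (111)"
    return "third range (112)"


print(regime_thresholds(0.7))
print(classify(10 ** 6, 20, 0.7))
```

At the two interior thresholds the saddle point r_0 equals 0 and -1, respectively, which is why the analysis changes there.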

Proof of Theorem 3.

We begin by proving part

ii

which requires a saddle point analysis. We rewrite the inverse Mellin transform with integration line at

ℜ (s) = r_{0}

as

\begin{matrix} {\tilde{E}}_{k} (z) & = \frac{- 1}{2 π} \int_{- \infty}^{\infty} z^{- (r_{0} + i t)} Γ (r_{0} + i t) {(p^{- (r_{0} + i t)} + q^{- (r_{0} + i t)})}^{k} d t \\ = \frac{- 1}{2 π} \int_{- \infty}^{\infty} Γ (r_{0} + i t) e^{- k ((r_{0} + i t) \frac{log z}{k} - log (p^{- (r_{0} + i t)} + q^{- (r_{0} + i t)}))} d t . \end{matrix}

(113)

Step one: Saddle points’ contribution to the integral estimation

First, we show that the saddle points with

| t_{j} | > \sqrt{log n}

do not have a significant asymptotic contribution to the integral. To show this, we let

\begin{matrix} T_{k} (z) & = \int_{| t | > \sqrt{log n}} z^{- r_{0} - i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t . \end{matrix}

(114)

Since

| Γ (r_{0} + i t) | = O ({| t |}^{r_{0} - \frac{1}{2}} e^{\frac{- π | t |}{2}})

as

| t | \to \infty

, we observe that

\begin{matrix} T_{k} (z) & = O (z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} \int_{\sqrt{log n}}^{\infty} t^{r_{0} / 2 - 1 / 2} e^{- π t / 2} d t) \\ = O (z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} {(log n)}^{r_{0} / 4 - 1 / 4} \int_{\sqrt{log n}}^{\infty} e^{- π t / 2} d t) \\ = O (z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} {(log n)}^{r_{0} / 4 - 1 / 4} e^{- π \sqrt{log n} / 2}) \\ = O ({(log n)}^{r_{0} / 4 - 1 / 4} e^{- π \sqrt{log n} / 2}), \end{matrix}

(115)

which is very small for large n. Note that for

t \in (\sqrt{log n}, \infty)

,

t^{r_{0} / 2 - 1 / 2}

is decreasing, and bounded above by

{(log n)}^{r_{0} / 4 - 1 / 4}

.
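The exponential decay of the gamma function along vertical lines, used in Step one, can be illustrated without a complex gamma routine in the special case r_0 = -1/2. There the reflection formula gives |Γ(1/2 + it)|² = π / cosh(πt), and Γ(s + 1) = s Γ(s) then yields |Γ(-1/2 + it)| exactly. The sketch below compares this with the envelope |t|^{r_0 - 1/2} e^{-π|t|/2}; the choice r_0 = -1/2 and the function names are assumptions made purely for illustration.

```python
from math import cosh, exp, pi, sqrt


def abs_gamma_minus_half(t):
    """Exact |Gamma(-1/2 + i t)| via the reflection and recurrence formulas."""
    return sqrt(pi / cosh(pi * t)) / sqrt(0.25 + t * t)


def decay_envelope(t, r0=-0.5):
    """The envelope |t|^(r0 - 1/2) * exp(-pi |t| / 2)."""
    return abs(t) ** (r0 - 0.5) * exp(-pi * abs(t) / 2.0)


# The ratio stabilizes near sqrt(2*pi), so the envelope captures the true decay rate.
for t in (2.0, 5.0, 10.0, 20.0):
    print(t, abs_gamma_minus_half(t) / decay_envelope(t))
```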

Step two: Partitioning the integral

There are now only finitely many saddle points to work with. We split the integral range into sub-intervals, each of which contains exactly one saddle point. This way, each integral has a contour traversing a single saddle point, and we will be able to estimate the dominant contribution in each integral from a small neighborhood around the saddle point. Assuming that

j^{*}

is the largest j for which

\frac{2 π j}{log p / q} \leq \sqrt{log n}

, we split the integral

{\tilde{E}}_{k} (z)

as follows

\begin{matrix} {\tilde{E}}_{k} (z) & = - \frac{1}{2 π} (\sum_{| j | < j^{*}} \int_{| t - t_{j} | \leq \frac{π}{log p / q}} z^{- r_{0} - i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t) \\ - \frac{1}{2 π} \int_{\frac{π}{log p / q} \leq | t_{j}^{*} | < \sqrt{log n}} Γ (r_{0} + i t) z^{- r_{0} - i t} {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t . \end{matrix}

(116)

By the same argument as in (115), the second term in (116) is also asymptotically negligible. Therefore, we are only left with

\begin{matrix} {\tilde{E}}_{k} (z) & = \sum_{| j | < j^{*}} S_{j} (z), \end{matrix}

(117)

where

S_{j} (z) = - \frac{1}{2 π} \int_{| t - t_{j} | \leq \frac{π}{log p / q}} z^{- r_{0} - i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t

.

Step three: Splitting the saddle contour

For each integral

S_{j}

, we write the expansion of

h (t)

about

t_{j}

, as follows

\begin{matrix} h (t) & = h (t_{j}) + \frac{1}{2} h^{″} (t_{j}) {(t - t_{j})}^{2} + O ({(t - t_{j})}^{3}) . \end{matrix}

(118)

The main contribution to the integral estimate should come from a small integration path that reduces

k h (t)

to its quadratic expansion about

t_{j}

. In other words, we want the integration path to be such that

\begin{matrix} k {(t - t_{j})}^{2} \to \infty, and k {(t - t_{j})}^{3} \to 0 . \end{matrix}

(119)

The above conditions are true when

| t - t_{j} | ≫ k^{- 1 / 2}

and

| t - t_{j} | ≪ k^{- 1 / 3}

. Thus, we choose the integration path to be

| t - t_{j} | \leq k^{- 2 / 5}

. Therefore, we have

\begin{matrix} S_{j} (z) & = - \frac{1}{2 π} \int_{| t - t_{j} | \leq k^{- 2 / 5}} z^{- r_{0} - i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t \\ - \frac{1}{2 π} \int_{k^{- 2 / 5} < | t - t_{j} | < \frac{π}{log p / q}} z^{- r_{0} - i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t . \end{matrix}

(120)
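As a quick numeric sanity check of the two requirements in (119), the following sketch evaluates k δ² and k δ³ at the chosen radius δ = k^{-2/5}. The function name `path_scales` is ours and is used for illustration only.

```python
def path_scales(k: float):
    """Return (k * delta**2, k * delta**3) for the radius delta = k**(-2/5)."""
    delta = k ** (-2 / 5)
    return k * delta ** 2, k * delta ** 3


for k in (10, 10 ** 3, 10 ** 6):
    quad, cubic = path_scales(k)
    # quad equals k**(1/5) and grows with k; cubic equals k**(-1/5) and shrinks.
    print(f"k={k:>8}: k*delta^2={quad:.3f}, k*delta^3={cubic:.3f}")
```

The quadratic term grows like k^{1/5} while the cubic term decays like k^{-1/5}, so the radius sits between the two scales k^{-1/2} and k^{-1/3}.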

Saddle Tails Pruning.

We show that the integral is small for

k^{- 2 / 5} < | t - t_{j} | < \frac{π}{log p / q}

. We define

\begin{matrix} S_{j}^{(1)} (z) = - \frac{1}{2 π} \int_{k^{- 2 / 5} < | t - t_{j} | < \frac{π}{log p / q}} z^{- r_{0} + i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t . \end{matrix}

(121)

Note that for

| t - t_{j} | \leq \frac{π}{log p / q}

, we have

\begin{matrix} | p^{- r_{0} - i t} + q^{- r_{0} - i t} | & = (p^{- r_{0}} + q^{- r_{0}}) \sqrt{1 - \frac{2 p^{- r_{0}} q^{- r_{0}}}{{(p^{- r_{0}} + q^{- r_{0}})}^{2}} (1 - cos (t log p / q))} \\ & \leq (p^{- r_{0}} + q^{- r_{0}}) (1 - \frac{p^{- r_{0}} q^{- r_{0}}}{{(p^{- r_{0}} + q^{- r_{0}})}^{2}} (1 - cos ((t - t_{j}) log p / q))) \\ & \text{since } \sqrt{1 - x} \leq 1 - \frac{x}{2} \text{ for } x \in [0, 1] \\ & \leq (p^{- r_{0}} + q^{- r_{0}}) (1 - \frac{2 p^{- r_{0}} q^{- r_{0}}}{π^{2} {(p^{- r_{0}} + q^{- r_{0}})}^{2}} {((t - t_{j}) log p / q)}^{2}) \\ & \text{since } 1 - cos x \geq \frac{2 x^{2}}{π^{2}} \text{ for } | x | \leq π \\ & \leq (p^{- r_{0}} + q^{- r_{0}}) e^{- γ {(t - t_{j})}^{2}}, \end{matrix}

(122)

where

γ = \frac{2 p^{- r_{0}} q^{- r_{0}} {log}^{2} p / q}{π^{2} {(p^{- r_{0}} + q^{- r_{0}})}^{2}}

. Thus,

\begin{matrix} S_{j}^{(1)} (z) & = O (z^{- r_{0}} | Γ (r_{0} + i t) | \int_{k^{- 2 / 5} < | t - t_{j} | < \frac{π}{log p / q}} {| p^{- r_{0} - i t} + q^{- r_{0} - i t} |}^{k} d t) \\ & = O (z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} \int_{k^{- 2 / 5}}^{\infty} e^{- γ k u^{2}} d u) \\ & = O (z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} k^{- 3 / 5} e^{- γ k^{1 / 5}}), \text{ since } \operatorname{erfc} (x) = O (e^{- x^{2}} / x) . \end{matrix}

(123)
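The Gaussian bound in (122) can be checked numerically. The sketch below samples the window |t - t_j| ≤ π / log(p/q) and reports the largest value of the left-hand side minus the right-hand side; a non-positive result (up to rounding) is what the inequality predicts. The parameter values and the helper names `modulus`, `gaussian_bound`, and `worst_gap` are illustrative assumptions, and p > q is assumed so that log(p/q) > 0.

```python
import math


def modulus(p: float, r0: float, t: float) -> float:
    """Return |p^(-r0-it) + q^(-r0-it)| with q = 1 - p."""
    q = 1.0 - p
    s = complex(r0, t)
    return abs(p ** (-s) + q ** (-s))


def gaussian_bound(p: float, r0: float, u: float) -> float:
    """Right-hand side of (122) at the offset u = t - t_j."""
    q = 1.0 - p
    a, b = p ** (-r0), q ** (-r0)
    gamma = 2 * a * b * math.log(p / q) ** 2 / (math.pi ** 2 * (a + b) ** 2)
    return (a + b) * math.exp(-gamma * u * u)


def worst_gap(p: float, r0: float, j: int, steps: int = 200) -> float:
    """Largest value of modulus - bound over |t - t_j| <= pi / log(p/q)."""
    log_ratio = math.log(p / (1.0 - p))
    t_j = 2.0 * math.pi * j / log_ratio
    half_width = math.pi / log_ratio
    offsets = [half_width * i / steps for i in range(-steps, steps + 1)]
    return max(modulus(p, r0, t_j + u) - gaussian_bound(p, r0, u) for u in offsets)


print(worst_gap(0.9, -1.0, 3))
```

The two sides agree at t = t_j and the bound is loosest at the edge of the window, where the modulus drops to |p^{-r_0} - q^{-r_0}|.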

Central Approximation.

Over the main path, the integrals are of the form

\begin{matrix} S_{j}^{(0)} (z) & = - \frac{1}{2 π} \int_{| t - t_{j} | \leq k^{- 2 / 5}} Γ (r_{0} + i t) z^{- r_{0} + i t} {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t \\ = - \frac{1}{2 π} \int_{| t - t_{j} | \leq k^{- 2 / 5}} Γ (r_{0} + i t) e^{- k h (t)} d t . \end{matrix}

We have

\begin{matrix} h^{^{''}} (t_{j}) = \frac{{log}^{2} p / q}{{({(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2})}^{2}}, \end{matrix}

(124)

and

\begin{matrix} p^{- r_{0} - i t_{j}} + q^{- r_{0} - i t_{j}} = p^{- i t_{j}} (p^{- r_{0}} + q^{- r_{0}}) . \end{matrix}

(125)
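Both identities can be checked numerically. The sketch below assumes that the second derivative in (124) is that of log(p^{-s} + q^{-s}) taken at s = r_0 (the function h is defined earlier in the paper, so this reading is an assumption), and it verifies the factorization (125) at several saddle points t_j = 2πj / log(p/q). All parameter values and helper names are illustrative.

```python
import math

p, q, r0 = 0.9, 0.1, 0.5  # illustrative values with p > q
log_ratio = math.log(p / q)


def log_sum(s: float) -> float:
    """log(p^(-s) + q^(-s)) for real s."""
    return math.log(p ** (-s) + q ** (-s))


def second_derivative(f, x: float, step: float = 1e-3) -> float:
    """Central finite-difference estimate of f''(x)."""
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / step ** 2


def h2_closed_form() -> float:
    """Right-hand side of (124)."""
    ratio = p / q
    return log_ratio ** 2 / (ratio ** (-r0 / 2) + ratio ** (r0 / 2)) ** 2


def factorization_error(j: int) -> float:
    """|LHS - RHS| of (125) at t_j = 2*pi*j / log(p/q)."""
    t_j = 2.0 * math.pi * j / log_ratio
    s = complex(r0, t_j)
    lhs = p ** (-s) + q ** (-s)
    rhs = p ** complex(0.0, -t_j) * (p ** (-r0) + q ** (-r0))
    return abs(lhs - rhs)


print(second_derivative(log_sum, r0), h2_closed_form())
print(max(factorization_error(j) for j in range(-5, 6)))
```

The factorization holds because q^{-i t_j} = p^{-i t_j} e^{2πij} at every saddle point.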

Therefore, by Laplace’s theorem (refer to [22]) we obtain

\begin{matrix} S_{j}^{(0)} (z) & = \frac{1}{\sqrt{2 π k h^{^{''}} (t_{j})}} Γ (r_{0} + i t_{j}) e^{- k h (t_{j})} (1 + O (k^{- 1 / 2})) \\ = \frac{{(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2}}{\sqrt[]{2 π} log p / q} \\ \times z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} Γ (r_{0} + i t_{j}) z^{- i t_{j}} p^{- i k t_{j}} k^{- 1 / 2} (1 + O (\frac{1}{\sqrt{k}})) . \end{matrix}

(126)

We finally sum over all j

(| j | < j^{*})

, and we get

\begin{matrix} {\tilde{E}}_{k} (z) & = \frac{{(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2}}{\sqrt[]{2 π} log p / q} \\ \times \sum_{| j | < j^{*}} z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} Γ (r_{0} + i t_{j}) z^{- i t_{j}} p^{- i k t_{j}} k^{- 1 / 2} (1 + O (\frac{1}{\sqrt{k}})) . \end{matrix}

(127)

We can rewrite

{\tilde{E}}_{k} (z)

as

\begin{matrix} {\tilde{E}}_{k} (z) & = Φ_{1} ((1 + a log p) {log}_{p / q} n) \frac{z^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})), \end{matrix}

(128)

where

ν = - r_{0} + a log (p^{- r_{0}} + q^{- r_{0}})

, and

\begin{matrix} Φ_{1} (x) & = \frac{{(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2}}{\sqrt[]{2 a π} log p / q} \sum_{| j | < j^{*}} Γ (r_{0} + i t_{j}) e^{- 2 π i j x} . \end{matrix}

(129)
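The passage from (127) to (128) rests on two elementary identities, taking z = n and k = a log n: the magnitude z^{-r_0} (p^{-r_0} + q^{-r_0})^k equals n^ν, and the oscillating factor z^{-i t_j} p^{-i k t_j} equals e^{-2πijx} with x = (1 + a log p) log_{p/q} n. The sketch below checks both numerically. The values of p, n, a, and r_0 are illustrative and are not claimed to satisfy the saddle point equation; k is treated as a real number.

```python
import cmath
import math

p, q = 0.9, 0.1        # illustrative source with p > q
n = 5000.0             # we take z = n
a, r0 = 1.2, 0.4       # illustrative values only
k = a * math.log(n)    # k treated as real for this check

base = p ** (-r0) + q ** (-r0)
nu = -r0 + a * math.log(base)


def magnitude_pair():
    """Return (n^(-r0) * base^k, n^nu), which should coincide."""
    return n ** (-r0) * base ** k, n ** nu


def phase_pair(j: int):
    """Return (n^(-i t_j) * p^(-i k t_j), exp(-2*pi*i*j*x)), which should coincide."""
    t_j = 2.0 * math.pi * j / math.log(p / q)
    lhs = n ** complex(0.0, -t_j) * p ** complex(0.0, -k * t_j)
    x = (1.0 + a * math.log(p)) * math.log(n) / math.log(p / q)
    rhs = cmath.exp(-2j * math.pi * j * x)
    return lhs, rhs


print(magnitude_pair())
print(phase_pair(2))
```

Since x enters only through e^{-2πijx}, the function Φ_1 is periodic in x with period 1.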

For part

ii

, we move the line of integration to

r_{0} \in (0, \infty)

. Note that in this range, we must consider the contribution of the pole at

s = 0

. We have

\begin{matrix} {\tilde{E}}_{k} (z) & = {Res}_{s = 0} {\tilde{E}}_{k}^{*} (s) z^{- s} + \int_{r_{0} - i \infty}^{r_{0} + i \infty} {\tilde{E}}_{k}^{*} (s) z^{- s} d s . \end{matrix}

(130)

Computing the residue at

s = 0

, and following the same analysis as in part i for the above integral, we arrive at

\begin{matrix} {\tilde{E}}_{k} (z) & = 2^{k} - Φ_{1} ((1 + a log p) {log}_{p / q} n) \frac{z^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})) . \end{matrix}

(131)

For part

iii

of Theorem 3, we shift the line of integration to

c_{0} \in (- 2, - 1)

, then we have

\begin{matrix} {\tilde{E}}_{k} (z) & = {Res}_{s = - 1} {\tilde{E}}_{k}^{*} (s) z^{- s} + \int_{c_{0} - i \infty}^{c_{0} + i \infty} {\tilde{E}}_{k}^{*} (s) z^{- s} d s \\ & = z + O (z^{- c_{0}} {(p^{- c_{0}} + q^{- c_{0}})}^{k}) \\ & = z^{a log 2} + O (z^{ν_{0}}), \end{matrix}

(132)

where

ν_{0} = - c_{0} + a log (p^{- c_{0}} + q^{- c_{0}}) < 1

.
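The leading terms 2^k and z appearing in (131) and (132) can be observed directly from the closed form of the poissonized mean, the sum over words w of 1 - e^{-P(w) z}. The sketch below evaluates that sum exactly by grouping the 2^k binary words according to how many letters have probability p. The function name `poissonized_mean` and the parameter values are illustrative assumptions.

```python
import math


def poissonized_mean(k: int, z: float, p: float) -> float:
    """Sum over all binary words w of length k of 1 - exp(-P(w) * z).

    Words with i letters of probability p share P(w) = p**i * q**(k - i),
    and there are C(k, i) of them.
    """
    q = 1.0 - p
    return sum(
        math.comb(k, i) * -math.expm1(-(p ** i) * (q ** (k - i)) * z)
        for i in range(k + 1)
    )


z = 1e6
for k in (3, 20, 200):
    print(k, poissonized_mean(k, z, 0.9), min(2.0 ** k, z))
```

For k much smaller than log z every word is almost surely present and the mean is close to 2^k; for k much larger than log z every word is rare and the mean is close to z. In between, the mean stays below min(2^k, z).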

Step four: Asymptotic depoissonization

To show that both conditions in (15) hold for

{\tilde{E}}_{k} (z)

, we extend the real values z to complex values

z = n e^{i θ}

, where

| θ | < π / 2

. To prove (103), we note that

\begin{matrix} | e^{- i θ (r_{0} + i t)} Γ (r_{0} + i t) | = O ({| t |}^{r_{0} - 1 / 2} e^{t θ - π | t | / 2}), \end{matrix}

(133)

and therefore

\begin{matrix} {\tilde{E}}_{k} (n e^{i θ}) & = \frac{1}{2 π} \int_{- \infty}^{\infty} e^{- i θ (r_{0} + i t)} n^{- r_{0} - i t} Γ (r_{0} + i t) {(p^{- r_{0} - i t} + q^{- r_{0} - i t})}^{k} d t \end{matrix}

(134)

is absolutely convergent for

| θ | < π / 2

. The same saddle point analysis applies here and we obtain

\begin{matrix} | {\tilde{E}}_{k} (z) | & \leq B \frac{| z^{ν} |}{\sqrt{log n}}, \end{matrix}

(135)

where

B = | Φ_{1} ((1 + a log p) {log}_{p / q} n) |

, and

ν

is as in (128). Condition (103) is therefore satisfied. To prove condition (104), we note that for a fixed k,

\begin{matrix} | {\tilde{E}}_{k} (z) e^{z} | \leq \sum_{w \in A^{k}} | e^{z} - e^{z (1 - P (w))} | \\ \leq 2^{k + 1} e^{| z | cos (θ)} . \end{matrix}

(136)
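The bound in (136) follows from the triangle inequality, and it can be spot-checked for a complex argument z = n e^{iθ}. The sketch below groups words by letter counts as before; the function name `scaled_poisson_mean` and the values of n, θ, k, and p are illustrative assumptions.

```python
import cmath
import math


def scaled_poisson_mean(k: int, z: complex, p: float) -> complex:
    """Return sum over words w of (e^z - e^{z (1 - P(w))}), grouped by letter counts."""
    q = 1.0 - p
    return sum(
        math.comb(k, i) * (cmath.exp(z) - cmath.exp(z * (1.0 - p ** i * q ** (k - i))))
        for i in range(k + 1)
    )


n, theta, k, p = 50.0, 1.2, 6, 0.8
z = cmath.rect(n, theta)  # z = n * e^{i * theta}
lhs = abs(scaled_poisson_mean(k, z, p))
rhs = 2 ** (k + 1) * math.exp(abs(z) * math.cos(theta))
print(lhs, rhs)
```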

Therefore, we have

\begin{matrix} E [{\hat{X}}_{n, k}] = {\tilde{E}}_{k} (n) + O (\frac{n^{ν - 1}}{\sqrt{log n}}) . \end{matrix}

(137)

This completes the proof of Theorem 3. □
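To see depoissonization at work on a concrete case, the sketch below compares the exact mean, the sum over words of 1 - (1 - P(w))^n, with its poissonized counterpart evaluated at z = n. The gap is non-negative because (1 - P)^n ≤ e^{-nP}, and an elementary estimate bounds it by (n/2)(p² + q²)^k. The helper names and parameter values are illustrative assumptions.

```python
import math


def class_probs(k: int, p: float):
    """Yield (number of words, word probability) for each letter-count class."""
    q = 1.0 - p
    for i in range(k + 1):
        yield math.comb(k, i), p ** i * q ** (k - i)


def exact_mean(k: int, n: int, p: float) -> float:
    """Sum over words w of 1 - (1 - P(w))**n."""
    return sum(c * -math.expm1(n * math.log1p(-prob)) for c, prob in class_probs(k, p))


def poissonized_mean(k: int, z: float, p: float) -> float:
    """Sum over words w of 1 - exp(-P(w) * z)."""
    return sum(c * -math.expm1(-prob * z) for c, prob in class_probs(k, p))


k, n, p = 10, 1000, 0.7
gap = exact_mean(k, n, p) - poissonized_mean(k, n, p)
bound = 0.5 * n * (p * p + (1.0 - p) ** 2) ** k
print(exact_mean(k, n, p), poissonized_mean(k, n, p), gap, bound)
```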

On the Second Factorial Moment: We poissonize the sequence

{(E [{({\hat{X}}_{n, k})}_{2}])}_{n \geq 0}

as well. By the analysis in (27),

\begin{matrix} E [{({\hat{X}}_{n, k})}_{2}] & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - {(1 - P (w))}^{n} - {(1 - P (w^{'}))}^{n} + {(1 - P (w) - P (w^{'}))}^{n}), \end{matrix}

which gives the following poissonized form

\begin{matrix} {\tilde{G}}_{k} (z) & = \sum_{n \geq 0} E [{({\hat{X}}_{n, k})}_{2}] \frac{z^{n}}{n!} e^{- z} \\ & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - e^{- P (w) z} - e^{- P (w^{'}) z} + e^{- (P (w) + P (w^{'})) z}) \\ & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - e^{- P (w^{'}) z}) (1 - e^{- P (w) z}) \\ & = {(\sum_{w \in A^{k}} (1 - e^{- P (w) z}))}^{2} - \sum_{w \in A^{k}} {(1 - e^{- P (w) z})}^{2} \\ & = {({\tilde{E}}_{k} (z))}^{2} - \sum_{w \in A^{k}} {(1 - e^{- P (w) z})}^{2} \\ & = {({\tilde{E}}_{k} (z))}^{2} - \sum_{w \in A^{k}} (1 - 2 e^{- P (w) z} + e^{- 2 P (w) z}) . \end{matrix}

(138)
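The algebra in (138) can be confirmed by brute force for a small k: the sum over ordered pairs of distinct words, in either its expanded or its factored form, equals the squared first-moment sum minus the sum of squares. The sketch below enumerates all 2^k words explicitly; the values of k, p, and z and the variable names are illustrative.

```python
import itertools
import math

k, p, z = 4, 0.7, 25.0
q = 1.0 - p

# Probability of every binary word of length k under a memoryless source.
probs = [math.prod(letters) for letters in itertools.product((p, q), repeat=k)]
m = len(probs)

# Expanded form: 1 - e^{-P z} - e^{-P' z} + e^{-(P + P') z} over ordered pairs w != w'.
expanded = sum(
    1.0 - math.exp(-probs[i] * z) - math.exp(-probs[j] * z)
    + math.exp(-(probs[i] + probs[j]) * z)
    for i in range(m) for j in range(m) if i != j
)

# Factored form: (1 - e^{-P z}) (1 - e^{-P' z}) over the same pairs.
f = [1.0 - math.exp(-prob * z) for prob in probs]
factored = sum(f[i] * f[j] for i in range(m) for j in range(m) if i != j)

# Closed form: square of the first-moment sum minus the diagonal terms.
closed = sum(f) ** 2 - sum(x * x for x in f)

print(expanded, factored, closed)
```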

We show that in all ranges of a the leftover sum in (138) has a lower order contribution to

{\tilde{G}}_{k} (z)

compared to

{({\tilde{E}}_{k} (z))}^{2}

. We define

\begin{matrix} {\tilde{L}}_{k} (z) = \sum_{w \in A^{k}} (1 - 2 e^{- P (w) z} + e^{- 2 P (w) z}) . \end{matrix}

(139)

In the first range for k, we take the Mellin transform of

{\tilde{L}}_{k} (z)

, which is

\begin{matrix} {\tilde{L}}_{k}^{*} (s) & = - 2 Γ (s) \sum_{w \in A^{k}} P {(w)}^{- s} + Γ (s) \sum_{w \in A^{k}} {(2 P (w))}^{- s} \\ & = - 2 Γ (s) {(p^{- s} + q^{- s})}^{k} + Γ (s) 2^{- s} {(p^{- s} + q^{- s})}^{k} \\ & = 2 Γ (s) {(p^{- s} + q^{- s})}^{k} (2^{- s - 1} - 1), \end{matrix}

(140)

and we note that the fundamental strip for this Mellin transform is

〈 - 2, 0 〉

as well. The inverse Mellin transform for

c \in (- 2, 0)

is

\begin{matrix} {\tilde{L}}_{k} (z) & = \frac{1}{2 π i} \int_{c - i \infty}^{c + i \infty} {\tilde{L}}_{k}^{*} (s) z^{- s} d s \\ = \frac{1}{π i} \int_{c - i \infty}^{c + i \infty} Γ (s) {(p^{- s} + q^{- s})}^{k} (2^{- s - 1} - 1) z^{- s} d s \end{matrix}

(141)
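The Mellin transform in (140) collapses the sum over words because, for a memoryless source, P(w) is a product of k letter probabilities, so the sum of P(w)^{-s} over all words of length k factors as (p^{-s} + q^{-s})^k. The sketch below checks this factorization for a complex s by direct enumeration; the values of k, p, and s are illustrative.

```python
import itertools
import math

k, p = 5, 0.8
q = 1.0 - p
s = complex(-0.7, 2.3)  # an arbitrary point in the strip -2 < Re(s) < 0

# Enumerate P(w) for every binary word of length k.
probs = [math.prod(letters) for letters in itertools.product((p, q), repeat=k)]

word_sum = sum(prob ** (-s) for prob in probs)
factored = (p ** (-s) + q ** (-s)) ** k

print(word_sum, factored)
```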

We note that this range of

r_{0}

corresponds to

\begin{matrix} \frac{2}{log q^{- 1} + log p^{- 1}} < a < \frac{p^{2} + q^{2}}{q^{2} log q^{- 1} + p^{2} log p^{- 1}} . \end{matrix}

(142)
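The endpoints in (142) are consistent with reading a as a function of the saddle point r_0 through a = (p^{-r_0} + q^{-r_0}) / (p^{-r_0} log p^{-1} + q^{-r_0} log q^{-1}). This closed form is an assumption here, inferred from the saddle point condition, and the function name `a_of_r0` is ours. The sketch below evaluates it at r_0 = 0 and r_0 = -2 and compares with the two endpoints of (142), for p > q.

```python
import math


def a_of_r0(p: float, r0: float) -> float:
    """Assumed relation between a = k / log n and the saddle point r0."""
    q = 1.0 - p
    wp, wq = p ** (-r0), q ** (-r0)
    return (wp + wq) / (wp * math.log(1.0 / p) + wq * math.log(1.0 / q))


p = 0.9
q = 1.0 - p
lower = 2.0 / (math.log(1.0 / q) + math.log(1.0 / p))
upper = (p * p + q * q) / (q * q * math.log(1.0 / q) + p * p * math.log(1.0 / p))

print(a_of_r0(p, 0.0), lower)    # r0 = 0 gives the left endpoint of (142)
print(a_of_r0(p, -2.0), upper)   # r0 = -2 gives the right endpoint of (142)
print(1.0 / math.log(1.0 / q), 1.0 / math.log(1.0 / p))  # limits as r0 -> +inf and -inf
```

Under this reading a decreases as r_0 increases, so the strip -2 < r_0 < 0 maps onto the interval in (142), while r_0 > 0 and r_0 < -2 map onto the two outer ranges of a.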

The integrand in (141) is quite similar to the one seen in (107). The only difference is the extra term

2^{- s - 1} - 1

. However, we notice that

2^{- s - 1} - 1

is analytic and bounded. Thus, we obtain the same saddle points with the real part as in (109) and the same imaginary parts in the form of

\frac{2 π i j}{log p / q}

,

j \in Z

. Thus, the same saddle point analysis for the integral in (107) applies to

{\tilde{L}}_{k} (z)

as well. To avoid repeating similar steps, we skip to the central approximation, where Laplace’s theorem (see [22]) gives

\begin{matrix} {\tilde{L}}_{k} (z) & = \frac{{(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2}}{\sqrt[]{2 π} log p / q} \\ \times \sum_{| j | < j^{*}} z^{- r_{0}} {(p^{- r_{0}} + q^{- r_{0}})}^{k} (2^{- r_{0} - 1 - i t_{j}} - 1) \\ \times Γ (r_{0} + i t_{j}) z^{- i t_{j}} p^{- i k t_{j}} k^{- 1 / 2} (1 + O (\frac{1}{\sqrt{k}})), \end{matrix}

(143)

which can be represented as

\begin{matrix} {\tilde{L}}_{k} (z) & = Φ_{2} ((1 + a log p) {log}_{p / q} n) \frac{z^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})), \end{matrix}

(144)

where

\begin{matrix} Φ_{2} (x) & = \frac{{(p / q)}^{- r_{0} / 2} + {(p / q)}^{r_{0} / 2}}{\sqrt[]{2 a π} log p / q} \sum_{| j | < j^{*}} (2^{- r_{0} - 1 - i t_{j}} - 1) Γ (r_{0} + i t_{j}) e^{- 2 π i j x} . \end{matrix}

(145)

This shows that

{\tilde{L}}_{k} (z) = O (\frac{z^{ν}}{\sqrt{log n}})

, when

\frac{2}{log q^{- 1} + log p^{- 1}} < a < \frac{p^{2} + q^{2}}{q^{2} log q^{- 1} + p^{2} log p^{- 1}} .

Subsequently, for

\frac{1}{log q^{- 1}} < a < \frac{2}{log q^{- 1} + log p^{- 1}}

, we get

\begin{matrix} {\tilde{L}}_{k} (z) & = 2^{k} - Φ_{2} ((1 + a log p) {log}_{p / q} n) \frac{z^{ν}}{\sqrt{log n}} (1 + O (\frac{1}{\sqrt{log n}})), \end{matrix}

(146)

and for

\frac{p^{2} + q^{2}}{q^{2} log q^{- 1} + p^{2} log p^{- 1}} < a < \frac{1}{log p^{- 1}}

, we get

\begin{matrix} {\tilde{L}}_{k} (z) & = O (n^{2}) . \end{matrix}

(147)

It is not difficult to see that for each range of a as stated above,

{\tilde{L}}_{k} (z)

has a lower order contribution to the asymptotic expansion of

{\tilde{G}}_{k} (z)

, compared to

{({\tilde{E}}_{k} (z))}^{2}

. This leads us to Theorem 4, which is proved below.

Proof of Theorem 4.

It only remains to show that the two depoissonization conditions hold. For condition (103) in Theorem 15, from (135) we have

\begin{matrix} | {\tilde{G}}_{k} (z) | & \leq B^{2} \frac{| z^{2 ν} |}{log n}, \end{matrix}

(148)

and for condition (104), we have, for fixed k,

\begin{matrix} | {\tilde{G}}_{k} (z) e^{z} | & \leq \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} |e^{z} - e^{(1 - P (w)) z} - e^{(1 - P (w^{'})) z} + e^{(1 - (P (w) + P (w^{'}))) z}| \\ & \leq 4^{k + 1} e^{| z | cos θ} . \end{matrix}

(149)

Therefore both depoissonization conditions are satisfied and the desired result follows. □

Corollary. A Remark on the Second Moment and the Variance

For the second moment we have

\begin{matrix} E [{({\hat{X}}_{n, k})}^{2}] & = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} E [{\hat{X}}_{n, k}^{(w)} {\hat{X}}_{n, k}^{(w^{'})}] + \sum_{w \in A^{k}} E [{\hat{X}}_{n, k}^{(w)}] \\ = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - {(1 - P (w))}^{n} - {(1 - P (w^{'}))}^{n} + {(1 - P (w) - P (w^{'}))}^{n}) \\ + \sum_{w \in A^{k}} (1 - {(1 - P (w))}^{n}) . \end{matrix}

(150)

Therefore, by (105) and (138) the Poisson transform of the second moment, which we denote by

{\tilde{G}}_{k}^{(2)} (z)

is

\begin{matrix} {\tilde{G}}_{k}^{(2)} (z) & = {({\tilde{E}}_{k} (z))}^{2} + {\tilde{E}}_{k} (z) - \sum_{w \in A^{k}} (1 - 2 e^{- P (w) z} + e^{- 2 P (w) z}), \end{matrix}

(151)

which results in the same first order asymptotic as the second factorial moment. Also, it is not difficult to extend the proof in Chapter 6 to show that the second moments of the two models are asymptotically the same. For the variance we have

\begin{matrix} Var [{\hat{X}}_{n, k}] & = E [{({\hat{X}}_{n, k})}^{2}] - {(E [{\hat{X}}_{n, k}])}^{2} \\ = \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - {(1 - P (w))}^{n} - {(1 - P (w^{'}))}^{n} + {(1 - P (w) - P (w^{'}))}^{n}) \\ + \sum_{w \in A^{k}} (1 - {(1 - P (w))}^{n}) \\ - \sum_{\begin{matrix} w, w^{'} \in A^{k} \\ w \neq w^{'} \end{matrix}} (1 - {(1 - P (w))}^{n} - {(1 - P (w^{'}))}^{n} + {(1 - P (w) - P (w^{'}))}^{n}) \\ - \sum_{w \in A^{k}} (1 - {(1 - P (w))}^{n} - {(1 - P (w))}^{n} + {(1 - P (w))}^{2 n}) \\ = \sum_{w \in A^{k}} ({(1 - P (w))}^{n} - {(1 - P (w))}^{2 n}) . \end{matrix}

(152)

Therefore the Poisson transform, which we denote by

{\tilde{G}}_{k}^{var} (z)

is

\begin{matrix} {\tilde{G}}_{k}^{var} (z) = \sum_{w \in A^{k}} (e^{- P (w) z} - e^{- (2 P (w) - {(P (w))}^{2}) z}) . \end{matrix}

(153)

The Mellin transform of the above function has the following form

\begin{matrix} {\tilde{G^{*}}}_{k}^{var} (s) & = Γ (s) {(p^{- s} + q^{- s})}^{k} (- 1 + O (P (w))) . \end{matrix}

(154)

This is quite similar to what we saw in (106), which indicates that the variance has the same asymptotic growth as the expected value. However, the variances of the two models do not behave in the same way (cf. Figure 3).
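Passing from a sequence in n to its Poisson transform, as done for the moments and the variance above, uses the rule that the Poisson transform of a geometric sequence a^n is e^{-(1 - a) z}. The sketch below checks this rule with a truncated series for a = 1 - P and for a = (1 - P)^2, where 1 - a = 2P - P^2. The function name, the truncation length, and the values of P and z are illustrative assumptions.

```python
import math


def poisson_transform_geometric(a: float, z: float, terms: int = 200) -> float:
    """Truncated sum over n < terms of a**n * z**n / n! * exp(-z)."""
    total = 0.0
    term = math.exp(-z)  # the n = 0 term
    for n in range(terms):
        total += term
        term *= a * z / (n + 1)  # ratio of consecutive terms
    return total


P, z = 0.05, 30.0
for a in (1.0 - P, (1.0 - P) ** 2):
    # The closed form is exp(-(1 - a) * z).
    print(poisson_transform_geometric(a, z), math.exp(-(1.0 - a) * z))
```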

4. Summary and Conclusions

We studied the first-order asymptotic growth of the first two (factorial) moments of the kth Subword Complexity. We recall that the kth Subword Complexity of a string of length n is denoted by

X_{n, k}

, and is defined as the number of distinct subwords of length k that appear in the string. We are interested in the asymptotic regime in which k grows as a function of the string’s length. More specifically, we conduct the analysis for

k = Θ (log n)

, and as

n \to \infty

.

The analysis is inspired by the earlier work of Jacquet and Szpankowski on the analysis of suffix trees, where they are compared to independent tries (cf. [14]). In our work, we compare the first two moments of the kth Subword Complexity to the kth Prefix Complexity over a random trie built over n independently generated binary strings. We recall that we define the kth Prefix Complexity as the number of distinct prefixes that appear in the trie at level k and lower.
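The two quantities being compared can be illustrated on the strings of Figure 1. The sketch below counts distinct length-k substrings of a single string and distinct length-k prefixes of a set of strings. Reading the kth Prefix Complexity as the number of distinct prefixes of length exactly k is our simplifying assumption, and both function names are ours.

```python
def subword_complexity(s: str, k: int) -> int:
    """Number of distinct substrings of length k that appear in s."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def prefix_complexity(strings, k: int) -> int:
    """Number of distinct length-k prefixes among the given strings."""
    return len({x[:k] for x in strings if len(x) >= k})


# The first six symbols of X and the first three symbols of X_1, ..., X_4 from Figure 1.
print(subword_complexity("101110", 2))                   # 3: the subwords 10, 01, 11
print(prefix_complexity(["111", "101", "100", "010"], 2))  # 3: the prefixes 11, 10, 01
```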

We obtain the generating functions representing the expected value and the second factorial moments as their coefficients, in both settings. We prove that the first two moments have the same asymptotic growth in both models. For deriving the asymptotic behavior, we split the range for k into three intervals. We analyze each range using the saddle point method, in combination with residue analysis. We close our work with some remarks regarding the comparison of the second moment and the variance to the kth Prefix Complexity.

5. Future Challenges

The endpoints of the intervals for a in Theorems 3 and 4 are not investigated in this work. The asymptotic analysis at these endpoints can be studied using the van der Waerden saddle point method [24].

The analogous results are not (yet) known in the case where the underlying probability source has Markovian dependence or in the case of dynamical sources.

Author Contributions

This paper is based on a Ph.D. dissertation conducted by L.A. under the supervision of M.D.W. All authors have read and agreed to the published version of the manuscript.

Funding

M.D.W.’s research is supported by FFAR Grant 534662, by the USDA NIFA Food and Agriculture Cyberinformatics and Tools (FACT) initiative, by NSF Grant DMS-1246818, by the NSF Science & Technology Center for Science of Information Grant CCF-0939370, and by the Society of Actuaries.

Acknowledgments

The authors thank Wojciech Szpankowski and Mireille Régnier for insightful conversations on this topic.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PGF	Probability Generating Function
$P$	Probability
$E$	Expected value
Var	Variance
$E [{(X_{n, k})}_{2}]$	The second factorial moment of $X_{n, k}$

References

1. Ehrenfeucht, A.; Lee, K.; Rozenberg, G. Subword complexities of various classes of deterministic developmental languages without interactions. Theor. Comput. Sci. 1975, 1, 59–75.
2. Morse, M.; Hedlund, G.A. Symbolic Dynamics. Am. J. Math. 1938, 60, 815–866.
3. Jacquet, P.; Szpankowski, W. Analytic Pattern Matching: From DNA to Twitter; Cambridge University Press: Cambridge, UK, 2015.
4. Bell, T.C.; Cleary, J.G.; Witten, I.H. Text Compression; Prentice-Hall: Upper Saddle River, NJ, USA, 1990.
5. Burge, C.; Campbell, A.M.; Karlin, S. Over-and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 1992, 89, 1358–1362.
6. Fickett, J.W.; Torney, D.C.; Wolf, D.R. Base compositional structure of genomes. Genomics 1992, 13, 1056–1064.
7. Karlin, S.; Burge, C.; Campbell, A.M. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 1992, 20, 1363–1370.
8. Karlin, S.; Mrázek, J.; Campbell, A.M. Frequent Oligonucleotides and Peptides of the Haemophilus Influenzae Genome. Nucleic Acids Res. 1996, 24, 4263–4272.
9. Pevzner, P.A.; Borodovsky, M.Y.; Mironov, A.A. Linguistics of Nucleotide Sequences II: Stationary Words in Genetic Texts and the Zonal Structure of DNA. J. Biomol. Struct. Dyn. 1989, 6, 1027–1038.
10. Chen, X.; Francia, B.; Li, M.; Mckinnon, B.; Seker, A. Shared information and program plagiarism detection. IEEE Trans. Inf. Theory 2004, 50, 1545–1551.
11. Chor, B.; Horn, D.; Goldman, N.; Levy, Y.; Massingham, T. Genomic DNA k-mer spectra: models and modalities. Genome Biol. 2009, 10, R108.
12. Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 2005, 21, i351–i358.
13. Janson, S.; Lonardi, S.; Szpankowski, W. On the Average Sequence Complexity. In Annual Symposium on Combinatorial Pattern Matching; Springer: Berlin/Heidelberg, Germany, 2004; pp. 74–88.
14. Jacquet, P.; Szpankowski, W. Autocorrelation on words and its applications: Analysis of suffix trees by string-ruler approach. J. Comb. Theory Ser. A 1994, 66, 237–269.
15. Liang, F.M. Word Hy-phen-a-tion by Com-put-er; Technical Report; Stanford University: Stanford, CA, USA, 1983.
16. Weiner, P. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973), Iowa City, IA, USA, 15–17 October 1973; pp. 1–11.
17. Gheorghiciuc, I.; Ward, M.D. On correlation Polynomials and Subword Complexity. Discrete Math. Theor. Comput. Sci. 2007, 7, 1–18.
18. Bassino, F.; Clément, J.; Nicodème, P. Counting occurrences for a finite set of words: Combinatorial methods. ACM Trans. Algorithms 2012, 8, 31.
19. Park, G.; Hwang, H.K.; Nicodème, P.; Szpankowski, W. Profile of Tries. In Latin American Symposium on Theoretical Informatics; Springer: Berlin/Heidelberg, Germany, 2008; pp. 1–11.
20. Flajolet, P.; Sedgewick, R. Analytic Combinatorics; Cambridge University Press: Cambridge, UK, 2009.
21. Lothaire, M. Applied Combinatorics on Words; Cambridge University Press: Cambridge, UK, 2005; Volume 105.
22. Szpankowski, W. Average Case Analysis of Algorithms on Sequences; John Wiley & Sons: Chichester, UK, 2011; Volume 50.
23. Widder, D.V. The Laplace Transform (PMS-6); Princeton University Press: Princeton, NJ, USA, 2015.
24. van der Waerden, B.L. On the method of saddle points. Appl. Sci. Res. 1952, 2, 33–45.

Figure 1. The suffix tree in (a) is built over the first four suffixes of string

X = 101110 . . .

, and the trie in (b) is built over strings

X_{1} = 111 . . .

,

X_{2} = 101 . . .

,

X_{3} = 100

, and

X_{4} = 010 . . .

.


Figure 2. Left:

Φ_{1} (x)

at

p = 0.90

, and various levels of

r_{0}

. The amplitude increases as

r_{0}

increases. Right:

Φ_{1} (x)

at

r_{0} = 1

, and various levels of p. The amplitude tends to zero as

p \to 1 / 2^{+}

.


Figure 3. Approximated second moments (left), and variances (right) of the kth Subword Complexity (red), and the kth Prefix Complexity (blue), for

n = 4000

, at different probability levels, averaged over 10,000 iterations.


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ahmadi, L.; Ward, M.D. Asymptotic Analysis of the kth Subword Complexity. Entropy 2020, 22, 207. https://doi.org/10.3390/e22020207

