Entropy Estimation Using a Linguistic Zipf–Mandelbrot–Li Model for Natural Sequences

Entropy estimation faces numerous challenges when applied to various real-world problems. Our interest is in divergence and entropy estimation algorithms which are capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach which is based on a new rank-based analytic Zipf–Mandelbrot–Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language; here, we introduce a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicates the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistic constraint-based Zipf–Mandelbrot–Li into a new non-equiprobable coincidence counting algorithm which is shown to be effective for tasks such as entropy rate estimation with limited data.


Introduction
Natural systems such as language, can be understood in terms of symbolic sequences described within an information-theoretic framework, where meaning is encoded through the arrangement of probabilistic elements. When placed in a mathematical framework, we can characterize and begin to understand the meaning of messages, not only on the basis of the meaning directly attached to words, but on the statistical characteristics of symbols.
Using this approach, natural language can be viewed as observing one or more discrete random variables X of a sequence X = X 1 , . . . , X i , . . . , X K , X i = x ∈ X M , that is, x i may take on one of M distinct values, X M is a set from which the members of the sequence are drawn, and hence x i is in this sense symbolic, where each value occurs with the probability p(x i ), i ∈ [1, M].
The single symbol Shannon entropy (if not otherwise specified, any reference to entropy will refer to the classical Shannon entropy of unigram probabilities.) is defined for unigrams as [1,2] In the context of language processing, statistical models of symbol sequences are of interest and can be defined by the probability p(Ω) = p(s 1 , . . . , s N ) where Ω is a sequence of N symbols {s i }. If the full sequence is available, the n-gram entropy can be directly computed using the same formula as (1), where instead of computing unigram probabilities, the joint probabilities are estimated so the n-gram entropy is obtained. However, the problem with this approach is that a large amount of data is generally required and the reliability is questionable for N > 5 [3].
In probabilistic language modeling, it is often of interest to predict the next symbol in a sequence using the previous symbols. Hence, the joint probability can be computed using the Markov property by considering the previous block of symbols in terms of the conditional probabilities as Hence, provided the symbol probabilities can be estimated, it is possible to determine the n-gram probabilities and n-gram entropy of language sequences. A related method of characterizing language models is to measure perplexity [4], defined as P e (s) = 2 H(s) In contrast to entropy, which can be understood as measuring the average number of bits to encode the information in a symbol, perplexity can be intuitively understood as measuring the total number of bits required to encode the information in a sequence; hence, the smaller the value the better. While entropy provides a measure of information in a given sequence, this raises the question of how quickly the information grows with increasing text length [1,5]. The idea is that the entropy rate can measure the complexity of language by the average information content of symbols such as words taken over a sufficiently long period. Similarly, the effectiveness of compression algorithms can be measured by how closely the algorithm can compress any stationary and ergodic source down to the entropy rate for a sufficiently long input source sequence [6].
Entropy rate has been of interest for analyzing the information content neuronal spike sequences [7], complexity of short heart period variability [8], attention models using visual salience attention [9], complexity of animal vocal complexity [10], statistical structure of non-redundant coding sequences in DNA [11], behavior prediction [12] and in estimating the long-term memory of language models [13].
A problem with entropy estimation is characterizing infrequently occurring symbols, hence requiring a potentially large number of samples to adequately model the probabilistic structure [14].
In contrast to statistical descriptive applications which depend on large amounts of data, we are interested in building models of social interaction using entropy estimation methods with limited available data.
Various efficient methods of entropy estimation have been proposed. Entropy estimation over short symbolic sequences for dynamical time series models was considered in [15]. A computationally efficient method for calculating entropy based on a James-Stein-type shrinkage estimator was proposed in [16]. Methods for overcoming bias in maximum likelihood entropy estimators with limited data have been examined in [17]. The Nemenman-Shafee-Bialek (NSB) entropy estimator extends this concept to correct sample size-dependent bias by using a Bayesian approach to construct priors with power-law dependence on the probabilities, in particular, using Dirichlet distributions [18].
The advantages of more sophisticated entropy estimation techniques which go beyond naive plug-in methods are evident. These can be broadly referred to as "model-based" because they introduce some additional complexities into an otherwise simple algorithm which takes into account some understanding of the nature of the data and the estimation process whilst remaining broadly applicable. For example, a model-based estimator using an understanding of how sequences of symbols will have probabilistic patterns of "coincidence" was proposed in [19].
In this paper, we consider a novel probabilistic model-based entropy estimator which extends [19] and is comparable to [18] in that it uses a limited amount of data and an a priori model as a basis for constructing an efficient entropy estimator. The model "hint" that we introduce is the idea that for many natural sequences including language, instead of a naive estimator, the probabilistic distribution of symbols is expected to follow linguistic patterns.
Hence, the basis for our proposed approach is to develop an analytic rank-based Zipfianstyle probabilistic model which is constrained to accommodate the linguistic features of human language and to incorporate this into an efficient non-equiprobable coincidence counting the entropy estimation algorithm.
In the next section, we describe a coincidence-counting entropy estimator and introduced the concept of linguistic entropy. In Section 3, we introduce a new framework of linguistic probabilistic models which is incorporated into the proposed entropy estimator. In Section 4, we demonstrate the efficacy of the proposed model and show that it provides a high degree of accuracy while requiring a small number of samples.

Coincidence Counting Approach
Entropy estimation difficulties can occur with low probability events exacerbated due to real-world issues surrounding the data, including problems of small data sets [20], limited resource environments [21], bias due to heavy-tailed distributions [22] and uneven distributions with poorly populated bins [23]. The latter problem is especially evident in estimating entropy in language involving very low probability events such as infrequent words.
The problem of undersampling in the context of entropy estimation, where the alphabet size is large compared with the number of samples, that is, N M, has been considered at length where it is well known that significant bias can occur, particularly in the case of using binning approaches [23,24]. Hence, alternative entropy estimation algorithms are of interest which can provide useful results with a small number of samples [25][26][27].
One class of proposed solutions is based on the method of coincidence counting to derive entropy from the phase space trajectory of symbolic events [28]. In particular, Ma proposed the method of coincidence counting as a suitable method of deriving entropy from the phase space trajectory, noting the problematic issue of metastability with estimating the empirical probability distribution. A simple algorithm for entropy estimation based on this approach was proposed in [19]. Their novel approach used the idea of estimating probabilities from a quadratic function of the inverse number of symbol coincidences; however, it has the limitation of this method, which was that it assumed equiprobable symbols. In the next section, we show how it is possible to extend this to the non-equiprobable case by using an analytic Zipf-Mandelbrot-Li law.

Linguistic Entropy Estimation
Consider a sequence of symbols which is defined by a discrete (or symbolic) random variable x which may take on a finite number M of distinct values x i ∈ {x 1 , . . . , x M } with probabilities p(x i ), i ∈ [1, M]. For example, suppose M = 4 and we have a sequence of symbols abcdabc. The frequency of symbols can be estimated as a function of the distance between consecutive repeating symbols or the 'coincidence distance' and in this case, the initial distance for a is D(a; M) = 5. Hence, it can be observed that by measuring this distance, it may be possible to estimate the relative frequency of any given symbol by measuring the distance between them.
To compute the probability f (n; M) of a first coincidence occurring exactly at the nth symbol for 1 < n < M means that it is necessary to compute the probability of drawing no repeating symbols in the entire sequence up to the (n − 1)th draw given by F(n − 1; M) and consequently drawing any q n−1 ∈ [2, . . . , n − 1] identical symbols is given by F(n − 1; M). Hence, the nth symbol coincidence probability is given by [19] f (n; M) = F(n; M) − F(n − 1; M) The expectation of the discrete parameter n and its associated probability f (n; M) is given by Since n is a function of M, we may define the coincidence distance: By forming a model to estimate M using the symbol distance D such as then by measuring D(M) and hence evaluating M from the parametric model, then the entropy can be directly estimated from the symbol coincidences. The model parameters θ i can be determined by fitting a curve to an ensemble of data. For equiprobable symbols, the Shannon entropy is estimated as Using this approach with only a small number of symbol observations, entropy estimation for equiprobable symbols was shown to be accurate and with a low bias of [19,29]. Now, for any given M, each symbol of a specified rank r can be treated as being equiprobable and hence by considering the probability of each ranked symbol, then we have: where F(n; M) is the probability of drawing any symbol on the first try followed by any other different symbol up to the nth draw and up to n symbols, and P h is the probability of independently drawing h − 1 identical symbols from a set of M in h − 1 draws. It follows that we can define the probabilities in terms of rank using a probabilistic model such as the Zipf-Mandelbrot-Li law developed by Li [30], who showed that the constants can be computed as with a normalization step introduced in [14] as to give: which leads to: This approach provides an equiprobable representation of the symbols by considering a different model for each symbol rank. Hence, an invertible rank-based model for D(M) = J (n, r, M) such as a power-based model is chosen so that the inverse model can be directly estimated using the forward model, for example, as where θ = {a, b, c} are the forward parameters. Given an estimate M(D; r) from the observed inter-symbol distance, it is possible to apply this parameter to the Zipf-Mandelbrot-Li set of equations in addition to our rank-based probability model, and estimate the entire set of symbolic probabilities. Using P h (r, M), the entropy can then be easily estimated as which defines the rank r Shannon entropy estimate. The model is applied by determining the mean distance between symbols D i (r) and then finding the estimated value M(D; r) which is used to estimate entropy. The scaling of the inverse model curves for the proposed algorithm can be observed in Figure 1.

Remarks on Bias and Convergence Properties
The proposed algorithm is defined in terms of a Zipf-Mandelbrot-Li distribution which uses a coincidence counting approach to estimate the mean symbolic distance between one or more ranked symbols and then use this to form an estimate of the whole distribution. We can consider the bias and convergence properties in terms of the maximum likelihood estimator for D(M; r) in contrast to the probabilities directly in a plug-in entropy estimator.
Algorithmic bias can occur due to systematically underestimating the mean distance between symbols D i (r) [29]. To compute the bias precisely requires a closed form of the probability density function of D i (r). Analyzing this requires considering (4) and (11)- (16), where an approximation to the probability distribution can be obtained from the multiplicative process defined by (11) in terms of a log-normal distribution [31]. However, since a closed form solution for the likelihood of a log-normal distribution is not generally available, it is non-trivial to determine the specific bias properties [32,33]. One possible solution to this is to examine the probabilistic bounds [34]. Though we do not consider it in this paper, the bias properties of the proposed algorithm may also be improved by applying techniques such as the Miller-Madow procedure [16].
In terms of the convergence properties of the algorithm, a full proof of convergence properties is beyond the scope of this paper; however we provide an indication of some properties of the expected result. Note that D i (r) can be determined from the maximum likelihood estimation of the inter-symbol distance for any given symbol. Choosing the most frequent symbol r = 1, then D i (1) will converge in the sense of a usual maximum likelihood estimator to within some limits with a particular confidence level [35].
Convergence depends on symbols with rank r = 1, with specified probability P h (1, M) and all other possible symbols with probability 1 − P h (1, M). Hence, the number of symbols required to estimate D i (r) to within the specified degree can be found by means of the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [35].
Since the convergence of D i (r) depends on the estimation of P h (r, M), then the DKW inequality provides the following result [14]: where for a maximum difference r between the estimated probability P h (r, M) and its theoretical target value P(r, M), there will be n r samples required to estimate the probability with a confidence level of ζ , specified by Hence, following [14], it can be shown that for a given confidence level, the minimum number of samples required to estimate P(r, M) which are described by a Zipf-Mandelbrot-Li approximation, can be found as where for a ranked distribution, ∆ r is found as Now, it follows that the relative convergence performance can be defined in terms of a scaling factor λ f (M) which measures the reduction in samples required for convergence compared to a naive plug-in estimator as measured against the symbolic alphabet size M.
Hence, we have: which simplifies to: A graph of the relative convergence performance is shown in Figure 2 where an improvement of several orders of magnitude in the reduction in the number of samples required can be observed for alphabet sizes in the ranges M = 20 − 40 which are of typical interest in both natural and synthetic languages.
In contrast, the conventional plug-in estimator requires an estimation of all symbol probabilities which depends on the probabilities of all symbols and consequently significantly more samples. A full derivation of this latter result is shown in [14]. Hence, the proposed entropy estimation algorithm converges with a factor of approximately λ f (M) fewer symbols than in the conventional case. The current estimator employs a ZML distribution, and in the next section, we extend this model to include linguistic constraints.

Limitations of Zipfian Models for Language
For various natural sequences, Zipf's law describes how the frequency of ranked events occurring in such a way that they can be described by a power law [36][37][38]. This question of whether Zipf's law is a universal law of natural language and other phenomena has generated substantial interest over a long period of time [39].
The premise of Zipf in 1949 was that natural systems follow a principle of least effort, which means that individuals will follow a course of action which involves the expenditure of the least amount of work [40]. In terms of human language, this implies that the distribution of word use would follow the same principle so communication would occur efficiently with the least effort.
Various ongoing works have attempted to prove and disprove results in this field. Miller proposed that a monkey typing would produce a natural language with Zipfian distribution [41]. This argument was based on the result that the probabilistic distribution of words in natural languages only occurs as a statistical artifact of random spaces and can be described by Zipf's law.
While this claim has continued to generate considerable interest decades later, in fact, Miller's result was shown to be flawed by Howes in 1968 [42]. The problem with Miller's result is that assumes all word probabilities are strictly ranked by word length. Moreover, it assumes that all possible words of the same length have the same probability and that all sequences of letters are equiprobable. Clearly, these assumptions are not valid for natural language. A more recent analysis of this problem was performed, for example, considering unequal letter probabilities and log normal rank distributions [43,44]. It was found in [45] that the average information content was more consistently ranked than word length, by examining the inter-word statistical dependencies as the n-gram entropy of words in a local linguistic context. Cancho and Sole considered the principle of least effort in human language as a compromise between speaker's and hearer's needs, where they were able to show results which indicate how Zipf's law explains the observations [46]. The concept of efficiency in languages was considered in [47], where efficiency can be defined in terms of successfully transmitting many different messages with minimal effort, yet balanced in terms of informativeness and complexity. Principles of least effort are closely related to the concept of semantic language universals by which such effort can be instantiated and measured. Accordingly, the principle of ease of learning was found to have strong evidence as a language universal in [48].
Zipf's law has been shown to occur as a result of the choice of rank as an independent variable [30,49], and hence has been challenged in terms of suitability as a universal model of human language or other natural sequences [50]. For example, in [30], it was reported that because the word frequency distribution of random texts can exhibit Zipfian characteristics, then Zipf's law is unsuitable as a criterion for identifying natural languages. However the statistical analysis of this result was disputed in [51] where it was shown that the rank distributions of random and natural texts are statistically inconsistent, and this suggests that Zipf's law may exist as a fundamental principle in natural languages.
While word frequency has generated significant interest, Zipf's law may operate at other levels in natural language, for example some results show the ranked order of phrases in natural language [52]. Moving towards more complex understandings of how Zipf's law can be refined, Corral considered the issue of how Zipf's law applies to normalized language element lemmas, i.e., a stem-like word form [50], and how a more complex formulation of Zipf's law of word frequency arising from a mixture of conditional distributions of frequency at different lengths may provide a better explanation of observations [53].
It is evident that there is not yet a single definitive answer as to the question of whether Zipf's law is necessarily a universal model of human language. However, it is clearly useful in forming a model of ranked symbolic information transmission which mimics human language elements [49,53] and have proved helpful as a probabilistic model for characterizing the observed behavior of natural symbolic sequences [54]. Here, we do not seek to prove the universality of Zipfian laws for language, but we consider their use as a way to model some aspects of natural language and how this may be useful for entropy estimation.
While a number of variations of Zipf's law have been proposed, including the Bradford Law [55] and Lotka Law [56], we previously proposed a new variation of the model we refer to as the Zipf-Mandelbrot-Li law [14,30,[57][58][59][60], which models the frequency rank r of a language element x ∈ Σ M+1 from an alphabet of size M + 1, then, for any random word of length L, given by v k (L) = {w s , x 1 , . . . , x L , w s }, k = 1, . . . , M L the frequency of occurrence is determined as where Li showed that λ can be analytically determined [30]. While these various rank-based laws provide a convenient analytical framework to model symbolic sequences, there is a problem in terms of known languages because such simple models carry no particular linguistic information. For example, consider the case of five-letter words. In English, there are approximately 12,500 known five-letter dictionary words. However, in the unconstrained ZML model, there can be 11.8 M words allowed. In the unconstrained form, this means that we allow words such as aaaag, rrrrx, czzzs, xyyaa and many other invalid words. A way to understand this is that such words do not conform to known linguistic principles such as orthographics [61], syllable structure [62], consonant/vowel ratio and organization [63,64], letter position [65] and graphemes [66]. Each of these can be viewed as a constraint on the allowable letters and their position in any word and hence it is evident that an unconstrained word-letter model allows a greater number of words than should be anticipated.
This raises the question of whether it is possible to introduce a probabilistic model which extends the advantageous Zipf-Mandelbrot-Li law to one which includes some linguistic constraints. Such a model would potentially provide the same useful analytic formulation while providing a more realistic bound on the types of words implicitly permitted. While the ZML model offers an effective basis for computing linguistic behavioral characteristics, it is evident that there is a need for improved models which provide a greater degree of conformity to the true properties of human language if such models are to be better utilized.

Unconstrained Rank-Ordered Probabilistic Model
Given a natural sequence such as language elements, consider a Zipfian probabilistic model of symbolic events which models the frequency rank r of a word (a word or n-gram is not necessarily referring to human language, but indicates a specific set of sequentially occurring symbols.), i.e., the r-th most frequent word, by a simple inverse power law, such that the frequency of a word f (r) scales according to an equation which is given by where a proportionality-dependent constant on the particular corpus may be introduced, Ref. [30] and where typically α ≈ 1. Thus, if p i (x) follows a Zipfian law, then p 0 (x) ∝ 1/M and p i (x) = ϕ f (r). This power law equation can be understood in terms of the principle of least effort. It indicates that frequency decays linearly as the rank increases according to a log-log scale. A way to view this is that according to Zipf's law, efficient language will minimize the effort between speaker and hearer [46]. Hence, the speaker may use a small vocabulary of common words to minimize effort in speaking and the hearer may desire a large vocabulary of less common words to minimize the effort in terms of ambiguity or confusion (Note that ambiguity and confusion are different concepts. The former may define the meaning of words, whereas confusion can relate to the intelligibility of words. Zipf's law provides a mathematical explanation of the balance between these competing features.).
We consider the Zipf-Mandelbrot law below [58]: Given symbols x ∈ Σ M+1 from an alphabet of size M + 1 which includes a blank space w s then for any random word of length L, given by v k (L) = {w s , x 1 , . . . , x L , w s }, k = 1, . . . , M L the total number of words possible is given by It follows that the frequency of occurrence for an unconstrained word of length L following a Zipf-Mandelbrot-Li distribution is determined as where γ is a normalization constant. Now, the summation of all probabilities of all such words is given by = γM (M + 1) 2 (32) and hence the normalization constant can be found as Li subsequently showed that, using an exponential transformation from the word length to word rank model, it is possible to derive a rank ordered, parametric probabilistic model-which extends the Zipf-Mandelbrot model and is defined in terms of the alphabet size M [30].

Constrained Linguistic Probabilistic Model
The model proposed by Li is particularly advantageous in a number of ways; however, in terms of our interest in synthetic language, the model makes a number of assumptions which depart from known statistical linguistics. For example, the model assumes that for a word of length L, the total number of words possible is M L given by (28). However, for a typical alphabet size, this vastly overestimates the number of words expected in a language, including many words which would not occur in known human languages.
As described in the previous section, the problem with the current ZML law is that the model is based on a simple estimate of the upper limit of possible words without consideration given to linguistic rules or other natural language principles beyond the initial power law. For example, in the English language, this might include orthographic spelling rules such as: (a) every word has at least one vowel; (b) "q" is almost always followed by "u"; (c) "s" never follows "x"; and (d) words never end in "v" or "j". These could potentially be considered as priors in a model, and there are other aspects of interaction in human communication and natural languages beyond linguistics which could be considered as statistical principles (For convenience we refer to these broadly as linguistic constraints and note that they may be related to verbal or written language.)to include in a model.
Another view of this problem is in terms of optimal coding, where the aim is typically to encode the most frequently used words in the most efficient way [67,68]. Despite ongoing interest in this area, there is strong evidence for Zipf's law of abbreviation which indicates that the highest ranking most frequent words tend to be shorter [38].
In contrast to a considerable body of work in deriving models to analyze and understand natural language, our interest is in deriving a probabilistic framework for constructing synthetic languages. Hence, in this section, we propose to consider a modified cZML law which is constrained to include linguistic principles.
As an example of linguistic features, we might consider the example of double-letter words. An extremely small number of words have double letters in comparison to the number of actual words possible. For five-letter English words, there are approximately 100 readily identifiable double letter words out of a possible 1.5 × 10 18 words. Hence, we can introduce a new ZML model which introduces the constraint of not permitting words with adjacent double letters.
The derivation of the ZML with linguistic constraints is given in Appendix A.1. The effect of parametrization due to the constraints is indicated in Figure 3 where it can be readily seen that the effect is most significant for smaller values of M but diminishes quickly as M increases. Similarly, the effect of the constrained cZML model can be observed in Figure 4 where the change in the symbolic probabilities can be observed for small values of M.  This new cZML model introduces synthetic linguistic constraints based on having no adjacent repeating symbols. It is evident that other constraints can be considered to improve the accuracy of the model from a linguistics perspective, which we derive in the next section.

Constrained Linguistic Probabilistic Model II
This issue of the language space is well known in terms of language smoothing, where techniques such as the CN-gram (continuation n-gram [69]) have been proposed to reduce the space of possible words. Based on known human languages, the value of N w (L; M) used by the constrained ZML model considered above is generally too large for a given value of M.
Hence, in this section, we derived a constrained cZML which introduces a reduced lexicon space through a CN-gram-style approach. The derivation of the ZML with linguistic constraints in this case is given in Appendix A.2.
The effect of the second form of the constrained ZML model can be observed in Figure 5 where the change in the symbolic probabilities can be observed for small values of M. While only a small probabilistic variation is observed, this corresponds to a significantly reduced maximum vocabulary size, which will lead to a more accurate estimation of entropy. In the next section, we consider the performance of these newly proposed constrained ZML models within the efficient model-based entropy estimation algorithm described in Section 2.

Constrained Linguistic ZML Model for Natural Language
To test the performance of the proposed models, we applied them to English language data from the Google Web Trillion Word Corpus [70,71].
The proposed linguistically constrained cZML model approximates the higher ranked probabilities with better accuracy than the original ZML model and also enables better accuracy for the low-ranked probabilities (Figures 6 and 7).
We consider the full set of 676 two letter bigrams from the Google data set. The nonlinear behavior of the actual data is evident (Figure 7). The original ZML model shows linear behavior and does not approximate the low ranked probabilities very well. In contrast, the proposed linguistically constrained cZML model has the effect of flattening the ranked probabilities to give better high-ranked approximation, while also producing more accurate behavior for the low ranked probabilities. Note that since we utilize a model-based approach for estimating entropy, the choice of Zipfian model is significant to the outcome. Hence, because the proposed model more accurately approximates typical linguistic sequences, this can can be expected to lead to a more accurate entropy estimation process, though still within the limitations described above. It is evident that, given the success of this approach, it is possible to consider numerous other constraints to improve the accuracy of the model from a linguistic perspective.

Entropy Rate Estimation
The entropy rate can be defined as the limit of joint entropy for an increasing number of symbols given by [67] h r (X) = lim where X = X 1 , . . . , X N can represent successive blocks of symbols. The task of entropy rate estimation is known to present a challenge due to the difficulty in obtaining a consistent estimate [72] and various number methods have been described. It is shown that, for the condition of stationarity, the entropy rate can be defined in terms of conditional entropy as [67]: In practice, for finite sequences of symbols, the condition of stationarity may not hold and therefore estimating the entropy rate using (35) may result in a different value than with (36). The value of entropy rate was estimated by Shannon using an experimental approach and found to be about one bit-per-character (bpc) [5]. Cover estimated the entropy rate for the novel Jefferson the Virginian, by Dumas Malone to be 1.25 bpc [73]. A word trigram method which used the cross-entropy between this model and a balanced sample of English text trained on a language model of 583 million symbols was applied to Form C of the Brown corpus which yielded an upper bound entropy rate estimate of 1.75 bpc [74]. A number of unigram entropy estimation methods using a stabilization criterion and a linear entropy to entropy rate conversion model were considered in terms of a large scale study across three parallel corpora, encompassing approximately 450M words in 1259 languages, leading to estimates of the entropy rate of 6 bits per word in [75]. An estimation method using the limit of successive backward differences in n-gram entropies was proposed in [76]. Compression algorithms have also been used as a basis for entropy rate estimation [77,78].
A method using an exponential extrapolation function was proposed in [79] to provide an estimate of entropy rate across multiple languages and 20 corpora provided results tending towards infinity. Interestingly, this result indicated that the entropy rates of human languages are positive but approximately 20% smaller than without extrapolation, which appears to be in agreement with the results obtained for entropy rate estimation using the algorithm proposed here.
Some issues are evident in the experimental studies because of the use of different entropy rate estimation algorithms, different corpora, the treatment of how n-grams are evaluated, for example whether only actually occurring n-grams within words are used or not (see for example the contrasting discussions between [3,74,75]), the inclusion of only alphabetic characters or whether to include punctuation, and various other factors.
Here, we estimate the entropy rate using the proposed cZML model-based entropy estimator applied to the Brown corpus which consists of approximately 5.5 million characters [80]. While a comprehensive comparison of the various entropy estimation algorithms is beyond the scope of this paper, the results of the proposed algorithm are compared with the plug-in entropy estimation approach. Prefiltering was performed to remove all non-alphanumeric characters except for spaces, and n-grams which are not part of any word were excluded. It is well recognized that a smaller data set presents a challenge for entropy estimation especially with increasing word lengths [3,75] and so it is of interest to observe the relative performance of the proposed algorithm.
The conditional entropy rate estimation approach of (36) was adopted in each case. For the entire Brown corpus, the entropy rate was estimated as 1.29 bpc, whereas using plug-in method, the result was 2.04 bpc and outside the upper bound indicated in [74].
The results from the proposed algorithm compare well with the result of 1.25 bpc for a large collection of English corpora in [3]; however, it is evident that the results from the plug-in method are significantly different, giving less confidence in their reliability. The advantage of requiring fewer samples for the proposed entropy estimation algorithm is also apparent in this case of entropy rate estimation.

Convergence of Constrained cZML Entropy Estimation Algorithm
The convergence characteristics of the proposed algorithm can be thereby obtained by generating a random sequence of symbols according to p j (x) and the estimated entropy is developed as a function of the sample size N s . This approach is applied to an example case where M = 30 with the results shown in Figure 8. The performance of the proposed algorithm is compared with a conventional plug-in estimator on a defined entropy estimation task over a large range of samples sizes up to N s = 10 6 symbolic samples. The mean entropy estimate is obtained by averaging over N v = 75 trials, where it can be observed that the new algorithm converges very rapidly, requiring significantly fewer samples to converge compared to a conventional plug-in estimator approach [14].
Note that since both the proposed algorithm and the conventional plug-in entropy estimation algorithm rely on a maximum likelihood method, the computational burden of each is O(N). However, the key advantage of the proposed model-based algorithm is that it requires substantially fewer samples than the conventional plug-in entropy estimator. Figure 8. The convergence performance of the proposed entropy estimation algorithm is compared to a regular plug-in estimator for M = 30 as a function of the sample size, up to N s = 10 6 samples averaged across N v = 75 trials. It can be noted that the estimate H 0 (N s ) from the proposed algorithm converges rapidly very closely towards the true entropy value (within three decimal places). In contrast, the conventional plug-in estimator H p (N s ) converges much more slowly towards a biased estimate.
It is well known that conventional plug-in estimators will give biased estimates. For the proposed algorithm, any such bias will result from the accuracy with which D r (M) can be estimated. It can be noted that as with conventional probability estimates, provided the system is stationary, it is possible to improve the estimate of D r (M) by increasing the number of samples.

Conclusions
Entropy estimators which go beyond naive maximum likelihood methods are of considerable interest, particularly in terms of overcoming limitations due to data. The approach of coincidence counting is recognized as a potentially powerful approach for model-based estimators. Here, we show that an efficient coincidence counting estimator can be derived using a Zipf-Mandelbrot-Li law which provides a significant reduction in the data required.
Interestingly, while Zipfian laws are ubiquitous in fields such as natural language, surprisingly, it appears that these models have evidently been developed essentially without particular regard for the possible inclusion of linguistic constraints.
In this paper, by introducing some simple linguistic constraints, we extended the regular rank-based Zipf-Mandelbrot-Li model to one which provides more realistic assumptions and makes it more suitable for a broad range of probabilistic language models which rely on an analytical Zipfian framework. Such models can be applied to the human language and provide a necessary foundation for developing synthetic language models.
Conceptually, the idea of model-based entropy estimators seems like it may potentially sacrifice accuracy; however, our results show that for natural systems where the symbolic events follow an approximate Zipfian distribution and using limited data, the performance is better than that obtained by a naive entropy estimator. We derived results which indicate that the expected improvement in convergence and demonstrated the efficacy of the proposed model on the entropy rate estimation and two-letter bigram entropy estimation where it was shown to produce more accurate behavior for both low-ranked and high-ranked symbolic probabilities.
The proposed constrained linguistic Zipf-Mandelbrot-Li model appears to be the first time this approach has been adopted. In future work, it would be of interest to further extend this concept by introducing more sophisticated linguistic constraints and to explore applications where limited data are available.

Acknowledgments:
The authors gratefully acknowledge funding support from the University of Queensland and from the Australian Government through the Defence Cooperative Research Centre for Trusted Autonomous Systems. The DCRC-TAS receives funding support from the Queensland Government.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of Constrained Linguistic Zipf-Mandelbrot-Li Models
Appendix A.1. cZML Model I The Zipf-Mandelbrot-Li law with linguistic constraints of the first type is derived as follows. For a word of length L, which is constrained to have no repeating adjacent symbols, the total number of words possible is given by and the frequency of occurrence for a word of length L using the constrained probabilistic model is: where γ is a normalization constant. It follows that γ can be determined from the word frequency normalization condition via the summation of all probabilities of such words [30], hence: and hence the normalization constant can be found as From (A2), the probability for any possible constrained word is: and hence the frequency of occurrence of all words of length L and with the constraint that there are no adjacent repeating symbols is given by an exponential function of L as Now it follows that the rank can be considered either in terms of the direct probability or equivalently, as the incremental change in probability as we change the word length, i.e., from L − 1 to L. Consider the rank of probabilities in the exponentially decreasing distribution {p i (M, L)}, defined as then from (A2) it follows that: and hence: where it is evident that the rank r(L) is proportional to an inverse function g of the corresponding probabilities, such that: Hence, since the rank is effectively determined by the inverse of the probabilities, and the probabilities are found as the inverse of the number of occurrences, and accordingly, we can calculate the total number of cumulative occurrences for all words up to length L, which in turn provides a measure of the ranking. Therefore, we have: where: and: and hence: which represents the exponential transformation from the constrained word's length to the word's rank under the given constraint: Rearranging (A19) and taking logs, for the lower bound, we have: and for the upper bound, we have: Following a similar approach to [30], raising 1 M to the power of the terms in (A21)-(A23) gives: 1 where: Using the identity 1 a then, it follows using a change of base: noting that: 1 then (A24) can be multiplied by 1/M to give probabilistic bounds, and hence we have: where: which leads to: which can be considered as a new form of Zipf-Mandelbrot-Li law which includes the specified linguistic constraints such that the constants originally computed according to (12)- (16), are now computed according to (A34)-(A36). Note that as the alphabet size M increases, the parameter given by (A34) converges to that of (12) in the original formulation: A formulation of the model using a maximum word length L max can now be introduced. It follows that a new value for γ can be determined from the word frequency normalization condition via the summation of all probabilities as before, hence: which can be shown to reduce to: and hence:

cZML Model II
The Zipf-Mandelbrot-Li law with linguistic constraints of the second type is derived as follows. For a word of length L, the total number of words possible is given by and hence the frequency of occurrence for a word of length L using the constrained probabilistic model is where: γ is a normalization constant. It follows that γ can be determined from the word frequency normalization condition via the summation of all probabilities of such words [30], hence: and hence the normalization constant can be found as Hence, since the rank is effectively determined by the inverse of the probabilities, and the probabilities are found as the inverse of the number of occurrences, accordingly, we can calculate the total number of cumulative occurrences for all words up to length L, which in turn provides a measure of the ranking. Therefore, we have: where: where: which leads to: which can be considered as a further linguistically constrained form of the Zipf-Mandelbrot-Li law, which includes the specified linguistic constraints such that the constants originally computed according to (12)-(16) are now computed according to (A79)-(A81). Note that as before, as the alphabet size M increases, the parameter given by (A79) converges to that of (12) in the original formulation: and the effect of the parametrization due to the constraints is similar to the first case. Following the same approach as before, a formulation of the model using a maximum word length L max can now be introduced. It follows that a new value for γ can be determined from the word frequency normalization condition via the summation of all probabilities as before, hence: and hence: This ZML model introduces further synthetic linguistic constraints based on having no adjacent repeating symbols.