Exact Probability Distribution versus Entropy

The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations to a natural language are considered. The guessing strategy used is guessing words in decreasing order of probability. When word and alphabet sizes are large, approximations are necessary in order to estimate the number of guesses. Several kinds of approximations are discussed demonstrating moderate requirements concerning both memory and CPU time. When considering realistic sizes of alphabets and words (100) the number of guesses can be estimated within minutes with reasonable accuracy (a few percent). For many probability distributions the density of the logarithm of probability products is close to a normal distribution. For those cases it is possible to derive an analytical expression for the average number of guesses. The proportion of guesses needed on average compared to the total number decreases almost exponentially with the word length. The leading term in an asymptotic expansion can be used to estimate the number of guesses for large word lengths. Comparisons with analytical lower bounds and entropy expressions are also provided.


Introduction
This work has been inspired by problems addressed in the field of computer security, where attacks on, e.g., password systems are an important issue. In a brute-force attack a password can be broken in a worst-case time proportional to the size of the search space and, on average, in half that time. However, if it is assumed that some words are more probable than others, the words can be ordered in the search space in decreasing order of probability. The number of guesses can then be drastically reduced. Properties of the average number of successive guesses have been discussed in detail by Pliam [1], who introduces the word guesswork to denote this quantity. Further, Lundin et al. [2] discuss confidentiality measures related to guesswork.
As will be demonstrated below, the calculation of the guesswork may require substantial computational effort, especially if the search space is large. Therefore lower bounds that are easy to calculate have been provided by several authors, e.g., Arikan [3] and Massey [4]. Lower and upper bounds are provided by Pliam [1], but they involve calculations similar to those needed for the guesswork itself and may therefore be of less practical use.
In this paper numerical approaches are suggested for evaluating the average number of successive guesses (guesswork) required for correctly guessing a word from a given language. The guessing strategy used is guessing words in decreasing order of probability. This is a continuation of investigations presented elsewhere [5]. In Section 2 the languages used in this paper are presented together with the corresponding expressions for the guesswork and entropy. Entropy is considered here because it is widely used instead of guesswork in applications, owing to its simpler determination. In Section 3 approximate numerical estimations of guesswork are discussed and in Section 4 the results for some probability distributions are given. Finally, in Section 5 the conclusions of the investigations presented in the paper are summarized.

Languages
A language is a set of strings and a string is a finite sequence of symbols from a given alphabet. Consider a stochastic variable $X$ belonging to a state space $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$, where the probability distribution is given by $p_X(x) = \Pr(X = x)$. Introduce the short-hand notation $p_i = p_X(x_i)$, where $\sum_{i=1}^{n} p_i = 1$. In the following the state space $\mathcal{X}$ and its size $n$ are considered as an alphabet with a certain number of symbols. Words are formed by combining symbols to strings. From $n$ symbols it is possible to form $n^m$ different words of length $m$. Shannon introduced various orders of approximations to a natural language, where the zero-order approximation is obtained by choosing all letters independently and with the same probability. In the first-order approximation the complexity is increased by choosing the letters according to their probability of occurrence in the natural language. In zero- and first-order approximation the strings thus consist of independent and identically distributed (iid) random variables. For higher-order approximations the variables are no longer independent [6].

Zero-order Approximation
In a zero-order approximation all symbols in the alphabet (of size $n$) have the same probability of occurrence ($p_i = 1/n$, $\forall i = 1, \ldots, n$). The average number of guesses $G_0$ required to correctly guess a word of length $m$ is given by

$$G_0(X_1, \ldots, X_m) = \frac{n^m + 1}{2}, \tag{1}$$

where $X_1, \ldots, X_m \in \mathcal{X}$. The entropy $H_0$ of a word of length $m$ is given by [6]

$$H_0(X_1, \ldots, X_m) = m\,H_0(X_1), \tag{2}$$

where $H_0(X_1) = \log_b n$ and $b$ is the base of the logarithm used. The average number of guesses grows exponentially with the size of the word, while the entropy grows linearly with the size of the word. This is in accordance with the definition of entropy, since it should be an extensive property growing linearly with the size of the system. In a zero-order approximation the relation between guesswork and entropy is $G_0(X_1, \ldots, X_m) = (b^{H_0(X_1, \ldots, X_m)} + 1)/2$. This relationship between guesswork and entropy holds in zero-order approximation, but not necessarily in higher-order approximations, as has been demonstrated by several authors (see, e.g., [1] and [7]). These authors strongly argue against the use of entropy in the estimation of the number of required guesses.

First-order Approximation
In a first-order approximation the symbols in the alphabet (of size $n$) do not necessarily have the same probability of occurrence. Assume the symbols are ordered in decreasing order of probability ($p_1 \ge p_2 \ge \ldots \ge p_n$). In first order the symbols in a word are considered as stochastically independent, so the most probable word (of a given length) consists of only $x_1$. The most improbable word, on the other hand, consists of only $x_n$. The average number of guesses $G_1$ required for making the correct guess of a word of length $m$ is given by the summation

$$G_1(X_1, \ldots, X_m) = \sum_{i_1=1,\ldots,i_m=1}^{n} p_{i_1} \cdots p_{i_m}\, g(i_1, \ldots, i_m), \tag{3}$$

where the function $g(i_1, \ldots, i_m)$ represents the number of guesses: one guess for the most probable word, two guesses for the second most probable word, and so on up to $n^m$ guesses for the most improbable word. The entropy $H_1$ of a word of length $m$ is given by [6]

$$H_1(X_1, \ldots, X_m) = \sum_{i_1=1,\ldots,i_m=1}^{n} p_{i_1} \cdots p_{i_m} \log_b \frac{1}{p_{i_1} \cdots p_{i_m}} = m\,H_1(X_1), \tag{4}$$

where $b$ is the base of the logarithm used. The calculation of Equation 3 is more complicated than that of Equation 4, since it requires that the products of probabilities ($p_{i_1} \cdots p_{i_m}$) are sorted in decreasing order. Such a procedure can be realized only for a moderate size of $n^m$ due to both storage and CPU time requirements. For larger values of $n^m$ approximate methods have to be used in order to get an estimate of the summation. Lower bounds of the guesswork that are easy to calculate have been provided by Massey [4] and Arikan [3]. Massey demonstrates that

$$G_1(X_1, \ldots, X_m) \ge \frac{1}{4}\, b^{H_1(X_1, \ldots, X_m)} + 1 \tag{5}$$

and Arikan that

$$G_1(X_1, \ldots, X_m) \ge \frac{1}{1 + m \ln n} \left[ \sum_{i=1}^{n} \sqrt{p_i} \right]^{2m}. \tag{6}$$

In Figure 1 the exact value of the guesswork for correctly guessing a word of size $m < 6$ using an alphabet of size 10 (with the randomly chosen probability distribution given in the figure) is displayed. The lower bounds provided by Equations 5 and 6 are given for word sizes $m \le 20$. For comparison, the exponential entropy expression

$$\frac{1}{2} \left[ b^{H_1(X_1, \ldots, X_m)} + 1 \right], \tag{7}$$

which is identical to guesswork in zero order, is given for word sizes $m \le 20$. For word sizes $m < 6$ Equation 7 clearly overestimates the exact value of the guesswork. In fact Pliam has shown that it is possible to construct probability distributions that make guesswork differ arbitrarily much from Equation 7 [1]. In Section 3 approximate numerical evaluations of guesswork are discussed.
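For small $n^m$ the quantities of this section can be computed directly by brute force. The following Python sketch is illustrative only: the function names and the four-symbol distribution are hypothetical, base 2 is assumed for the entropy, and Massey's bound is used in its usual form, which requires the word entropy to be at least 2 bits.

```python
import itertools
import math


def exact_guesswork(p, m):
    """Brute-force first-order guesswork: sort all n**m word
    probabilities in decreasing order and weight each by its rank."""
    probs = sorted(
        (math.prod(word) for word in itertools.product(p, repeat=m)),
        reverse=True,
    )
    return sum(rank * q for rank, q in enumerate(probs, start=1))


def word_entropy(p, m, base=2.0):
    """Entropy of a word of m iid symbols (m times the symbol entropy)."""
    return m * sum(q * math.log(1.0 / q, base) for q in p if q > 0)


def massey_bound(p, m):
    """Massey's lower bound; assumes the word entropy is >= 2 bits."""
    return 2.0 ** (word_entropy(p, m, 2.0) - 2) + 1


def arikan_bound(p, m):
    """Arikan's lower bound for words of m iid symbols."""
    n = len(p)
    return sum(math.sqrt(q) for q in p) ** (2 * m) / (1 + m * math.log(n))


def entropy_estimate(p, m):
    """Exponential entropy expression, exact for a uniform distribution."""
    return (2.0 ** word_entropy(p, m, 2.0) + 1) / 2


if __name__ == "__main__":
    p = [0.4, 0.3, 0.2, 0.1]  # hypothetical distribution, not from the paper
    for m in range(1, 6):
        print(m, exact_guesswork(p, m), entropy_estimate(p, m),
              massey_bound(p, m), arikan_bound(p, m))
```

The sort over all $n^m$ products is what limits this approach to small word lengths.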

Second-order Approximation
In a second-order approximation the variables are no longer independent. Consider two jointly distributed stochastic variables $X, Y \in \mathcal{X}$; the conditional probability distribution of $Y$ given $X$ is then given by $p_Y(y \mid X = x) = \Pr(Y = y \mid X = x)$. Introduce the short-hand notation $P_{ij} = p_Y(x_j \mid X = x_i)$, the probability that symbol $x_j$ follows symbol $x_i$. $P$ is an $n \times n$ matrix, where $n$ is the size of the alphabet and the sum of the elements in each row is one. The probability of occurrence of each symbol in the alphabet, $p_i$, can easily be obtained from matrix $P$ using the two equations $(P^{T} - I)\,p = 0$ and $|p| = 1$ (i.e., the elements of $p$ sum to one), where $p$ is a vector of length $n$ with elements $p_i$.
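The symbol probabilities can be obtained numerically in several ways. The sketch below swaps in power iteration for a direct solution of the linear system; this is a choice made for the illustration, not the method of the paper, and it assumes the chain defined by $P$ is irreducible and aperiodic. The function name and the 2×2 matrix are hypothetical.

```python
def stationary_distribution(P, iterations=10_000, tol=1e-14):
    """Approximate p with p = P^T p and sum(p) = 1 by power iteration.

    P is a row-stochastic matrix given as a list of lists.
    """
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # One step of p <- P^T p.
        nxt = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        nxt = [v / total for v in nxt]  # keep the elements summing to one
        if max(abs(a - b) for a, b in zip(p, nxt)) < tol:
            return nxt
        p = nxt
    return p


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.5, 0.5]]  # hypothetical transition matrix
    print(stationary_distribution(P))
```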
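The guesswork $G_2$, i.e., the average number of guesses required for making the correct guess of a word of length $m$ using an alphabet of size $n$, is given by

$$G_2(X_1, \ldots, X_m) = \sum_{i_1=1,\ldots,i_m=1}^{n} p_{i_1} P_{i_1 i_2} \cdots P_{i_{m-1} i_m}\, g(i_1, \ldots, i_m), \tag{8}$$

where the function $g(i_1, \ldots, i_m)$ is the same as the one in Equation 3. The entropy $H_2$ of a word of length $m$ is given by [8]

$$H_2(X_1, \ldots, X_m) = H_1(X_1) + (m - 1) \sum_{i=1}^{n} p_i \sum_{j=1}^{n} P_{ij} \log_b \frac{1}{P_{ij}}, \tag{9}$$

where $b$ is the base of the logarithm used. In Section 4 the value of Equation 8 will be compared to the value of Equation 7 (with $H_1$ replaced by $H_2$) for a given probability distribution.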
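As in first order, the second-order guesswork and entropy can be evaluated by brute force when $n^m$ is small. The Python sketch below is a minimal illustration with hypothetical function names; it assumes that `p` is the stationary distribution of the row-stochastic matrix `P`, which is what makes the entropy formula (first-symbol entropy plus $m - 1$ conditional entropies) valid.

```python
import itertools
import math


def exact_guesswork_second_order(p, P, m):
    """Brute-force second-order guesswork for words of length m >= 1."""
    n = len(p)
    probs = []
    for word in itertools.product(range(n), repeat=m):
        q = p[word[0]]
        for a, b in zip(word, word[1:]):
            q *= P[a][b]  # probability that symbol b follows symbol a
        probs.append(q)
    probs.sort(reverse=True)  # guess in decreasing order of probability
    return sum(rank * q for rank, q in enumerate(probs, start=1))


def entropy_second_order(p, P, m, base=2.0):
    """Word entropy, assuming p is stationary for P."""
    n = len(p)
    h_first = sum(q * math.log(1.0 / q, base) for q in p if q > 0)
    h_cond = sum(
        p[i] * P[i][j] * math.log(1.0 / P[i][j], base)
        for i in range(n) for j in range(n) if P[i][j] > 0
    )
    return h_first + (m - 1) * h_cond


if __name__ == "__main__":
    p = [5.0 / 6.0, 1.0 / 6.0]      # stationary for the hypothetical P below
    P = [[0.9, 0.1], [0.5, 0.5]]
    print(exact_guesswork_second_order(p, P, 4), entropy_second_order(p, P, 4))
```

When every row of `P` equals `p`, both functions reduce to their first-order counterparts, which gives a quick consistency check.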

Numerical Evaluation of Guesswork
In this section a number of approaches will be given in order to evaluate the guesswork.

Quantification
One simple procedure for numerically estimating Equation 3, while also reducing the storage requirements, is to split the range $[\log(1/p_1), \log(1/p_n)]$ into $N$ equal pieces of size $\Delta = \frac{1}{N} \log(p_1/p_n)$, where a larger value of $N$ gives a better estimate. The range $[\log(1/p_1^m), \log(1/p_n^m)]$ will consequently be split into $m \cdot N$ equal pieces of size $\Delta$. Instead of sorting the products $p_{i_1} p_{i_2} \cdots p_{i_m}$, they simply have to be evaluated and placed in one of the $m \cdot N$ subranges. When the number of products in each subrange has been determined, an estimate of Equation 3 can be made, giving

$$G_1^Q(X_1, \ldots, X_m; N) = \sum_{j=1}^{mN} c_j P_j \left[ C_j + \tfrac{1}{2}(c_j + 1) \right], \tag{10}$$

where $c_j$ is the number of probability products in subrange $j$, $C_j = \sum_{k=1}^{j-1} c_k$ and $\log(P_j^{-1})$ is the middle value of subrange $j$. By instead using the boundary values of the subranges, lower and upper bounds of the numerically estimated guesswork can be given as

$$\frac{1}{Q_1}\, G_1^Q \le G_1 \le Q_1\, G_1^Q, \tag{11}$$

where $Q_1 = (p_1/p_n)^{1/(2N)}$. Here the short-hand notation $Q_1$ is used instead of the more correct notation $Q_1(X_1, \ldots, X_m; N)$ in order to increase the transparency of Equation 11.
By introducing the density of products $d_i = c_i/(\Delta n^m)$ the summations in Equation 10 can be replaced by integrals for large values of $N$, giving Equation 12. Equation 12 will be of importance in Section 3.3, where a normal distribution approximation of the density of products is discussed. The method of quantification can be used in both first- and second-order approximation. However, since it is less obvious in second order which are the smallest and largest values of the product of probabilities $p_{i_1} P_{i_1 i_2} \cdots P_{i_{m-1} i_m}$, a lower bound of $\min(p_i) \cdot \min(P_{ij})^{m-1}$ and an upper bound of $\max(p_i) \cdot \max(P_{ij})^{m-1}$ can be used instead to determine the range of possible values. When determining $\min(P_{ij})$ only non-zero values are considered. In second order an expression similar to the one in Equation 11 can be used for estimating the guesswork, namely

$$\frac{1}{Q_2}\, G_2^Q \le G_2 \le Q_2\, G_2^Q, \tag{13}$$

where $Q_2$ is the second-order counterpart of $Q_1$, determined by the interval limits given above, and $G_2^Q$ is given by Equation 10 using the values given above as interval limits for probability products. Here the short-hand notation $Q_2$ is used instead of the more correct notation $Q_2(X_1, \ldots, X_m; N)$ in order to increase the transparency of Equation 13.
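The first-order quantification procedure can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the results: the function name is hypothetical, natural logarithms are assumed, every product in a subrange is assigned the probability at the middle of that subrange, and the distribution must not be uniform (otherwise the subrange width is zero). All $n^m$ products are still enumerated, but only $m \cdot N$ counters are stored.

```python
import itertools
import math


def quantified_guesswork(p, m, N):
    """Return (estimate, lower, upper) for first-order guesswork.

    The range of log(1/product) is split into m*N subranges of equal
    width; products are counted per subrange instead of being sorted.
    """
    p_max, p_min = max(p), min(p)
    delta = math.log(p_max / p_min) / N      # subrange width (needs p_max > p_min)
    x_min = m * math.log(1.0 / p_max)        # log(1/probability) of the likeliest word
    bins = m * N
    counts = [0] * bins
    logs = [math.log(1.0 / q) for q in p]
    for word in itertools.product(logs, repeat=m):
        x = sum(word)
        j = min(int((x - x_min) / delta), bins - 1)
        counts[j] += 1
    estimate = 0.0
    cumulative = 0                           # number of products in earlier subranges
    for j, c in enumerate(counts):
        if c:
            P_j = math.exp(-(x_min + (j + 0.5) * delta))   # middle of subrange j
            estimate += P_j * c * (cumulative + (c + 1) / 2)
            cumulative += c
    Q = (p_max / p_min) ** (1.0 / (2 * N))   # half a subrange, as a probability ratio
    return estimate, estimate / Q, estimate * Q


if __name__ == "__main__":
    p = [0.4, 0.3, 0.2, 0.1]  # hypothetical distribution
    print(quantified_guesswork(p, 5, 100))
```

Increasing `N` narrows the interval between the lower and upper values at the cost of more counters.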

Random selection
The storage and CPU time requirements using the strategy in Section 3.1 for calculating the guesswork are $O(m \cdot N)$ and $O(m \cdot n^m)$, respectively. One simple modification for decreasing the time requirements is to reduce the number of probability products formed. Instead of calculating all $n^m$ different products, a smaller number of randomly chosen probability products is used and placed in the $m \cdot N$ subranges. The smaller number is chosen to be proportional to $m$, i.e., equal to $m \cdot S$, where $S$ is a parameter whose value has to be chosen. After normalization, where the number of products in each subrange is multiplied by the factor $n^m/(m \cdot S)$, the strategy is identical to the one in Section 3.1.
By not using all $n^m$ different probability products another error is introduced. This error can be estimated by repeating the random selection calculations a number of times (given by $T$). Through these calculations an average value ($G_i^R$) and a standard deviation ($s_i^R$) can be estimated (where $i = 1$ or 2). A 99 % confidence interval for $G_i^Q$ is then given as

$$[1 - R_i]\, G_i^R \le G_i^Q \le [1 + R_i]\, G_i^R, \tag{14}$$

where $R_i = \lambda_{0.01/2} \cdot s_i^R / (G_i^R \cdot \sqrt{T})$ and $\lambda_{0.01/2} = 2.58$ (the quantile of the normal distribution) [9]. In Equation 14 all function arguments have been omitted in order to increase the transparency of the equation.
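A possible first-order realization of the random selection strategy and its error estimate is sketched below. The function names, the seed handling and the example parameters are hypothetical; words are drawn uniformly from the $n^m$ possibilities, the subrange counts are rescaled by $n^m/(m \cdot S)$, and the relative half-width $R$ is computed from $T$ repetitions with the quantile 2.58.

```python
import math
import random
import statistics


def sampled_guesswork(p, m, N, S, rng):
    """Quantified guesswork estimate from m*S uniformly sampled words."""
    p_max, p_min = max(p), min(p)
    delta = math.log(p_max / p_min) / N     # requires a non-uniform distribution
    x_min = m * math.log(1.0 / p_max)
    bins = m * N
    logs = [math.log(1.0 / q) for q in p]
    counts = [0] * bins
    samples = m * S
    for _ in range(samples):
        x = sum(rng.choice(logs) for _ in range(m))
        j = min(int((x - x_min) / delta), bins - 1)
        counts[j] += 1
    scale = len(p) ** m / samples           # normalization to n**m products
    estimate, cumulative = 0.0, 0.0
    for j, raw in enumerate(counts):
        c = raw * scale
        P_j = math.exp(-(x_min + (j + 0.5) * delta))
        estimate += P_j * c * (cumulative + (c + 1) / 2)
        cumulative += c
    return estimate


def guesswork_with_error(p, m, N, S, T, seed=0):
    """Return (mean estimate, relative half-width R of a 99 % interval)."""
    rng = random.Random(seed)
    runs = [sampled_guesswork(p, m, N, S, rng) for _ in range(T)]
    mean = statistics.mean(runs)
    s = statistics.stdev(runs)
    R = 2.58 * s / (mean * math.sqrt(T))
    return mean, R


if __name__ == "__main__":
    p = [0.4, 0.3, 0.2, 0.1]  # hypothetical distribution
    print(guesswork_with_error(p, m=10, N=50, S=2000, T=20))
```

The interval $[(1 - R)\,\text{mean}, (1 + R)\,\text{mean}]$ covers only the sampling error; the quantification error has to be added separately.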

Normal distribution
Another interesting approach is given by the central limit theorem in probability theory, which roughly states that the mean of a large number of independent stochastic variables is approximately normally distributed. In Figure 2 the density of the logarithm of products of probabilities for the randomized probability distribution given in Figure 1 is displayed. The density fits nicely to a normal distribution, with a better fit for larger values of $m$ (the number of independent stochastic variables). As expected the average value is proportional to $m$ and the standard deviation to $\sqrt{m}$. Denote the proportionality constants as $\mu_1$ and $\sigma_1$, respectively. For large values of $n$ and $m$ and for most probability distributions $\{p_1, \ldots, p_n\}$ it can be assumed (according to the central limit theorem) that the logarithm of products of probabilities will be normally distributed. The parameters of the normal distribution (the average value and the standard deviation) can be estimated from a sample of a small number of random products of probabilities (considerably smaller than required in the method described in Section 3.2). The normal distribution is used to estimate the number of probability products in each subrange; otherwise the strategy is identical to the one in Section 3.1. When approximating the density of logarithms of products by a normal distribution

$$d(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \tag{15}$$

where $\mu$ is the average value and $\sigma$ the standard deviation, Equation 12 can be expressed as Equation 16, where base $e$ has been adopted. The factor $e^{-x}$ (representing a product of probabilities) in Equation 12 causes a left shift of the normal distribution. This requires that the tails of the normal distribution are also accurate in order for this approximation to be valid. An error estimate for this kind of approximation is hard to make. However, if the density of the logarithm of probability products resembles a normal distribution also at its tails, then an error estimate similar to Equation 14 can be made.
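The use of a normal distribution to fill the subranges can be sketched as follows for first order. Two simplifications are assumptions of this illustration rather than statements from the text: $\mu_1$ and $\sigma_1$ are computed exactly as the mean and standard deviation of $\ln(1/p_i)$ over the alphabet (each symbol weighted equally, since the density counts words) instead of being estimated from a sample, and the part of the normal distribution falling outside the range of possible values is ignored. The function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def normal_guesswork(p, m, N):
    """Guesswork estimate with subrange counts taken from a normal density."""
    n = len(p)
    logs = [math.log(1.0 / q) for q in p]
    mu1 = sum(logs) / n
    sigma1 = math.sqrt(sum((x - mu1) ** 2 for x in logs) / n)
    mu, sigma = m * mu1, sigma1 * math.sqrt(m)   # mean ~ m, deviation ~ sqrt(m)
    delta = math.log(max(p) / min(p)) / N        # requires a non-uniform distribution
    x_min = m * math.log(1.0 / max(p))
    total = float(n) ** m                        # number of words
    estimate, cumulative = 0.0, 0.0
    for j in range(m * N):
        lo = x_min + j * delta
        # Expected number of products in subrange j under the normal density.
        c = total * (normal_cdf(lo + delta, mu, sigma) - normal_cdf(lo, mu, sigma))
        P_j = math.exp(-(lo + 0.5 * delta))
        estimate += P_j * c * (cumulative + (c + 1) / 2)
        cumulative += c
    return estimate


if __name__ == "__main__":
    p = [0.4, 0.3, 0.2, 0.1]  # hypothetical distribution
    print(normal_guesswork(p, 8, 100))
```

No products are enumerated here, so the cost is $O(m \cdot N)$ regardless of $n^m$; the accuracy depends entirely on how well the normal density matches the true one, including its tails.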
Further, the distance between the peaks of the two normal distributions in Equation 16 increases with $m$, resulting in decreasing values of the integral. In fact it can be shown that Equation 16 can be further approximated as Equation 17 for large values of $m$.

[Figure 3: fit to $A B^m / m^{1/2}$ (data from Equation 10).]

By using an asymptotic expansion of the error function for large arguments (taken from Reference [10]) it can be shown that the leading term of Equation 17 is given by Equation 18 if $\mu_1/(\sigma_1 \sqrt{2}) > \sigma_1/2$. Thus the leading term is of the form $n^m \cdot A \cdot B^m \cdot m^{-1/2}$, where $A$ and $B$ are constants for the given probability distribution. The result of fitting the data from Equation 10 (using a normal distribution) to such an expression is displayed in Figure 3. The results will be further discussed in the following section.
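The form of the leading term can be made plausible with a short calculation. The sketch below is a reconstruction under assumptions that are not taken from the text: base $e$, a continuum limit of the quantified sum written as a double integral over the density $d(x)$, a normal density with mean $\mu = m\mu_1$ and standard deviation $\sigma = \sigma_1\sqrt{m}$, integration limits extended to $\pm\infty$, and only the first term of the standard large-argument expansion of the complementary error function. The intermediate expressions therefore need not coincide with Equations 16 to 18.

```latex
% Assumed continuum limit (additive constants of order one neglected):
\begin{align*}
G_1 &\approx n^{2m} \int_{-\infty}^{\infty} d(x)\, e^{-x}
      \int_{-\infty}^{x} d(y)\, dy\, dx \\
% Completing the square shifts the normal density to the left by \sigma^2:
e^{-x}\, d(x) &= e^{-\mu + \sigma^2/2}\,
      \frac{1}{\sigma\sqrt{2\pi}}
      \exp\!\left( -\frac{(x - \mu + \sigma^2)^2}{2\sigma^2} \right) \\
% The remaining integral is a Gaussian average of a Gaussian CDF:
G_1 &\approx \frac{n^{2m}}{2}\, e^{-\mu + \sigma^2/2}\,
      \operatorname{erfc}\!\left( \frac{\sigma}{2} \right) \\
% First term of the expansion for large arguments:
\operatorname{erfc}(x) &\approx \frac{e^{-x^2}}{x\sqrt{\pi}}
      \qquad (x \to \infty) \\
% Inserting \mu = m\mu_1 and \sigma = \sigma_1\sqrt{m}:
G_1 &\approx n^{m} \cdot \frac{1}{\sigma_1\sqrt{\pi}} \cdot
      \left( n\, e^{-\mu_1 + \sigma_1^2/4} \right)^{m} \cdot m^{-1/2}
\end{align*}
```

Under these assumptions the constants would be $A = 1/(\sigma_1\sqrt{\pi})$ and $B = n\, e^{-\mu_1 + \sigma_1^2/4}$.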

Results
In this section two probability distributions will be discussed. First, the distribution given in Figure 1 is investigated in more detail in Section 4.1 and second, the English language is considered in Section 4.2.

Random probability distribution
In Figure 1 the average number of guesses required for correctly guessing a word of size m using an alphabet of size 10 is given using various techniques. First, the exact solution is given (for m < 6). Second, three approximate solutions (as discussed in Section 3) are given (quantification using all probability products could be performed within reasonable time limits only for m < 11). Third, an estimate based on entropy (Equation 7) is provided. Fourth, lower bounds derived by Massey (Equation 5) and Arikan (Equation 6) are included.
As is illustrated in Figure 1 the approximate techniques of quantification and random selection may provide accurate estimates of guesswork (with a reasonable amount of storage and CPU time). The third approximate technique (using a normal distribution) gives accurate estimates for large values of $m$ ($> 6$), in accordance with the central limit theorem. By using a fitting procedure for values in the range $9 \le m \le 40$ an approximate expression for guesswork is given by $G_1/n^m \approx 0.592 \cdot 0.920^m \cdot m^{-1/2}$ (see Figure 3). By evaluating the leading term of Equation 17 (see Equation 18) the expression $0.832 \cdot 0.912^m \cdot m^{-1/2}$ is obtained.
However, as is shown in Figure 3, for large values of m the result given by a normal distribution deviates from that given by the true distribution. When rescaling the x axis in Figure 2, so that the range of x values is the same for all m values the distribution becomes sharper for increasing values of m. For sharper distributions the fitting to a normal distribution is more sensitive. Small changes to the parameters µ and σ may give large changes in the value of guesswork.
The exponential entropy expression overestimates guesswork for small values of m (< 11) and underestimates it for large values. The lower bound of Massey is closer to the exact value than the lower bound of Arikan. However, both of the lower bounds underestimate the number of guesses by an order of magnitude for m = 20.
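The fitting procedure is not specified in detail. One simple option, shown here purely as an assumption, is a least-squares line through $\ln(\sqrt{m}\, G/n^m) = \ln A + m \ln B$. The function name is hypothetical, and the data in the example are synthetic values generated from the fitted constants quoted above, not actual guesswork results.

```python
import math


def fit_leading_term(ms, ratios):
    """Fit ratios[k] ~ A * B**m / sqrt(m) for m = ms[k].

    ratios are values of G / n**m; returns the constants (A, B).
    """
    ys = [math.log(r * math.sqrt(m)) for m, r in zip(ms, ratios)]
    k = len(ms)
    mean_m = sum(ms) / k
    mean_y = sum(ys) / k
    slope = (
        sum((m - mean_m) * (y - mean_y) for m, y in zip(ms, ys))
        / sum((m - mean_m) ** 2 for m in ms)
    )
    intercept = mean_y - slope * mean_m
    return math.exp(intercept), math.exp(slope)


if __name__ == "__main__":
    ms = list(range(9, 41))
    # Synthetic data built from the quoted fit, for illustration only.
    ratios = [0.592 * 0.920 ** m / math.sqrt(m) for m in ms]
    print(fit_leading_term(ms, ratios))
```

Fitting in logarithmic space weights relative rather than absolute deviations, which suits data spanning many orders of magnitude.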

Error estimates
Using the data in Figure 1 and Equation 11 the exact value can be determined to be in the interval $[0.999, 1.001] \cdot G_1^Q$, i.e., the error using quantification is about 0.1 %. The additional error of using random selection (see Equation 14) is determined to be between 0.26 and 0.56 % (depending on the $m$ value) at a confidence level of 99 %. The error due to random selection in the normal distribution approximation is determined to be between 0.30 and 0.63 % (depending on the $m$ value) at a confidence level of 99 %. Observe that this error does not include the goodness of fit of a normal distribution to the density of the logarithm of probability products.

English language
While the probability distribution discussed in the previous section was randomly chosen, the probability distribution considered here originates from the English language [11]. In Appendix A the English digram frequencies from Reference [11] are repeated. In order to calculate the conditional probability distribution discussed in Section 2.3 each row in the table in Appendix A has to be normalized. The probability distribution for each letter in the English alphabet can be obtained by the procedures discussed in Section 2.3. In Figure 4 the average number of guesses required for correctly guessing a word of size $m$ using the English alphabet of size 26 (with the data given in Appendix A) is displayed. Guesswork has been numerically evaluated in zero, first and second order.

As is illustrated in Figure 4 all approaches display an exponential behaviour (in first and second order) in accordance with Equations 5, 6, 7 and 17. A normal distribution was not applied since it is not in agreement with the true distribution. It overestimates guesswork by about an order of magnitude for $m = 30$. However, it is possible to make a fairly accurate fit of the guesswork data in first and second order to an expression of the form $n^m \cdot A \cdot B^m \cdot m^{-1/2}$, as was discussed in Section 3.3. By using a fitting procedure for the guesswork graphs in Figure 4 for $9 \le m \le 30$ the average number of guesses of words in the English language can be expressed according to the functions in Table 1. The deviation between the true and estimated values (according to Table 1) is less than 10 % (except for the smallest $m$ values). For both first and second order the entropy ansatz underestimates the number of guesses required, for first order by a factor of around 10 and for second order by a factor of around 100 for word lengths of 30. Further, using the extra information provided by a second-order approximation as compared to a first-order approximation reduces the number of guesses by a factor of around $10^5$ for word lengths of 30. The lower bounds of Massey (Equation 5) and Arikan (Equation 6) underestimate the number of guesses by approximately the same amount as the entropy expression for word lengths of 30.

Error estimates
In first order the errors introduced by using quantification can be calculated using Equation 11. Using the data in Figure 4 the exact value can be determined to be in the interval $[0.9998, 1.0002] \cdot G_1^Q$, where $G_1^Q$ is the approximate guesswork using quantification for first-order English. In second order Equation 13 and the data in Figure 4 make it possible to determine that the exact value is in the interval $[0.9996, 1.0004] \cdot G_2^Q$. Using the same procedure as in Section 4.1 the error introduced when randomly selecting probability products can be estimated. In first order a 99 % confidence interval for the guesswork is given by Equation 14 and by using the data in Figure 4 the error ($R_1$) is determined to be between 0.006 and 0.05 % (depending on the $m$ value). This should be added to the error of 0.02 % introduced by quantification. In second order exactly the same procedure can be used and the error is then estimated to be between 0.006 and 6 % (depending on the $m$ value). Again this error should be added to the error of 0.04 % introduced by quantification. The large error introduced by random selection in second order for large $m$ values is due to the fact that the fraction of probability products that are zero is larger for larger $m$ values. When $S \cdot m$ probability products are randomly selected, the number of non-zero probability products among them decreases with increasing $m$. To increase the accuracy of the guesswork estimate in second order, another $m$ dependence of the number of selected probability products has to be chosen.

Conclusion
In the paper it has been demonstrated that it is possible to estimate the average number of guesses (guesswork) of a word with a given length numerically with reasonable accuracy (of a couple of percent) for large alphabet sizes ($\approx 100$) and word lengths ($\approx 100$) within minutes. Thus, a numerical estimate of guesswork constitutes an alternative to, e.g., various entropy expressions. For many probability distributions the density of the logarithm of probability products is close to a normal distribution. For those cases it is possible to derive an analytical expression for guesswork showing the functional dependence on the word length. The proportion of guesses needed on average compared to the total number decreases almost exponentially with the word length. The leading term in an asymptotic expansion of guesswork has the form $n^m \cdot A \cdot B^m \cdot m^{-1/2}$, where $A$ and $B$ are constants (however, different for different probability distributions), $n$ is the size of the alphabet and $m$ is the word length. Such an expression can be determined for medium-sized values of $m$, using some fitting procedure, and used with fairly good accuracy for large values of $m$.
In the paper the English language has been investigated. The average number of guesses has been calculated numerically in both first and second order, giving a reduction of the number of guesses by a factor of $10^5$ for word lengths of 30 when the extra information provided by second order is included. A normal distribution of the logarithm of probability products was not applied since it is not in agreement with the true distribution. It overestimates guesswork by about an order of magnitude for word lengths of 30. Still it is possible to find accurate expressions for guesswork ($0.481 \cdot 0.801^m \cdot m^{-1/2}$ in first order and $0.632 \cdot 0.554^m \cdot m^{-1/2}$ in second order) in agreement with the true values (the deviation is less than 10 % for word lengths of 30).
Comparison between guesswork and entropy expressions has been performed showing that the entropy ansatz underestimates the number of guesses required, for first order by a factor of around 10 and for second order by a factor of around 100 for English words of length 30. Lower bounds of guesswork by Massey and Arikan have also been investigated. They are underestimating the number of guesses by approximately the same amount as the entropy expression for word lengths of 30.

A. English Digram Frequencies
The information in the matrix below is used for creating matrix P used in Section 2.3 [11]. After normalization of each row matrix P is obtained.
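The row normalization amounts to dividing each digram count by its row total. A minimal Python sketch, with a hypothetical function name and made-up counts in place of the actual digram table:

```python
def normalize_rows(counts):
    """Turn a matrix of digram counts into a row-stochastic matrix P."""
    P = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("a row without observations cannot be normalized")
        P.append([c / total for c in row])
    return P


if __name__ == "__main__":
    counts = [[2, 2], [1, 3]]  # made-up counts, not the English digram data
    print(normalize_rows(counts))
```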