Although CBFs are clearly useful for certain applications such as web data caching, determination of hot memory addresses, string matching in DNA sequence analysis, and protection against DDoS attacks, there has been no previous detailed theoretical analysis of CBFs used for count thresholding in the open literature (previous papers have only dealt with CBFs used to permit deletions of elements in large data sets). Such an analysis is necessary in order to predict the effectiveness of a CBF solution and to select the specific CBF parameters. For example, m, the number of CBF elements to be used, must be selected such that the resulting false positive probability is acceptable for the chosen application. In addition, k, the number of hash functions to be used, must be selected to minimize the false positive probability. This type of analysis is provided in this section.

#### 3.1. False Positive Probability

The analysis starts with a derivation of a close approximation for the false positive probability, which is necessary since the exact form given in Equation (2) involves a sum of binomial distributions, which is extremely difficult and time-consuming to compute for large n and m values. For large ${x}_{n}$ and small ${x}_{p}$, it is well known that a binomial distribution $b(x,{x}_{n},{x}_{p})$ can be approximated by a Poisson distribution with mean ${x}_{n}{x}_{p}$ [20]. For CBF applications, large n (data set size) and m (CBF size) values satisfy these conditions since ${x}_{n}=kn$ and ${x}_{p}=\frac{1}{m}$. Thus, the approximate false positive probability ${\widehat{p}}_{fp}$ can be written as follows.
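As a quick numerical sanity check (not from the paper), the quality of this Poisson approximation can be verified directly; the parameter values below (n = 1000, m = 10,000, k = 7, θ = 3) are illustrative assumptions:

```python
# Compare an exact binomial tail with its Poisson approximation for
# CBF-like parameters. A counter value follows b(x, x_n, x_p) with
# x_n = kn trials and success probability x_p = 1/m; the approximating
# Poisson distribution has mean x_n * x_p = kn/m.
from math import comb, exp, factorial

def binom_tail(theta: int, trials: int, p: float) -> float:
    """P(X >= theta) for X ~ Binomial(trials, p)."""
    return 1.0 - sum(comb(trials, i) * p**i * (1 - p)**(trials - i)
                     for i in range(theta))

def poisson_tail(theta: int, mean: float) -> float:
    """P(X >= theta) for X ~ Poisson(mean)."""
    return 1.0 - exp(-mean) * sum(mean**i / factorial(i) for i in range(theta))

n, m, k, theta = 1000, 10_000, 7, 3      # illustrative values
exact = binom_tail(theta, k * n, 1 / m)
approx = poisson_tail(theta, k * n / m)
print(exact, approx)                     # the two tails agree closely
```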

The cumulative distribution function (CDF) of a Poisson distribution is a regularized incomplete gamma function [21]. Thus, the approximate false positive probability can be written as

where the mean of the Poisson distribution used is defined as $\kappa =\frac{kn}{m}$ and $\mathsf{\Gamma}(\theta ,\kappa )={\int}_{\kappa}^{\infty}{t}^{\theta -1}{e}^{-t}\,\mathrm{d}t$.

As shown in Figure 1, this incomplete gamma function approximation results in a highly accurate approximation of ${p}_{fp}$. Note that the approximation ${\widehat{p}}_{fp}$ only depends on the ratio of $kn$ to m.

Figure 1 shows that ${p}_{fp}$ and ${\widehat{p}}_{fp}$ are nearly indistinguishable. The exact relative error of ${\widehat{p}}_{fp}$ is shown in Figure 2. For the parameters shown, the relative error is less than 0.48% when an optimal number of hash functions is used.

The optimal k values ${k}_{opt}(\theta )$, for which the false positive probabilities are the lowest, are shown using a dashed cyan line in Figure 1. As can be seen in the figure, when $\theta >1$, ${k}_{opt}(\theta )$ differs substantially from ${k}_{opt}^{trad}(\theta )={k}_{opt}(1)$, shown as a solid vertical orange line in Figure 1. Before proposing a systematic method for finding ${k}_{opt}(\theta )$ for general values of $\theta $, a rigorous analysis is presented to show that only one such value exists.

#### 3.2. Uniqueness of Optimal Number of Hash Functions

A sequence of lemmas is used to prove that there exists a unique value of ${k}_{opt}$ for which the false positive probability is minimized. To follow this proof process, the reader is advised to refer to the plots in Figure 3 and Figure 4 when reading the following lemmas.

Since the optimal false positive probability point occurs when its slope is 0, the proof starts by taking the derivative of ${\widehat{p}}_{fp}(\theta ,k,n,m)$ with respect to k. To find the shape of the derivative of ${\widehat{p}}_{fp}$, the logarithm of ${\widehat{p}}_{fp}$ can be used.

By taking the derivative of Equation (4),

By Leibniz’s rule and the definition of the incomplete gamma function [21],

Therefore, by applying Equation (6) to the right side of Equation (5) and multiplying both sides of Equation (5) by ${\widehat{p}}_{fp}(\theta ,k,n,m)$,

For ${k}_{opt}(\theta )$, this derivative should be set to 0. Since ${\widehat{p}}_{fp}>0$, the second part must be 0. Then, multiplying this second part by a common factor and denoting this term as $g(\theta ,\kappa )$, the following equations and lemmas follow.

To determine whether g is a decreasing or increasing function, the derivative of g is needed.

In Equation (7), the first part is greater than 0. Thus, g is a decreasing or increasing function depending on the sign of $1+\theta +ln(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})-\kappa $. Let ${y}_{1}(\kappa )=ln(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})$ and ${y}_{2}(\kappa )=\kappa -\theta -1$. The ${y}_{1}$ and ${y}_{2}$ terms are defined in this manner in order to facilitate the examination of the exact conditions under which g is a decreasing or increasing function, and thereby determine the conditions for the changes in slope of the false positive probability function. Then,

Examples of the shapes of ${y}_{1}$ and ${y}_{2}$ are shown in Figure 3. An example of the $g(\theta ,\kappa )$ function is shown in Figure 4.

**Lemma** **1.** For a fixed value of θ, ${y}_{1}$ is a strictly increasing concave function of κ.

**Proof** **of Lemma 1.** Using the definition of an incomplete Gamma function, the first partial derivative of ${y}_{1}$ can be shown to be greater than zero; i.e.,

Then, using the second partial derivative, it can also be verified that ${y}_{1}$ is concave when $\kappa \ge \theta -1$.
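For reference, this first partial derivative can be written out explicitly (a reconstruction from the definition of ${y}_{1}$, using $\frac{\mathrm{d}}{\mathrm{d}\kappa}\mathsf{\Gamma}(\theta ,\kappa )=-{\kappa}^{\theta -1}{e}^{-\kappa}$):

```latex
\frac{\partial y_1}{\partial \kappa}
  = \frac{\kappa^{\theta-1} e^{-\kappa}}
         {\mathsf{\Gamma}(\theta,0) - \mathsf{\Gamma}(\theta,\kappa)} > 0 ,
```

which is positive because both the numerator and the denominator are positive for $\kappa >0$.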

Now consider the situation when $\kappa <\theta -1$. The following are well-known properties of a Poisson distribution with mean $\kappa $, denoted by $Po{i}_{\kappa}(X=\theta )$ [22],

Then, based on [21] and applying Equation (9) to $(\theta -1-\kappa )(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})-\frac{{\kappa}^{\theta}{e}^{-\kappa}}{\mathsf{\Gamma}(\theta ,0)}$ in Equation (8),

Thus, $\frac{{\partial}^{2}{y}_{1}}{\partial {\kappa}^{2}}<0$ when $\kappa <\theta -1$. Therefore, ${y}_{1}$ is a strictly increasing concave function for all values of $\kappa $. □

**Lemma** **2.** ${y}_{1}(\kappa =\frac{\theta +1}{2})>{y}_{2}(\kappa =\frac{\theta +1}{2})$

**Proof** **of Lemma 2.** From [21], by substituting $\frac{\theta +1}{2}$ into the $ln(\cdot )$ function,

which is equivalent to

Then, by the Stirling inequality,

The function on the right hand side decreases from $\theta =0$ to $\theta =1$ and increases from $\theta =1$ to ∞; over the integer count thresholds $\theta $, its minimum value is 1, attained at $\theta =0$ and $\theta =1$. Thus, $\frac{{(\frac{\theta +1}{2})}^{\theta}}{\theta !}\ge 1.$ Therefore, $\sum _{l\ge \theta}\frac{{(\frac{\theta +1}{2})}^{l}}{l!}>\frac{{(\frac{\theta +1}{2})}^{\theta}}{\theta !}\ge 1.$ □
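This bound is easy to spot-check numerically over the range of thresholds considered later (a verification sketch, not part of the proof):

```python
# Spot-check the Lemma 2 bound ((theta + 1) / 2)**theta / theta! >= 1
# for integer count thresholds theta.
from math import factorial

def lemma2_ratio(theta: int) -> float:
    """Evaluate ((theta + 1) / 2)**theta / theta!."""
    return ((theta + 1) / 2) ** theta / factorial(theta)

values = [lemma2_ratio(t) for t in range(0, 31)]
print(min(values))   # 1.0, attained at theta = 0 and theta = 1
```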

**Lemma** **3.** $\underset{\kappa \to {0}^{+}}{lim}g(\theta ,\kappa )=0$, and $g(\theta ,\kappa )$ is a strictly decreasing function for $\kappa \in (0,{\kappa}_{1})$, where ${\kappa}_{1}\in (0,\frac{\theta +1}{2})$.

**Proof** **of Lemma 3.** From L’Hopital’s rule, $\underset{x\to {0}^{+}}{lim}xlnx=0$. This implies that $\underset{\kappa \to {0}^{+}}{lim}(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})ln(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})=0$. Therefore, $\underset{\kappa \to {0}^{+}}{lim}g(\theta ,\kappa )=0$.

As $\kappa $ approaches 0 from the right, $\underset{\kappa \to {0}^{+}}{lim}{y}_{1}(\kappa )=-\infty $, and $\underset{\kappa \to {0}^{+}}{lim}{y}_{2}(\kappa )=-\theta -1$. This implies that $\underset{\kappa \to {0}^{+}}{lim}{y}_{1}(\kappa )<\underset{\kappa \to {0}^{+}}{lim}{y}_{2}(\kappa )$. On the other hand, by Lemma 2, ${y}_{1}(\kappa =\frac{\theta +1}{2})>{y}_{2}(\kappa =\frac{\theta +1}{2})$. Therefore, by the intermediate value theorem, there exists a point ${\kappa}_{1}\in (0,\frac{\theta +1}{2})$ that satisfies ${y}_{1}({\kappa}_{1})={y}_{2}({\kappa}_{1})$. Then, since ${y}_{1}(\kappa )<{y}_{2}(\kappa )$ as $\kappa \to {0}^{+}$, ${y}_{2}$ is a straight line, and Lemma 1 states that ${y}_{1}$ is a strictly increasing function of $\kappa $, ${y}_{1}(\kappa )<{y}_{2}(\kappa )$ for $\kappa \in (0,{\kappa}_{1})$. Thus, $\frac{\partial}{\partial \kappa}g(\theta ,\kappa )<0$ for $\kappa \in (0,{\kappa}_{1})$. □

**Lemma** **4.** The function $g(\theta ,\kappa )$ is a strictly increasing function for $\kappa \in ({\kappa}_{1},{\kappa}_{2})$, where $\frac{\theta +1}{2}<{\kappa}_{2}<\theta +1$.

**Proof** **of Lemma 4.** From Lemma 2 again, ${y}_{1}(\kappa =\frac{\theta +1}{2})>{y}_{2}(\kappa =\frac{\theta +1}{2})$. On the other hand, for all real positive values $\kappa $, ${y}_{1}(\kappa )<0$, whereas ${y}_{2}(\theta +1)=0$. Thus, ${y}_{1}(\theta +1)<{y}_{2}(\theta +1)$. Therefore, by the intermediate value theorem again, there is a point $\frac{\theta +1}{2}<{\kappa}_{2}<\theta +1$ that satisfies ${y}_{1}({\kappa}_{2})={y}_{2}({\kappa}_{2})$. Finally, in the interval of $\kappa \in ({\kappa}_{1},{\kappa}_{2})$, $\frac{\partial}{\partial \kappa}g(\theta ,\kappa )>0$ because ${y}_{1}(\kappa )>{y}_{2}(\kappa )$ in this interval. □

**Lemma** **5.** The function $g(\theta ,\kappa )$ is a strictly decreasing function for $\kappa \in ({\kappa}_{2},\infty )$, and $\underset{\kappa \to \infty}{lim}g(\theta ,\kappa )=0$.

**Proof** **of Lemma 5.** By the definition of an incomplete gamma function, $\mathsf{\Gamma}(\theta ,\kappa )={\int}_{\kappa}^{\infty}{t}^{\theta -1}{e}^{-t}\,\mathrm{d}t$, the limit of $\mathsf{\Gamma}(\theta ,\kappa )$ as $\kappa $ approaches positive infinity is 0. This makes the left term of $g(\theta ,\kappa )$ become $1\cdot ln1=0$ when $\kappa $ approaches positive infinity. The right term becomes 0 by L’Hopital’s rule. Therefore, $\underset{\kappa \to \infty}{lim}g(\theta ,\kappa )=0$. From Lemma 1, ${y}_{1}$ is strictly concave, and from the proof of Lemma 4, ${y}_{1}({\kappa}_{2})={y}_{2}({\kappa}_{2})$. Thus, for $\kappa >{\kappa}_{2}$, $\frac{\partial}{\partial \kappa}g(\theta ,\kappa )<0$ because ${y}_{1}(\kappa )<{y}_{2}(\kappa )$. □

**Theorem** **1.** Given n, m and a count threshold $\theta >0$, there exists a unique value $k={k}_{opt}$ for which the false positive probability of a CBF has the minimum value. Furthermore, ${k}_{opt}(\theta )=\lfloor \frac{m}{n}{\kappa}^{*}(\theta )\rfloor $ or ${k}_{opt}(\theta )=\lceil \frac{m}{n}{\kappa}^{*}(\theta )\rceil $.

**Proof** **of Theorem 1.** In the interval of $\kappa \in (0,{\kappa}_{1}]$, $g(\theta ,\kappa )<0$ due to Lemma 3. Also, $g(\theta ,{\kappa}_{2})>0$ because Lemma 5 states that $g(\theta ,\kappa )$ is a strictly decreasing function from ${\kappa}_{2}$ to ∞, at which point $g(\theta ,\kappa )$ approaches zero. Then, due to Lemma 4, there is a unique value ${\kappa}^{*}\in ({\kappa}_{1},{\kappa}_{2})$ such that $g(\theta ,{\kappa}^{*})=0$. Finally, using the definition of $\kappa $, the theorem follows. □
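The root ${\kappa}^{*}$ can be located numerically. As a sketch, assume the approximation takes the form ${\widehat{p}}_{fp}=(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})^{k}$ (consistent with the terms appearing in $g$); then minimizing ${\widehat{p}}_{fp}$ over k at fixed n and m is equivalent to minimizing $\kappa \,ln(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})$ over $\kappa $, which depends only on $\theta $. Since Theorem 1 guarantees a unique minimum in $(0,\theta +1)$, a ternary search suffices (for integer $\theta $, the regularized term is the Poisson tail probability):

```python
# Numerically locate kappa*(theta) by minimizing h(kappa) = kappa * ln(P),
# where P = 1 - Gamma(theta, kappa)/Gamma(theta, 0) equals the tail
# probability P(X >= theta) for X ~ Poisson(kappa) and integer theta.
from math import exp, factorial, log

def poisson_tail(theta: int, mean: float) -> float:
    """P(X >= theta) for X ~ Poisson(mean)."""
    return 1.0 - exp(-mean) * sum(mean**i / factorial(i) for i in range(theta))

def kappa_star(theta: int, iters: int = 200) -> float:
    """Ternary search for the unique minimizer of h on (0, theta + 1)."""
    h = lambda kap: kap * log(poisson_tail(theta, kap))
    lo, hi = 1e-9, theta + 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if h(m1) < h(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

print(kappa_star(1))   # ~0.6931 = ln 2, the classic Bloom filter value
```

For $\theta =1$ this recovers $\kappa^{*}=ln\,2$, i.e., the well-known ${k}_{opt}=\frac{m}{n}ln\,2$ of a traditional Bloom filter, which serves as a check on the sketch.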

#### 3.3. Procedure for Determining the Optimal Number of Hash Functions

Based on the lemmas and theorem of the previous subsection, a general procedure to be used to find the optimal number of hash functions ${k}_{opt}(\theta )$ is as follows. Start from $k=1$ and compute the false positive probability ${\widehat{p}}_{fp}^{*}(\theta ,k,n,m)$ using Equation (3). Then increment k by one and recompute the false positive probability. Continue until the false positive probability starts to increase or $k=(\theta +1)m/n$ (since ${\kappa}^{*}<\theta +1$ and $k=\kappa m/n$), whichever comes first. The k value that results in the minimum ${\widehat{p}}_{fp}^{*}(\theta ,k,n,m)$ is ${k}_{opt}(\theta )$.
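The procedure above can be sketched in code. This assumes Equation (3) has the form ${\widehat{p}}_{fp}=(1-\frac{\mathsf{\Gamma}(\theta ,\kappa )}{\mathsf{\Gamma}(\theta ,0)})^{k}$ with $\kappa =kn/m$, evaluated via the Poisson tail for integer $\theta $:

```python
# Incremental search for k_opt(theta): increase k until the approximate
# false positive probability starts to increase or k exceeds (theta + 1)m/n.
from math import ceil, exp, factorial

def approx_fp(theta: int, k: int, n: int, m: int) -> float:
    """Approximate false positive probability (Poisson tail raised to k)."""
    mean = k * n / m
    tail = 1.0 - exp(-mean) * sum(mean**i / factorial(i) for i in range(theta))
    return tail ** k

def k_opt(theta: int, n: int, m: int) -> int:
    """Return the k with the smallest fp found by the incremental procedure."""
    best_k, best_fp = 1, approx_fp(theta, 1, n, m)
    k_max = ceil((theta + 1) * m / n)   # kappa* < theta + 1 bounds the search
    for k in range(2, k_max + 1):
        fp = approx_fp(theta, k, n, m)
        if fp > best_fp:                # probability started to increase
            break
        best_k, best_fp = k, fp
    return best_k

print(k_opt(1, 1000, 10_000))   # 7, close to (m/n) ln 2 = 6.93
```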

The above procedure can be simplified by using precomputed tables or a linear approximation. Table 1 shows a table of precomputed optimal ${\kappa}^{*}$ values for count thresholds $\theta $ ranging from 1 to 30. This table was created by following the procedure outlined above. Since $\kappa =kn/m$, this table can be used to determine ${k}_{opt}$ by simply using the relationship shown in Theorem 1; i.e., ${k}_{opt}$ is either the floor or ceiling of ${\kappa}^{*}m/n$.
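Given a tabulated ${\kappa}^{*}$, Theorem 1's floor-or-ceiling choice can be resolved by evaluating the approximation at both candidates (a sketch; the ${\kappa}^{*}$ value used below, $ln\,2$ for $\theta =1$, is the classic Bloom filter value rather than an entry copied from Table 1):

```python
# Resolve Theorem 1's floor-or-ceiling choice for k_opt given kappa*.
from math import ceil, exp, factorial, floor, log

def approx_fp(theta: int, k: int, n: int, m: int) -> float:
    """Approximate false positive probability (Poisson tail raised to k)."""
    mean = k * n / m
    tail = 1.0 - exp(-mean) * sum(mean**i / factorial(i) for i in range(theta))
    return tail ** k

def k_opt_from_table(theta: int, kappa_star: float, n: int, m: int) -> int:
    """Pick floor or ceiling of kappa* * m / n, whichever gives lower fp."""
    lo, hi = floor(kappa_star * m / n), ceil(kappa_star * m / n)
    lo = max(lo, 1)                      # use at least one hash function
    return lo if approx_fp(theta, lo, n, m) <= approx_fp(theta, hi, n, m) else hi

print(k_opt_from_table(1, log(2), 1000, 10_000))   # 7
```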

For large count thresholds $\theta $, a straight line approximation can be used to determine ${\kappa}^{*}$, and thereby ${k}_{opt}$. Figure 5 shows that the straight line approximation

closely tracks ${\kappa}^{*}(\theta )$ for large $\theta $ values. By plotting the relative errors, as shown in Figure 6, it can be seen that there is a relative error of less than about 2 percent when $\theta >30$.