Analysis of Counting Bloom Filters Used for Count Thresholding

Kim, Kibeom; Jeong, Yongjo; Lee, Youngjoo; Lee, Sunggu

doi:10.3390/electronics8070779

Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea

^*

Author to whom correspondence should be addressed.

Electronics 2019, 8(7), 779; https://doi.org/10.3390/electronics8070779

Submission received: 7 June 2019 / Revised: 7 July 2019 / Accepted: 9 July 2019 / Published: 11 July 2019

(This article belongs to the Section Computer Science & Engineering)

Abstract

A bloom filter is an extremely useful tool applicable to various fields of electronics and computers; it enables highly efficient search of extremely large data sets with no false negatives but a possibly small number of false positives. A counting bloom filter is a variant of a bloom filter that is typically used to permit deletions as well as additions of elements to a target data set. However, it is also sometimes useful to use a counting bloom filter as an approximate counting mechanism that can be used, for example, to determine when a specific web page has been referenced more than a specific number of times or when a memory address is a “hot” address. This paper derives, for the first time, highly accurate approximate false positive probabilities and optimal numbers of hash functions for counting bloom filters used in count thresholding applications. The analysis is confirmed by comparisons to existing theoretical results, which show an error, with respect to exact analysis, of less than 0.48% for typical parameter values.

Keywords:

counting bloom filter; database search; count thresholding; hash function

1. Introduction

A bloom filter (BF) is a powerful tool that can be used to create novel, low-overhead methods for dealing with big data sets in various software/hardware applications. Proposed by Burton Howard Bloom in 1970 [1], a BF is an m-bit vector that is initialized to 0. Assuming that n elements are stored into a data set, each time a new element is stored, k hash functions, each of which maps the element to one of m bit locations, are applied and the corresponding bits in the BF are set to 1. To determine if a new unknown element is a member of the data set or not, the k hash functions are applied to that element and the corresponding bits in the BF are checked. A positive answer is returned if all of those bits are found to be 1. Since the search time is independent of the number of elements in the data set, this results in extremely space-efficient and fast search of large data sets. Although false positives are possible, false negatives are not since the element could not have been stored in the data set if any of the k hash function bits in the BF are 0.
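The insert and query operations just described can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the paper: the class name `BloomFilter` and the way the k hash functions are derived (salting SHA-256 with the function index) are assumptions made only for this example.

```python
import hashlib


class BloomFilter:
    """Minimal m-bit Bloom filter with k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m  # the m-bit vector, initialized to 0

    def _locations(self, item: str):
        # Assumption of this sketch: the k hash functions are obtained by
        # salting a single SHA-256 hash with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Storing an element sets its k hashed bits to 1.
        for loc in self._locations(item):
            self.bits[loc] = 1

    def __contains__(self, item: str) -> bool:
        # Positive only if all k hashed bits are 1 (may be a false positive);
        # a single 0 bit proves the element was never stored.
        return all(self.bits[loc] for loc in self._locations(item))
```

A stored element is always reported as present, while an element that was never stored is reported as absent unless all k of its bits happen to have been set by other elements.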

Over the years, various BF variants have been proposed. One commonly cited variant is a counting bloom filter (CBF), which has been proposed as a method that supports deletion as well as addition of elements to a data set [2]. In a CBF, the m elements are multiple-bit counts instead of bits. Every time an element is added to or deleted from the data set, k hash functions are applied to the element and all of the k locations in the m-element CBF are incremented or decremented by one. In order to check if a specific element is still in the data set, the k hashed locations in the CBF can be checked for the absence of 0 values. Several methods have been proposed to enable efficient support of m-element CBFs [3,4].

BFs and CBFs have been found to be useful for numerous applications. For example, in a web cache, instead of storing all web objects that are accessed, disk writes can be significantly reduced by only storing those web objects that are referred to more than once, thereby eliminating “one-hit wonders” (accessed by a set of users once and never again). A BF can be used to quickly determine if a specific web page has been referenced before or not. For long-term usage, a CBF can be used instead of a BF in order to permit deletion of long unused web objects from the web cache. This approach has been found to reduce the rate of disk writes by nearly half in an actual system of servers [5].

Other applications where BFs and CBFs have been found to be useful include communications and networking applications [6,7,8,9], Huffman coding [10], cache architecture design [3,11], memory wear leveling [4], and string search for DNA sequence identification [12,13,14,15].

Given that the elements in a CBF contain approximate “counts”, this paper examines the problem of using a CBF as an approximate counting mechanism, in particular to check whether a certain data element has been referred to θ or more times, where θ is a count threshold. For example, when applied to the above web cache application, θ = 2 could be used to eliminate one-hit web objects; i.e., disk write usage could be reduced by only storing web objects that have been accessed two or more times. As a second example, when applied to memory management in a computer system, a value such as θ = 5 could be used to identify hot memory addresses. As a third example, when applied to DNA sequence identification, approximate count thresholding could be used to quickly identify or match specific strings in a DNA sequence that occur θ or more times. As a fourth example, approximate count thresholding could be used to determine if a set of nodes are accessing a given web page a large number of times within a short timespan and thus help guard that web page against distributed denial of service (DDoS) attacks [16].
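A count-thresholding query on a CBF can be sketched as follows. The class name `CountingBloomFilter`, the method names, and the salted SHA-256 hashing are hypothetical choices for this illustration; the paper does not prescribe them.

```python
import hashlib


class CountingBloomFilter:
    """Minimal m-element counting Bloom filter used for count thresholding."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.counts = [0] * m  # m multi-bit counters, initialized to 0

    def _locations(self, item: str):
        # Assumption of this sketch: k hash functions derived by salting SHA-256.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        # Each reference increments all k hashed counters by one.
        for loc in self._locations(item):
            self.counts[loc] += 1

    def remove(self, item: str) -> None:
        # Deletion decrements the same k counters.
        for loc in self._locations(item):
            self.counts[loc] -= 1

    def at_least(self, item: str, theta: int) -> bool:
        # Reports True if every hashed counter is >= theta. Counters can only
        # overestimate an item's count, so an item referenced theta or more
        # times is never missed, but a less frequent item may be reported.
        return all(self.counts[loc] >= theta for loc in self._locations(item))
```

For instance, after three calls to `add("page")`, the query `at_least("page", 3)` returns True, which corresponds to the web cache and hot address examples above with the threshold set to 3.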

The main motivation for this paper is to provide a solid theoretical foundation for the use of CBFs for count thresholding applications. Towards this end, Section 2 introduces the traditional BF and CBF analysis. Then Section 3 presents the newly proposed CBF analysis method and its main results. Next, Section 4 uses simulations to confirm this new analysis method. Finally, Section 5 concludes this paper.

2. Previous Bloom Filter (BF) and Counting Bloom Filter (CBF) Analysis

The traditional BF analysis method proceeds as follows [17,18]. For insertion of n elements into a data set, an m-bit BF, initialized with all 0 bits, is used. Every time an element is inserted into the data set, k hash functions are applied and the corresponding bits in the m-bit BF are set to 1. After the first element is inserted into the data set and the first hash function is used to set one bit of the BF, an arbitrary bit in the BF is 0 with probability

(1 - 1 / m)

. After all k hash functions are used, an arbitrary bit in the BF is still 0 with probability

{(1 - 1 / m)}^{k}

. Thus, after all n elements are inserted and all k hash functions applied to each of those n elements, the probability that an arbitrary bit in the BF is 0 is

p_{0} = {(1 - 1 / m)}^{k n} \approx e^{- k n / m}

, where the approximation is based on the definition of e [17,18].

If a user wishes to determine whether an element is present in the data set, he/she applies the k hash functions and checks all k bits in the BF. If an element is not in the data set, an erroneous result is produced when all k hashed bits in the BF are unity. Treating the k hashed bits as independent, each being 1 with probability 1 - p_{0}, the false positive probability is

p_{f p}^{t r a d} = {(1 - {(1 - 1 / m)}^{k n})}^{k} \approx {(1 - e^{- k n / m})}^{k}

. An important BF parameter that must be selected is the number of hash functions, k, to be used. For this purpose, the k value that is typically used is the one that minimizes the false positive probability. Thus, based on the above approximation,

k_{o p t}^{t r a d} = (m / n) ln 2 .

(1)
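As a quick numerical illustration of the traditional formulas, the sketch below evaluates the false positive probability and Equation (1). The function names `p_fp_trad` and `k_opt_trad` are chosen for this example only.

```python
import math


def p_fp_trad(k: int, n: int, m: int) -> float:
    """Traditional BF false positive probability (1 - (1 - 1/m)^(k n))^k."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def k_opt_trad(n: int, m: int) -> float:
    """Real-valued optimum of Equation (1): (m / n) ln 2."""
    return (m / n) * math.log(2)


if __name__ == "__main__":
    n, m = 1000, 4000
    print("k_opt_trad =", k_opt_trad(n, m))
    for k in range(1, 7):
        print(k, p_fp_trad(k, n, m))
```

Since k must be an integer in practice, the real-valued result of Equation (1) is rounded to a neighboring integer; for m / n = 4 the smallest false positive probability among integer k values is obtained at k = 3.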

The careful reader will note that the above analysis is not strictly correct as it assumes independence of the values of bits in the BF, even if the k hashed locations are for an element in the data set [11]. However, using Chernoff bounds, Mitzenmacher and Upfal have shown that, for large m and n, the same result is obtained even without the independence assumption [19]. Therefore, as can be easily verified by the reader, given sufficiently large n and m (e.g.,

n = 25

and

m = 100

or larger), the above equations are highly accurate and k can be chosen based on Equation (1).

Using the same assumptions as [18,19], the above analysis can be extended to a CBF. To do this, it is noted that after n elements have been inserted into a data set and

k n

uniformly random hash mappings have been used to increment the values in an m-element CBF, the probability that an arbitrary element in the CBF has the value l is simply defined by the probability mass function (pmf) of a binomial distribution with success probability

1 / m

. Denoting this as

b (l, k n, \frac{1}{m})

,

b (l, k n, \frac{1}{m}) = (\binom{k n}{l}) {(\frac{1}{m})}^{l} {(1 - \frac{1}{m})}^{k n - l} .

Note that when

l = 0

, which corresponds to checking whether an arbitrary element in the CBF is 0, this equation simplifies to

b (0, k n, \frac{1}{m}) = {(1 - \frac{1}{m})}^{k n} = p_{0}

.

Suppose an m-element CBF is used to determine if an element has been referenced

θ

or more times. After insertion of n elements in a data set, the probability that an arbitrary element of the CBF has a value less than

θ

is simply the sum of

b (l, k n, \frac{1}{m})

from

l = 0

to

l = θ - 1

. Thus, the false positive probability with count threshold

θ

is as follows.

p_{f p} (θ, k, n, m) = {(1 - \sum_{l < θ} b (l, k n, \frac{1}{m}))}^{k} .

(2)
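Equation (2) can be evaluated directly for moderate parameter values. The following sketch does so with exact binomial coefficients; the function names are illustrative, and the approach becomes slow for very large k n, which is the motivation for the approximation developed in Section 3.

```python
import math


def binom_pmf(l: int, trials: int, p: float) -> float:
    """b(l, trials, p): probability of exactly l successes in `trials` trials."""
    return math.comb(trials, l) * p**l * (1.0 - p) ** (trials - l)


def p_fp_exact(theta: int, k: int, n: int, m: int) -> float:
    """Equation (2): (1 - sum_{l < theta} b(l, k n, 1/m))^k."""
    below_threshold = sum(binom_pmf(l, k * n, 1.0 / m) for l in range(theta))
    return (1.0 - below_threshold) ** k


if __name__ == "__main__":
    for theta in range(1, 6):
        print(theta, p_fp_exact(theta, 3, 1000, 4000))
```

With θ = 1 the sum contains only the l = 0 term, so the function reduces to the traditional BF false positive probability.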

3. Proposed Analysis and Results

Although CBFs are clearly useful for certain applications such as web data caching, determination of hot memory addresses, string matching in DNA sequence analysis, and protection against DDoS attacks, there has been no previous detailed theoretical analysis of CBFs used for count thresholding in the open literature (previous papers have only dealt with CBFs used to permit deletions of elements in large data sets). Such an analysis is necessary in order to be able to predict the effectiveness of a CBF solution and the specific CBF parameters to use. For example, m, the number of CBF elements to be used, must be selected such that the resulting false positive probability level is acceptable for the chosen application. In addition, k, the number of hash functions to be used, must be selected to minimize the false positive probability. This type of analysis is provided in this section.

3.1. False Positive Probability

The analysis starts with a derivation of a close approximation for the false positive probability, which is necessary since the exact form given in Equation (2) involves a sum of binomial distributions, which is extremely difficult and time-consuming to compute for large n and m values. For large

x_{n}

and small

x_{p}

, it is well known that a binomial distribution

b (x, x_{n}, x_{p})

can be approximated by a Poisson distribution with mean

x_{n} x_{p}

[20]. For CBF applications, large n (data set size) and m (CBF size) values satisfy these conditions since

x_{n} = k n

and

x_{p} = \frac{1}{m}

. Thus, the approximate false positive probability

{\hat{p}}_{f p}

can be written as follows.

{\hat{p}}_{f p} (θ, k, n, m) = {(1 - e^{- \frac{k n}{m}} \sum_{l < θ} \frac{1}{l!} {(\frac{k n}{m})}^{l})}^{k} \approx p_{f p} (θ, k, n, m) .

The cumulative mass function (cmf) of a Poisson distribution is a regularized incomplete gamma function [21]. Thus, the approximate false positive probability can be written as

{\hat{p}}_{f p} (θ, k, n, m) = {\hat{p}}_{f p} (θ, κ) = {(1 - \frac{Γ (θ, κ)}{Γ (θ, 0)})}^{k},

(3)

where the mean of the Poisson distribution used is defined as

κ = \frac{k n}{m}

and

Γ (θ, κ) = \int_{κ}^{\infty} t^{θ - 1} e^{- t} d t .

As shown in Figure 1, this incomplete Gamma function approximation results in a highly accurate approximation of

p_{f p}

. Note that the approximation

{\hat{p}}_{f p}

only depends on the ratio of

k n

to m. Figure 1 shows that

p_{f p}

and

{\hat{p}}_{f p}

overlap almost 100%. The exact relative error of

{\hat{p}}_{f p}

is shown in Figure 2. For the parameters shown, the relative error is less than 0.48% when an optimal number of hash functions is used.
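The comparison between the exact and approximate expressions can be reproduced with a short script. This is a sketch under the assumption that the Poisson sum is evaluated term by term rather than through an incomplete gamma routine; the function names are illustrative.

```python
import math


def p_fp_exact(theta: int, k: int, n: int, m: int) -> float:
    """Equation (2), using the binomial pmf with success probability 1/m."""
    p = 1.0 / m
    trials = k * n
    below = sum(math.comb(trials, l) * p**l * (1.0 - p) ** (trials - l) for l in range(theta))
    return (1.0 - below) ** k


def p_fp_approx(theta: int, k: int, n: int, m: int) -> float:
    """Equation (3), written as a Poisson sum with mean kappa = k n / m."""
    kappa = k * n / m
    term = math.exp(-kappa)  # Poisson probability of the value 0
    below = 0.0
    for l in range(theta):
        below += term
        term *= kappa / (l + 1)  # advance to the Poisson probability of l + 1
    return (1.0 - below) ** k


def relative_error(theta: int, k: int, n: int, m: int) -> float:
    exact = p_fp_exact(theta, k, n, m)
    return abs(p_fp_approx(theta, k, n, m) - exact) / exact


if __name__ == "__main__":
    for theta in range(1, 6):
        print(theta, relative_error(theta, 6, 1000, 4000))
```

The parameters n = 1000 and m = 4000 match those stated for Figure 2; the value k = 6 in the usage loop is an arbitrary choice for illustration.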

The optimal k values

k_{o p t} (θ)

, for which the false positive probabilities are the lowest, are shown using a dashed cyan line in Figure 1. As can be seen in the figure,

k_{o p t} (θ)

is not the same as, or even close to,

k_{o p t}^{t r a d} (θ) = k_{o p t} (1)

, shown as a solid vertical orange line in Figure 1, when

θ > 1

. Before proposing a systematic method for finding

k_{o p t} (θ)

for general values of

θ

, a rigorous analysis will be presented that shows that only one such value exists.

3.2. Uniqueness of Optimal Number of Hash Functions

A sequence of lemmas is used to prove that there exists a unique value of

k_{o p t}

for which the false positive probability is minimized. To follow this proof process, the reader is advised to refer to the plots in Figure 3 and Figure 4 when reading the following lemmas.

Since the optimal false positive probability point occurs when its slope is 0, the proof starts by taking the derivative of

{\hat{p}}_{f p} (θ, k, n, m)

with respect to k. To find the shape of the derivative of

{\hat{p}}_{f p}

, the logarithm of

{\hat{p}}_{f p}

can be used.

ln {\hat{p}}_{f p} (θ, k, n, m) = k ln (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) .

(4)

By taking the derivative of Equation (4),

\frac{\frac{\partial}{\partial k} {\hat{p}}_{f p} (θ, k, n, m)}{{\hat{p}}_{f p} (θ, k, n, m)} = ln (1 - \frac{Γ (θ, \frac{k n}{m})}{Γ (θ, 0)}) + k \frac{\partial}{\partial k} ln (1 - \frac{Γ (θ, \frac{k n}{m})}{Γ (θ, 0)}) .

(5)

By Leibniz’s rule and the definition of the incomplete gamma function [21],

\frac{\partial}{\partial k} Γ (θ, \frac{k n}{m}) = - \frac{n}{m} {(\frac{k n}{m})}^{θ - 1} e^{- \frac{k n}{m}} .

(6)

Therefore, by applying Equation (6) to the right side of Equation (5) and multiplying both sides of Equation (5) by

{\hat{p}}_{f p} (θ, k, n, m)

,

\frac{\partial}{\partial k} {\hat{p}}_{f p} (θ, k, n, m) = {\hat{p}}_{f p} (θ, k, n, m) (ln (1 - \frac{Γ (θ, \frac{k n}{m})}{Γ (θ, 0)}) + \frac{{(\frac{k n}{m})}^{θ} e^{- \frac{k n}{m}}}{Γ (θ, 0) - Γ (θ, \frac{k n}{m})}) .

For

k_{o p t} (θ)

, this derivative should be set to 0. Since

{\hat{p}}_{f p} > 0

, the second part must be 0. Then, multiplying this second part by a common factor and denoting this term as

g (θ, κ)

, the following equations and lemmas follow.

g (θ, κ) = (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) ln (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) + \frac{κ^{θ} e^{- κ}}{Γ (θ, 0)}

To determine whether g is a decreasing or increasing function, the derivative of g is needed.

\frac{\partial}{\partial κ} g (θ, κ) = \frac{κ^{θ - 1} e^{- κ}}{Γ (θ, 0)} (1 + θ + ln (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) - κ) .

(7)

In Equation (7), the first part is greater than 0. Thus, g is a decreasing or increasing function depending on the polarity of

1 + θ + ln (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) - κ

. Let

y_{1} (κ) = ln (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)})

and

y_{2} (κ) = (κ - θ - 1)

. The

y_{1}

and

y_{2}

terms are defined in this manner in order to facilitate the examination of the exact conditions under which g is a decreasing or increasing function, and thereby determine the conditions for the changes in slope of the false positive probability function. Then,

\frac{\partial}{\partial κ} g (θ, κ) = \frac{κ^{θ - 1} e^{- κ}}{Γ (θ, 0)} (y_{1} (κ) - y_{2} (κ))

Examples of the shapes of

y_{1}

and

y_{2}

are shown in Figure 3. An example of the

g (θ, κ)

function is shown in Figure 4.
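For readers who wish to verify Equation (7), the derivative of g can be worked out term by term. The shorthand F(κ) below is introduced only for this sketch and is not notation used elsewhere in the analysis.

```latex
\begin{align*}
F(\kappa) &= 1 - \frac{\Gamma(\theta,\kappa)}{\Gamma(\theta,0)}, \qquad
F'(\kappa) = \frac{\kappa^{\theta-1} e^{-\kappa}}{\Gamma(\theta,0)}, \\
\frac{\partial}{\partial\kappa}\bigl[F \ln F\bigr]
  &= F'(\kappa)\,\bigl(1 + \ln F(\kappa)\bigr), \\
\frac{\partial}{\partial\kappa}\left[\frac{\kappa^{\theta} e^{-\kappa}}{\Gamma(\theta,0)}\right]
  &= \frac{\kappa^{\theta-1} e^{-\kappa}}{\Gamma(\theta,0)}\,(\theta-\kappa)
   = F'(\kappa)\,(\theta-\kappa), \\
\frac{\partial}{\partial\kappa}\, g(\theta,\kappa)
  &= F'(\kappa)\,\bigl(1 + \theta + \ln F(\kappa) - \kappa\bigr)
   = F'(\kappa)\,\bigl(y_{1}(\kappa) - y_{2}(\kappa)\bigr).
\end{align*}
```

Because F'(κ) > 0 for κ > 0, the sign of the derivative is determined entirely by the sign of y_{1} (κ) - y_{2} (κ).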

Lemma 1.

For a fixed value of θ,

y_{1}

is a strictly increasing concave function of κ.

Proof of Lemma 1.

Using the definition of an incomplete Gamma function, the first partial derivative of

y_{1}

can be shown to be greater than zero, i.e.,

\frac{\partial y_{1}}{\partial κ} = \frac{κ^{θ - 1}}{e^{κ} (Γ (θ, 0) - Γ (θ, κ))} > 0

Then, using the second partial derivative, it can also be verified that

y_{1}

is concave when

κ \geq θ - 1

.

\frac{\partial^{2} y_{1}}{\partial κ^{2}} = \frac{κ^{θ - 2} Γ (θ, 0)}{e^{κ}} \frac{(θ - 1 - κ) (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) - \frac{κ^{θ} e^{- κ}}{Γ (θ, 0)}}{{(Γ (θ, 0) - Γ (θ, κ))}^{2}} .

(8)

Now consider the situation when

κ < θ - 1

. The following are well-known properties of a Poisson distribution with mean

κ

, denoted by

P o i_{κ} (X = θ)

[22],

P o i_{κ} (X = θ) < P o i_{κ} (X \geq θ) < \frac{P o i_{κ} (X = θ)}{1 - \frac{κ}{θ + 1}} .

(9)

Then, based on [21] and applying Equation (9) to

(θ - 1 - κ) (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) - \frac{κ^{θ} e^{- κ}}{Γ (θ, 0)}

in Equation (8),

(θ - 1 - κ) P o i_{κ} (X \geq θ) - θ P o i_{κ} (X = θ) < (\frac{θ - 1 - κ}{1 - \frac{κ}{θ + 1}} - θ) P o i_{κ} (X = θ) = - \frac{1 + κ + θ}{θ + 1 - κ} P o i_{κ} (X = θ) < 0 .

Thus,

\frac{\partial^{2} y_{1}}{\partial κ^{2}} < 0

when

κ < θ - 1

. Therefore,

y_{1}

is a strictly increasing concave function for all values of

κ

. □
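The two algebraic facts used in the case κ < θ - 1 of the proof above can be checked as follows. This sketch assumes an integer threshold θ, so that Γ(θ, 0) = (θ - 1)!.

```latex
\begin{align*}
\frac{\kappa^{\theta} e^{-\kappa}}{\Gamma(\theta,0)}
  &= \theta\,\frac{\kappa^{\theta} e^{-\kappa}}{\theta!}
   = \theta\,\mathrm{Poi}_{\kappa}(X=\theta), \\
\frac{\theta-1-\kappa}{1-\frac{\kappa}{\theta+1}} - \theta
  &= \frac{(\theta-1-\kappa)(\theta+1) - \theta(\theta+1-\kappa)}{\theta+1-\kappa}
   = -\frac{1+\kappa+\theta}{\theta+1-\kappa}.
\end{align*}
```

The last expression is negative because κ < θ - 1 implies θ + 1 - κ > 0.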

Lemma 2.

y_{1} (κ = \frac{θ + 1}{2}) > y_{2} (κ = \frac{θ + 1}{2})

Proof of Lemma 2.

Lemma 2 is equivalent to

\frac{θ + 1}{2} + ln (1 - \frac{Γ (θ, \frac{θ + 1}{2})}{Γ (θ, 0)}) > 0 .

From [21], by putting

\frac{θ + 1}{2}

into the

ln ()

function,

ln (\sum_{l \geq θ} \frac{{(\frac{θ + 1}{2})}^{l}}{l!}) > 0,

which is equivalent to

\sum_{l \geq θ} \frac{{(\frac{θ + 1}{2})}^{l}}{l!} > 1 .

(10)

Then, by the Stirling inequality,

\frac{{(\frac{θ + 1}{2})}^{θ}}{θ!} \geq \frac{1}{e \sqrt{θ}} {(\frac{e}{θ})}^{θ} {(\frac{θ + 1}{2})}^{θ} = \frac{1}{e \sqrt{θ}} {(\frac{e}{2})}^{θ} {(1 + \frac{1}{θ})}^{θ} .

The function on the right hand side decreases from θ = 0 to θ = 1 and increases from θ = 1 to ∞. The minimum value of this function is therefore 1, which occurs at θ = 1. Thus,

\frac{{(\frac{θ + 1}{2})}^{θ}}{θ!} \geq 1 .

Therefore,

\sum_{l \geq θ} \frac{{(\frac{θ + 1}{2})}^{l}}{l!} > \frac{{(\frac{θ + 1}{2})}^{θ}}{θ!} \geq 1 .

□

Lemma 3.

lim_{κ \to 0^{+}} g (θ, κ) = 0

, and

g (θ, κ)

is a strictly decreasing function for

κ \in (0, κ_{1})

, where

κ_{1} \in (0, \frac{θ + 1}{2})

.

Proof of Lemma 3.

From L’Hopital’s rule,

lim_{x \to 0^{+}} x ln x = 0

. This implies that

lim_{κ \to 0^{+}} (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) ln (1 - \frac{Γ (θ, κ)}{Γ (θ, 0)}) = 0

. Therefore,

lim_{κ \to 0^{+}} g (θ, κ) = 0

.

As

κ

approaches 0 from the right,

lim_{κ \to 0^{+}} y_{1} (κ) = - \infty

, and

lim_{κ \to 0^{+}} y_{2} (κ) = - θ - 1

. This implies that

lim_{κ \to 0^{+}} y_{1} (κ) < lim_{κ \to 0^{+}} y_{2} (κ)

. On the other hand, by Lemma 2,

y_{1} (κ = \frac{θ + 1}{2}) > y_{2} (κ = \frac{θ + 1}{2})

. Therefore, by the intermediate value theorem, there exists a point

κ_{1} \in (0, \frac{θ + 1}{2})

that satisfies

y_{1} (κ_{1}) = y_{2} (κ_{1})

. Then, since

y_{1} (κ) < y_{2} (κ)

as

κ \to 0^{+}

,

y_{2}

is a straight line, and Lemma 1 states that

y_{1}

is a strictly increasing function of

κ

,

y_{1} (κ) < y_{2} (κ)

for

κ \in (0, κ_{1})

. Thus,

\frac{\partial}{\partial κ} g (θ, κ) < 0

for

κ \in (0, κ_{1})

. □

Lemma 4.

The function

g (θ, κ)

is a strictly increasing function for

κ \in (κ_{1}, κ_{2})

, where

\frac{θ + 1}{2} < κ_{2} < θ + 1

.

Proof of Lemma 4.

From Lemma 2 again,

y_{1} (κ = \frac{θ + 1}{2}) > y_{2} (κ = \frac{θ + 1}{2})

. On the other hand, for all real positive values

κ

,

y_{1} (κ) < 0

, whereas

y_{2} (θ + 1) = 0

. Thus,

y_{1} (θ + 1) < y_{2} (θ + 1)

. Therefore, by the intermediate value theorem again, there is a point

\frac{θ + 1}{2} < κ_{2} < θ + 1

that satisfies

y_{1} (κ_{2}) = y_{2} (κ_{2})

. Finally, in the interval of

κ \in (κ_{1}, κ_{2})

,

\frac{\partial}{\partial κ} g (θ, κ) > 0

because

y_{1} (κ) > y_{2} (κ)

in this interval. □

Lemma 5.

The function

g (θ, κ)

is a strictly decreasing function for

κ \in (κ_{2}, \infty)

, and

lim_{κ \to \infty} g (θ, κ) = 0

.

Proof of Lemma 5.

By the definition of an incomplete gamma function,

Γ (θ, κ) = \int_{κ}^{\infty} t^{θ - 1} e^{- t} d t

, the limit of

Γ (θ, κ)

as

κ

approaches positive infinity is 0. This makes the left term of

g (θ, κ)

become

1 ln 1 = 0

when

κ

approaches positive infinity. The right term becomes 0 by L’Hopital’s rule. Therefore,

lim_{κ \to \infty} g (θ, κ) = 0

. From Lemma 1,

y_{1}

is strictly concave, and from the proof of Lemma 4,

y_{1} (κ_{2}) = y_{2} (κ_{2})

. Thus, for

κ > κ_{2}

,

\frac{\partial}{\partial κ} g (θ, κ) < 0

because

y_{1} (κ) < y_{2} (κ)

. □

Theorem 1.

Given n, m and a count threshold

θ > 0

, there exists a unique value

k = k_{o p t}

for which the false positive probability of a CBF has the minimum value. Furthermore,

k_{o p t} (θ) = ⌊ \frac{m}{n} κ^{*} (θ) ⌋

or

k_{o p t} (θ) = ⌈ \frac{m}{n} κ^{*} (θ) ⌉

.

Proof of Theorem 1.

In the interval of

κ \in (0, κ_{1}]

,

g (θ, κ) < 0

due to Lemma 3. Also,

g (θ, κ_{2}) > 0

because Lemma 5 states that

g (θ, κ)

is a strictly decreasing function from

κ_{2}

to ∞, at which point

g (θ, κ)

approaches zero. Then, due to Lemma 4, there is a unique value

κ^{*} \in (κ_{1}, κ_{2})

such that

g (θ, κ^{*}) = 0

. Finally, using the definition of

κ

, the theorem follows. □
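Since the lemmas imply that g is negative below κ^{*} and positive above it, the root can be located numerically by bisection. The sketch below is one possible way to do this and is not the authors' procedure. It assumes an integer θ, and it uses a lower bracket of 0.5 on the assumption that κ^{*} is at least ln 2 (the value for θ = 1) for every threshold; the upper bracket θ + 1 follows from Lemma 4.

```python
import math


def upper_tail(theta: int, kappa: float) -> float:
    """P(X >= theta) for X ~ Poisson(kappa), i.e. 1 - Γ(θ, κ)/Γ(θ, 0)."""
    # Summing the tail directly avoids cancellation when the tail is tiny.
    term = math.exp(-kappa + theta * math.log(kappa) - math.lgamma(theta + 1))
    total, l = 0.0, theta
    while l == theta or term > 1e-18 * total:
        total += term
        l += 1
        term *= kappa / l
    return total


def g(theta: int, kappa: float) -> float:
    """g(θ, κ) as defined above; its root in κ gives the optimal κ*."""
    f = upper_tail(theta, kappa)
    second = math.exp(-kappa + theta * math.log(kappa) - math.lgamma(theta))
    return f * math.log(f) + second


def kappa_star(theta: int, tol: float = 1e-10) -> float:
    # Bracket: g < 0 at 0.5 (assumed, see lead-in) and g > 0 at theta + 1.
    lo, hi = 0.5, theta + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(theta, mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for theta in (1, 2, 3, 4, 5):
        print(theta, kappa_star(theta))
```

For θ = 1 the root is ln 2, which recovers Equation (1), and for θ = 5 the result can be compared against the value of κ^{*} reported in the caption of Figure 4.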

3.3. Procedure for Determining the Optimal Number of Hash Functions

Based on the lemmas and theorem of the previous subsection, a general procedure to be used to find the optimal number of hash functions

k_{o p t} (θ)

is as follows. Start from

k = 1

and compute the false positive probability

{\hat{p}}_{f p}^{*} (θ, k, n, m)

using Equation (3). Then increment k by one and recompute the false positive probability. Continue until the false positive probability starts to increase or

k = (θ + 1) m / n

, whichever comes first. The k value that results in the minimum

{\hat{p}}_{f p}^{*} (θ, k, n, m)

is

k_{o p t} (θ)

.
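The incremental search can be written compactly as shown below. This is a sketch of the procedure under two stated assumptions: the approximate false positive probability is evaluated as a Poisson sum, and the search is capped at k = (θ + 1) m / n, which follows from κ^{*} < θ + 1 (Lemma 4) together with κ = k n / m. The function names are illustrative.

```python
import math


def p_fp_approx(theta: int, k: int, n: int, m: int) -> float:
    """Equation (3), evaluated as a Poisson sum with mean kappa = k n / m."""
    kappa = k * n / m
    below = sum(math.exp(-kappa) * kappa**l / math.factorial(l) for l in range(theta))
    return (1.0 - below) ** k


def k_opt(theta: int, n: int, m: int) -> int:
    """Increase k from 1 until the false positive probability stops decreasing."""
    k_max = max(1, math.ceil((theta + 1) * m / n))  # upper bound on the search
    best_k, best_p = 1, p_fp_approx(theta, 1, n, m)
    for k in range(2, k_max + 1):
        p = p_fp_approx(theta, k, n, m)
        if p >= best_p:
            break  # the minimum is unique (Theorem 1), so the search can stop
        best_k, best_p = k, p
    return best_k


if __name__ == "__main__":
    for theta in range(1, 6):
        print(theta, k_opt(theta, 1000, 4000))
```

Stopping at the first increase is justified by Theorem 1: because the minimum is unique, the false positive probability cannot decrease again once it has started to rise.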

The above procedure can be simplified by using precomputed tables or a linear approximation. Table 1 shows a table of precomputed optimal

κ^{*}

values for count thresholds

θ

ranging from 1 to 30. This table was created by following the procedure outlined above. Since

κ = k n / m

, this table can be used to determine

k_{o p t}

by simply using the relationship shown in Theorem 1; i.e.,

k_{o p t}

is either the floor or ceiling of

κ^{*} m / n

.

For large count thresholds

θ

, a straight line approximation can be used to determine

κ^{*}

, and thereby

k_{o p t}

. Figure 5 shows that the straight line approximation

{\hat{κ}}^{*} (θ) = 0.2037 θ + 0.9176

(11)

closely tracks

κ^{*} (θ)

for large

θ

values. By plotting the relative errors, as shown in Figure 6, it can be seen that there is a relative error of less than about 2 percent when

θ > 30

.
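Equation (11) and Theorem 1 together give the two candidate values of k directly. The short sketch below illustrates this; the function names are hypothetical, and the example threshold θ = 40 is chosen only because the approximation is intended for large θ.

```python
import math


def kappa_hat(theta: float) -> float:
    """Straight line approximation of Equation (11)."""
    return 0.2037 * theta + 0.9176


def k_candidates(theta: float, n: int, m: int) -> tuple:
    """Floor and ceiling of kappa_hat * m / n, the two candidates from Theorem 1."""
    x = kappa_hat(theta) * m / n
    return math.floor(x), math.ceil(x)


if __name__ == "__main__":
    print(kappa_hat(40))                 # approximate kappa* for theta = 40
    print(k_candidates(40, 1000, 4000))  # candidate k values for m / n = 4
```

The final choice between the two candidates still requires evaluating Equation (3) at both values and keeping the one with the lower false positive probability.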

4. Simulation Results

Simulations were conducted to verify the proposed theoretical analysis. A simulation program was written in Java for a general CBF with n data entries, m CBF elements, and k hash functions. The hash functions were created as uniform random distributions between 0 and

m - 1

using the pseudorandom number generator provided by the java.util.Random class and stored in tables so that they could be reused during hashing. Care was taken to ensure that the hash functions created were orthogonal to each other. This simulator program is freely available for all interested readers.

Figure 7 shows the false positive simulation results obtained with an example set of n, m, and count threshold

θ

values. The open triangle, circle, square, and diamond marks refer to the simulation results

{\bar{p}}_{f p}

while the solid curves show the expected false positive probabilities

{\hat{p}}_{f p}

.

Each simulation result, which was the average of 100 simulation runs, was obtained in the following manner. The CBF was initialized by setting all m CBF entries to 0. Then, n data entries were generated randomly. For each data entry, the k hash functions, implemented as table lookups into precomputed random hashes (as described above), were applied and used to increment the CBF entries corresponding to the hash function outputs. Next, n_{q} = n queries were randomly generated and the k hash functions were applied to each of those queries. Each query resulted in a “true” answer if all k CBF elements mapped by the k hash functions were greater than or equal to the count threshold θ. Finally, the number of false positives generated in this manner was counted and divided by n_{q} to produce the false positive probability.
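A much smaller version of this experiment can be sketched in Python. It is not the authors' Java simulator: the sketch assumes that each stored entry is distinct and that every query is a non-member, and it models each hash output as an independent uniform draw from Python's `random` module instead of looking up precomputed hash tables.

```python
import math
import random


def simulate_fp_rate(theta, k, n, m, n_queries, seed=0):
    """Estimate the false positive rate of a CBF used with count threshold theta."""
    rng = random.Random(seed)
    counts = [0] * m  # all m CBF entries start at 0
    # n distinct entries, each incrementing k uniformly random CBF locations.
    for _ in range(n * k):
        counts[rng.randrange(m)] += 1
    false_positives = 0
    for _ in range(n_queries):
        # A non-member query is a false positive if all k counters reach theta.
        if all(counts[rng.randrange(m)] >= theta for _ in range(k)):
            false_positives += 1
    return false_positives / n_queries


def p_fp_approx(theta, k, n, m):
    """Equation (3), evaluated as a Poisson sum with mean kappa = k n / m."""
    kappa = k * n / m
    below = sum(math.exp(-kappa) * kappa**l / math.factorial(l) for l in range(theta))
    return (1.0 - below) ** k


if __name__ == "__main__":
    n, m = 2000, 8000
    for k in range(1, 7):
        print(k, simulate_fp_rate(1, k, n, m, 20000, seed=k), p_fp_approx(1, k, n, m))
```

With so few queries, only false positive probabilities well above 1 / n_queries can be estimated meaningfully, which mirrors the limitation discussed for the very low probabilities in Figure 7.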

Figure 7 shows the false positive rate simulation and analysis results, as a function of k, with

n = 10

million,

m = 40

million, and

θ

values ranging from 1 to 5. As can be seen from the figure, the simulation results closely match the theoretical results, with slight variations only visible for exceedingly low false positive probabilities of

10^{- 7}

or smaller. Exceedingly low false positive probabilities imply rare occurrences of false positives, thus requiring longer simulation runs to obtain accurate results. Even lower false positive probabilities (smaller than

10^{- 9}

) then result in zero occurrences of false positive events in our simulations. Thus, simulation results were not recorded, since false positive events did not occur, for

θ = 5

and

k = 4

through 10 in Figure 7.

5. Conclusions

This paper has investigated the problem of determining the optimal parameter values to be used for counting bloom filters used in applications requiring approximate count thresholds. Rigorous analysis has led to a highly accurate equation for the false positive probability, with relative errors of less than 0.48% given typical parameter values. It has also been proven that there exists a unique number of hash functions

k_{o p t}

for which the minimum false positive probability is obtained. Next, a systematic procedure based on precomputed tables and a linear approximation has been presented for finding

k_{o p t}

. Finally, realistic simulations modeling the use of a CBF for a count thresholding application have been used to show that the theoretical analysis closely models actual CBF behavior.

Author Contributions

Conceptualization, K.K. and S.L.; methodology, K.K. and Y.J.; software, K.K.; validation, K.K.; formal analysis, K.K.; investigation, K.K. and S.L.; resources, K.K., Y.J. and S.L.; data curation, K.K. and Y.J.; writing—original draft preparation, K.K. and S.L.; writing—review and editing, K.K., Y.L. and S.L.; visualization, K.K.; supervision, Y.L. and S.L.; project administration, S.L.; funding acquisition, S.L.

Funding

This research was funded by Samsung Electronics, Samsung Research Funding and Incubation Center of Samsung Electronics under Project Number SRFC-TB1703-07.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BF	Bloom filter
CBF	Counting bloom filter
pmf	Probability mass function
cmf	Cumulative mass function

References

1. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426.
2. Guo, D.; Liu, Y.; Li, X.; Yang, P. False negative problem of counting bloom filter. IEEE Trans. Knowl. Data Eng. 2010, 22, 651–664.
3. Ghosh, M.; Ozer, E.; Ford, S.; Biles, S.; Lee, H.H.S. Way Guard: A segmented counting Bloom filter approach to reducing energy for set-associative caches. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), San Francisco, CA, USA, 19–21 August 2009; pp. 165–170.
4. Yun, J.; Lee, S.; Yoo, S. Dynamic wear leveling for phase-change memories with endurance variations. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 1604–1615.
5. Maggs, B.M.; Sitaraman, R.K. Algorithmic nuggets in content delivery. ACM SIGCOMM Comput. Commun. Rev. 2015, 45, 52–66.
6. Lu, Y.; Montanari, A.; Prabhakar, B.; Dharmapurikar, S.; Kabbani, A. Counter braids: A novel counter architecture for per-flow measurement. ACM SIGMETRICS Perform. Eval. Rev. 2008, 36, 121–132.
7. Bonomi, F.; Mitzenmacher, M.; Panigrahy, R.; Singh, S.; Varghese, G. Beyond bloom filters: From approximate membership checks to approximate state machines. In Proceedings of the ACM SIGCOMM 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Pisa, Italy, 11–15 September 2006; Volume 36, pp. 315–326.
8. Dharmapurikar, S.; Krishnamurthy, P.; Taylor, D.E. Longest prefix matching using bloom filters. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Karlsruhe, Germany, 25–29 August 2003; pp. 201–212.
9. Song, H.; Hao, F.; Kodialam, M.; Lakshman, T. Ipv6 lookups using distributed and load balanced bloom filters for 100gbps core router line cards. In Proceedings of the IEEE INFOCOM 2009, Rio de Janeiro, Brazil, 19–25 April 2009; pp. 2518–2526.
10. Ficara, D.; Di Pietro, A.; Giordano, S.; Procissi, G.; Vitucci, F. Enhancing counting Bloom filters through Huffman-coded multilayer structures. IEEE/ACM Trans. Netw. (TON) 2010, 18, 1977–1987.
11. Fan, L.; Cao, P.; Almeida, J.; Broder, A.Z. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. (TON) 2000, 8, 281–293.
12. Chazelle, B.; Kilian, J.; Rubinfeld, R.; Tal, A. The Bloomier filter: An efficient data structure for static support lookup tables. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, New Orleans, LA, USA, 11–14 January 2004; pp. 30–39.
13. Moraru, I.; Andersen, D.G. Exact pattern matching with feed-forward bloom filters. J. Exp. Algorithm. (JEA) 2012, 17, 3–4.
14. Ho, J.T.L.; Lemieux, G.G. PERG: A scalable FPGA-based pattern-matching engine with consolidated bloomier filters. In Proceedings of the 2008 IEEE International Conference on Field-Programmable Technology, Taipei, Taiwan; 2008; pp. 73–80.
15. Melsted, P.; Pritchard, J.K. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 2011, 12, 333.
16. Sun, C.; Fan, J.; Shi, L.; Liu, B. A novel router-based scheme to mitigate SYN flooding DDoS attacks. IEEE INFOCOM (Student Poster). 2007. Available online: https://pdfs.semanticscholar.org/fdae/7b20d220a1c23f9f6c0f8464574f78ef55c0.pdf (accessed on 11 July 2019).
17. Tarkoma, S.; Rothenberg, C.E.; Lagerspetz, E. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Commun. Surv. Tutor. 2012, 14, 131–155.
18. Broder, A.; Mitzenmacher, M. Network Applications of Bloom Filters: A Survey. Int. Math. 2003, 1, 485–509.
19. Mitzenmacher, M.; Upfal, E. Probability and Computing: Randomized Algorithms and Probabilistic Analysis; Cambridge University Press: Cambridge, UK, 2005.
20. Papoulis, A.; Pillai, S.U. Probability, Random Variables, and Stochastic Processes; Tata McGraw-Hill Education: Pennsylvania Plaza, NY, USA, 2002; pp. 55–57.
21. Abramowitz, M.; Stegun, I. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables Applied Mathematics Series; National Bureau of Standards: Gaithersburg, MD, USA, 1964; pp. 260–265.
22. Klar, B. Bounds on tail probabilities of discrete distributions. Probab. Eng. Inf. Sci. 2000, 14, 161–171.

Figure 1. Plot of false positive probabilities with θ = 1 to θ = 5 vs. the number of hash functions k. The ratio m / n is set to four and the cyan dashed line shows k_{o p t} (θ). The sample points correspond to p_{f p} and the lines correspond to {\hat{p}}_{f p}. The two functions overlap almost 100 percent.

Figure 2. Plot of relative error between the exact and approximate false positive probabilities for $θ = 1$ to $θ = 5$ vs. the number of hash functions k. This plot uses $n = 1000$ and $m = 4000$; larger values of n and m result in slightly lower relative errors.
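
The comparison behind this kind of plot can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the standard CBF model in which each counter receives Binomial(nk, 1/m) increments ("exact") or approximately Poisson(kn/m) increments ("approximate"), and that a false positive occurs when all k counters of a query reach θ. The function names and the sign convention (exact minus approximate, divided by exact) are assumptions made here.

```python
import math

def binom_tail(trials, p, theta):
    """P[X >= theta] for X ~ Binomial(trials, p), as 1 minus the lower sum."""
    lower = sum(math.comb(trials, j) * p**j * (1 - p)**(trials - j)
                for j in range(theta))
    return 1.0 - lower

def poisson_tail(kappa, theta):
    """P[X >= theta] for X ~ Poisson(kappa), as 1 minus the lower sum."""
    lower = sum(math.exp(-kappa) * kappa**j / math.factorial(j)
                for j in range(theta))
    return 1.0 - lower

def p_fp_exact(n, m, k, theta):
    # Assumed model: all k probed counters independently reach theta.
    return binom_tail(n * k, 1.0 / m, theta) ** k

def p_fp_approx(n, m, k, theta):
    # Poisson approximation with kappa = k * n / m (assumed definition).
    return poisson_tail(k * n / m, theta) ** k

def relative_error(n, m, k, theta):
    exact = p_fp_exact(n, m, k, theta)
    return (exact - p_fp_approx(n, m, k, theta)) / exact

if __name__ == "__main__":
    n, m = 1000, 4000  # values quoted in the Figure 2 caption
    for theta in range(1, 6):
        print(theta, [f"{relative_error(n, m, k, theta):.2e}" for k in (1, 5, 10, 20)])
```

Running the script prints one row of relative errors per threshold, which can be compared qualitatively against the curves in the figure.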

Figure 3. A plot showing $y_{1}$ and $y_{2}$, used in Lemmas 1 and 2, as a function of $κ$ for $θ = 5$.

Figure 4. $g (θ, κ)$ as a function of $κ$ for $θ = 5$ and $m / n = 4$. $κ_{1} \approx 1.2262$, $κ_{2} \approx 5.5756 < 6$, and $κ^{*} \approx 1.6117$.

Figure 5. The functions $κ^{*} (θ)$ and ${\hat{κ}}^{*} (θ)$ as functions of $θ$.

Figure 6. Plot of relative error between $κ^{*} (θ)$ and ${\hat{κ}}^{*} (θ)$ vs. $θ$.

Figure 7. Plot of theoretical false positive probabilities (${\hat{p}}_{fp}$) and simulation results (${\bar{p}}_{fp}$) vs. k from 1 to 20, where n = 10,000,000, m = 40,000,000, $θ = 1, 2, 3, 4, 5$, and $n_{q} = n$.
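
A scaled-down version of such a simulation might look like the sketch below. It is an illustration under stated assumptions, not the authors' setup: hash functions are modeled as ideal (each one picks a uniformly random counter), each query stands for an element that was never inserted, and the sizes are far smaller than those in the caption. The name `simulate_fp` and the default seed are hypothetical.

```python
import random

def simulate_fp(n, m, k, theta, n_q, seed=0):
    """Estimate the false positive rate of a counting Bloom filter
    used for count thresholding, assuming ideal (uniform) hashing."""
    rng = random.Random(seed)
    counters = [0] * m

    # Insert n elements: each insertion increments k counters.
    for _ in range(n):
        for _ in range(k):
            counters[rng.randrange(m)] += 1

    # Query n_q elements that were never inserted. A false positive
    # occurs when all k probed counters are at least theta.
    hits = 0
    for _ in range(n_q):
        if all(counters[rng.randrange(m)] >= theta for _ in range(k)):
            hits += 1
    return hits / n_q

if __name__ == "__main__":
    n, m = 1000, 4000  # same m / n = 4 ratio as in the caption, smaller scale
    for theta in (1, 2, 3):
        print(theta, simulate_fp(n, m, k=3, theta=theta, n_q=20000))
```

For θ = 1 and k = 3 the estimate should land near (1 − e^{−0.75})^3 ≈ 0.147, the classical Bloom filter value. Because a single filter instance is used, some run-to-run spread is expected.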

Table 1. Values of $κ^{*} (θ)$, ${\hat{κ}}^{*} (θ)$, and the relative error between $κ^{*} (θ)$ and ${\hat{κ}}^{*} (θ)$.

$θ$	$κ^{*} (θ)$	${\hat{κ}}^{*} (θ)$	$\frac{κ^{*} (θ) - {\hat{κ}}^{*} (θ)}{κ^{*} (θ)}$	$θ$	$κ^{*} (θ)$	${\hat{κ}}^{*} (θ)$	$\frac{κ^{*} (θ) - {\hat{κ}}^{*} (θ)}{κ^{*} (θ)}$	$θ$	$κ^{*} (θ)$	${\hat{κ}}^{*} (θ)$	$\frac{κ^{*} (θ) - {\hat{κ}}^{*} (θ)}{κ^{*} (θ)}$
1	0.6931	1.1213	−0.6177	11	2.9099	3.1583	−0.0854	21	5.0183	5.1953	−0.0353
2	0.9326	1.3250	−0.4207	12	3.1228	3.3620	−0.0766	22	5.2274	5.3990	−0.0328
3	1.1635	1.5287	−0.3139	13	3.3351	3.5657	−0.0691	23	5.4362	5.6027	−0.0306
4	1.3893	1.7324	−0.2469	14	3.5469	3.7694	−0.0627	24	5.6448	5.8064	−0.0286
5	1.6117	1.9361	−0.2013	15	3.7582	3.9731	−0.0572	25	5.8533	6.0101	−0.0268
6	1.8317	2.1398	−0.1682	16	3.9690	4.1768	−0.0524	26	6.0616	6.2138	−0.0251
7	2.0498	2.3435	−0.1433	17	4.1795	4.3805	−0.0481	27	6.2697	6.4175	−0.0236
8	2.2664	2.5472	−0.1239	18	4.3896	4.5842	−0.0443	28	6.4776	6.6212	−0.0222
9	2.4818	2.7509	−0.1084	19	4.5995	4.7879	−0.0410	29	6.6854	6.8249	−0.0209
10	2.6963	2.9546	−0.0958	20	4.8090	4.9916	−0.0380	30	6.8931	7.0286	−0.0197
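
The $κ^{*} (θ)$ column can be spot-checked numerically with the sketch below. It assumes that $κ = kn/m$ and that the approximate false positive probability has the Poisson form $(P[X ≥ θ])^{k}$ with $X$ ~ Poisson($κ$). Under that assumption, minimizing over k is equivalent to minimizing $κ \ln P[X ≥ θ]$, which does not depend on $m/n$. The ternary search and all function names are choices made here, not the paper's procedure. The hard-coded rows are copied from Table 1.

```python
import math

def poisson_upper_tail(kappa, theta):
    """P[X >= theta] for X ~ Poisson(kappa), summed upward from j = theta
    to avoid cancellation when the tail is tiny."""
    term = math.exp(-kappa) * kappa**theta / math.factorial(theta)
    total = 0.0
    j = theta
    while term > 1e-17 * total:
        total += term
        j += 1
        term *= kappa / j
    return total

def objective(kappa, theta):
    # Proportional to ln(p_fp) under the assumed Poisson form.
    return kappa * math.log(poisson_upper_tail(kappa, theta))

def kappa_star(theta, iters=100):
    """Ternary search for the minimizer, assuming a single minimum
    inside (0, theta + 1)."""
    lo, hi = 1e-3, theta + 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1, theta) < objective(m2, theta):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# (kappa*, kappa-hat*) pairs copied from Table 1 for a few thresholds.
TABLE_ROWS = {1: (0.6931, 1.1213), 5: (1.6117, 1.9361),
              10: (2.6963, 2.9546), 30: (6.8931, 7.0286)}

if __name__ == "__main__":
    for theta, (ks, kh) in TABLE_ROWS.items():
        rel = (ks - kh) / ks  # relative error as defined in the table header
        print(theta, round(kappa_star(theta), 4), ks, round(rel, 4))
```

For θ = 1 the search returns ln 2 ≈ 0.6931, the classical Bloom filter optimum. With $m/n = 4$, the corresponding number of hash functions is $k = κ^{*} m / n$, for example about 6.45 for θ = 5.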

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

