Electronics 2019, 8(7), 779; https://doi.org/10.3390/electronics8070779
Analysis of Counting Bloom Filters Used for Count Thresholding
Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
Author to whom correspondence should be addressed.
Received: 7 June 2019 / Accepted: 9 July 2019 / Published: 11 July 2019
A bloom filter is an extremely useful tool applicable to various fields of electronics and computers; it enables highly efficient search of extremely large data sets with no false negatives but a possibly small number of false positives. A counting bloom filter is a variant of a bloom filter that is typically used to permit deletions as well as additions of elements to a target data set. However, it is also sometimes useful to use a counting bloom filter as an approximate counting mechanism that can be used, for example, to determine when a specific web page has been referenced more than a specific number of times or when a memory address is a “hot” address. This paper derives, for the first time, highly accurate approximate false positive probabilities and optimal numbers of hash functions for counting bloom filters used in count thresholding applications. The analysis is confirmed by comparisons to existing theoretical results, which show an error, with respect to exact analysis, of less than 0.48% for typical parameter values.
Keywords: counting bloom filter; database search; count thresholding; hash function
1. Introduction
A bloom filter (BF) is a powerful tool that can be used to create novel, low-overhead methods for dealing with big data sets in various software/hardware applications. Proposed by Burton Howard Bloom in 1970 [1], a BF is an m-bit vector that is initialized to 0. Assuming that n elements are stored into a data set, each time a new element is stored, k hash functions, each of which maps the element to one of m bit locations, are applied and the corresponding bits in the BF are set to 1. To determine if a new unknown element is a member of the data set or not, the k hash functions are applied to that element and the corresponding bits in the BF are checked. A positive answer is returned if all of those bits are found to be 1. Since the search time is independent of the number of elements in the data set, this results in extremely space-efficient and fast search of large data sets. Although false positives are possible, false negatives are not, since the element could not have been stored in the data set if any of the k hashed bits in the BF is 0.
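The insert and query operations described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `BloomFilter`, the method names, and the use of salted SHA-256 digests as the k hash functions are assumptions made for this example and are not taken from the paper.

```python
import hashlib


class BloomFilter:
    """Illustrative m-bit Bloom filter with k hash functions.

    The k hash functions are emulated with salted SHA-256 digests,
    which is an assumption of this sketch.
    """

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity only

    def _positions(self, item: str):
        # Map the item to k locations in the range 0 .. m-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for url in ("a.example/index", "b.example/home"):
    bf.add(url)
print(bf.might_contain("a.example/index"))  # True: stored elements are always found
```

A stored element is always reported as present, whereas an element that was never stored may occasionally be reported as present when all of its k bits happen to have been set by other elements.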
Over the years, various BF variants have been proposed. One commonly cited variant is a counting bloom filter (CBF), which has been proposed as a method that supports deletion as well as addition of elements to a data set [2]. In a CBF, the m elements are multiple-bit counts instead of bits. Every time an element is added to or deleted from the data set, k hash functions are applied to the element and all of the k locations in the m-element CBF are incremented or decremented by one. In order to check if a specific element is still in the data set, the k hashed locations in the CBF can be checked for the absence of 0 values. Several methods have been proposed to enable efficient support of m-element CBFs [3,4].
BFs and CBFs have been found to be useful for numerous applications. For example, in a web cache, instead of storing all web objects that are accessed, disk writes can be significantly reduced by only storing those web objects that are referred to more than once, thereby eliminating “one-hit wonders” (accessed by a set of users once and never again). A BF can be used to quickly determine if a specific web page has been referenced before or not. For long-term usage, a CBF can be used instead of a BF in order to permit deletion of long unused web objects from the web cache. This approach has been found to reduce the rate of disk writes by nearly half in an actual system of servers [5].
Other applications where BFs and CBFs have been found to be useful include communications and networking applications [6,7,8,9], Huffman coding [10], cache architecture design [3,11], memory wear leveling [4], and string search for DNA sequence identification [12,13,14,15].
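A minimal sketch of the CBF operations described in this paragraph is given below. The class name `CountingBloomFilter`, the plain Python list used for the counters, and the salted SHA-256 hashing are assumptions of this example; practical implementations typically use small fixed-width counters.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative m-element counting Bloom filter with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts = [0] * m  # unbounded integer counters, for simplicity

    def _positions(self, item: str):
        # Salted SHA-256 digests stand in for the k hash functions (assumption).
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item: str) -> None:
        # Only meaningful for items that were previously added.
        for pos in self._positions(item):
            self.counts[pos] -= 1

    def might_contain(self, item: str) -> bool:
        # Present only if none of the k hashed counters is 0.
        return all(self.counts[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=2048, k=3)
cbf.add("object-1")
cbf.add("object-2")
cbf.remove("object-1")
print(cbf.might_contain("object-2"))  # True: object-2 is still stored
```

Deleting an element simply reverses the increments made when it was added, so the counters touched by the remaining elements stay nonzero.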
Given that the elements in a CBF contain approximate “counts”, this paper examines the problem of using a CBF as an approximate counting mechanism, in particular to check whether a certain data element has been referred to θ or more times, where θ is a count threshold. For example, when applied to the above web cache application, a small count threshold could be used to eliminate one-hit and two-hit web objects; i.e., disk write usage could be reduced by only storing web objects that have been accessed two or more times. As a second example, when applied to memory management in a computer system, a suitable θ value could be used to identify hot memory addresses. As a third example, when applied to DNA sequence identification, approximate count thresholding could be used to quickly identify or match specific strings in a DNA sequence that occur more than θ times. As a fourth example, approximate count thresholding could be used to determine if a set of nodes are accessing a given web page a large number of times within a short timespan and thus help guard that web page against distributed denial of service (DDOS) attacks [16].
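The count thresholding query discussed above can be illustrated with the following sketch, in which an element is reported as having reached the threshold when all k of its hashed counters are at least θ. The function names, the parameter name `theta`, and the salted SHA-256 hashing are hypothetical choices made for this example.

```python
import hashlib


def positions(item: str, k: int, m: int):
    """Return the k counter locations for an item (salted SHA-256 is an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]


def record_access(counters, item: str, k: int) -> None:
    """Increment the k counters of the item by one."""
    for pos in positions(item, k, len(counters)):
        counters[pos] += 1


def reached_threshold(counters, item: str, k: int, theta: int) -> bool:
    """True if all k counters are >= theta; this may be a false positive."""
    return all(counters[pos] >= theta for pos in positions(item, k, len(counters)))


counters = [0] * 4096
for _ in range(3):
    record_access(counters, "page-A", k=4)
record_access(counters, "page-B", k=4)

print(reached_threshold(counters, "page-A", k=4, theta=3))
print(reached_threshold(counters, "page-B", k=4, theta=3))
```

An element that has truly been referenced θ or more times is always reported, since each of its counters has been incremented at least θ times. An element referenced fewer times is reported only when other elements have raised all of its counters, which is the false positive event analyzed in this paper.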
The main motivation for this paper is to provide a solid theoretical foundation for the use of CBFs for count thresholding applications. Towards this end, Section 2 introduces the traditional BF and CBF analysis. Then Section 3 presents the newly proposed CBF analysis method and its main results. Next, Section 4 uses simulations to confirm this new analysis method. Finally, Section 5 concludes this paper.
2. Previous Bloom Filter (BF) and Counting Bloom Filter (CBF) Analysis
The traditional BF analysis method proceeds as follows [17,18]. For insertion of n elements into a data set, an m-bit BF, initialized with all 0 bits, is used. Every time an element is inserted into the data set, k hash functions are applied and the corresponding bits in the m-bit BF are set to 1. After the first element is inserted into the data set and the first hash function is used to set one bit of the BF, an arbitrary bit in the BF is 0 with probability $1 - 1/m$. After all k hash functions are used, an arbitrary bit in the BF is still 0 with probability $(1 - 1/m)^{k}$. Thus, after all n elements are inserted and all k hash functions applied to each of those n elements, the probability that an arbitrary bit in the BF is 0 is $(1 - 1/m)^{kn} \approx e^{-kn/m}$, where the approximation is based on the definition of e [17,18].
If a user wishes to determine if an element is present in the data set or not, the k hash functions are applied and all k bits in the BF are checked. If an element is not in the data set, an erroneous result is produced when all k hashed bits in the BF are unity. Using the above approximation, the false positive probability is approximately $(1 - e^{-kn/m})^{k}$. An important BF parameter that must be selected is the number of hash functions, k, to be used. For this purpose, the k value that is typically used is the one that minimizes the false positive probability. Thus, based on the above approximation,
$$k = \frac{m}{n} \ln 2. \tag{1}$$
The careful reader will note that the above analysis is not strictly correct as it assumes independence of the values of bits in the BF, even if the k hashed locations are for an element in the data set . However, using Chernoff bounds, Mitzenmacher and Upfal have shown that, for large m and n, the same result is obtained even without the independence assumption . Therefore, as can be easily verified by the reader, given sufficiently large n and m (e.g., and or larger), the above equations are highly accurate and k can be chosen based on Equation (1).
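As a quick numerical illustration of the traditional BF analysis, the sketch below evaluates the standard approximation $(1 - e^{-kn/m})^{k}$ and the real-valued optimum $(m/n) \ln 2$, then compares the two neighboring integer values of k. The function names and the parameter values n = 1,000,000 and m = 8,000,000 are arbitrary choices for this example.

```python
import math


def bf_false_positive(n: int, m: int, k: int) -> float:
    """Approximate BF false positive probability (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def bf_optimal_k(n: int, m: int) -> float:
    """Real-valued k that minimizes the approximation: (m/n) ln 2."""
    return (m / n) * math.log(2)


n, m = 1_000_000, 8_000_000
k_real = bf_optimal_k(n, m)
candidates = (math.floor(k_real), math.ceil(k_real))
best_k = min(candidates, key=lambda k: bf_false_positive(n, m, k))

# The underlying approximation (1 - 1/m)^{kn} ~ e^{-kn/m} is very tight here.
gap = abs((1.0 - 1.0 / m) ** (best_k * n) - math.exp(-best_k * n / m))

print(k_real, best_k, bf_false_positive(n, m, best_k), gap)
```

Since k must be an integer in practice, the false positive probability is evaluated at the floor and ceiling of the real-valued optimum and the better of the two is used.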
Using the same assumptions as [18,19], the above analysis can be extended to a CBF. To do this, it is noted that after n elements have been inserted into a data set and uniformly random hash mappings have been used to increment the values in an m-element CBF, the probability that an arbitrary element in the CBF has the value l is simply defined by the probability mass function (pmf) of a binomial distribution with $kn$ trials and success probability $1/m$. Denoting this as $b(l; kn, 1/m)$,
$$b(l; kn, 1/m) = \binom{kn}{l} \left(\frac{1}{m}\right)^{l} \left(1 - \frac{1}{m}\right)^{kn-l}.$$
Note that when $l = 0$, which corresponds to checking whether an arbitrary element in the CBF is 0, this equation simplifies to $(1 - 1/m)^{kn}$.
Suppose an m-element CBF is used to determine if an element has been referenced θ or more times. After insertion of n elements in a data set, the probability that an arbitrary element of the CBF has a value less than θ is simply the sum of $b(l; kn, 1/m)$ from $l = 0$ to $l = \theta - 1$. Thus, the false positive probability with count threshold θ is as follows.
$$p_{fp} = \left(1 - \sum_{l=0}^{\theta-1} b(l; kn, 1/m)\right)^{k}. \tag{2}$$
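The exact expression described above, one minus a sum of binomial probabilities raised to the power k, can be evaluated directly for moderate parameter values. The sketch below is one possible way to do so; the function names are hypothetical, and the binomial terms are generated with a simple recurrence (each term obtained from the previous one) instead of explicit binomial coefficients, which is an implementation choice of this example.

```python
import math


def counter_below_threshold(trials: int, p: float, theta: int) -> float:
    """P(counter < theta) for a counter that is Binomial(trials, p)."""
    term = math.exp(trials * math.log1p(-p))  # P(counter = 0) = (1 - p)^trials
    total = 0.0
    for l in range(theta):
        total += term
        # pmf(l + 1) = pmf(l) * (trials - l) / (l + 1) * p / (1 - p)
        term *= (trials - l) / (l + 1) * p / (1.0 - p)
    return total


def exact_false_positive(n: int, m: int, k: int, theta: int) -> float:
    """Exact false positive probability under the independence assumption."""
    below = counter_below_threshold(k * n, 1.0 / m, theta)
    return (1.0 - below) ** k


for theta in (1, 2, 3):
    print(theta, exact_false_positive(n=1000, m=4000, k=3, theta=theta))
```

With θ = 1 the expression reduces to the traditional BF result, and raising the threshold lowers the false positive probability for fixed n, m, and k.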
3. Proposed Analysis and Results
Although clearly useful for certain applications such as web data caching, determination of hot memory addresses, string matching in DNA sequence analysis, and protection against DDOS attacks, there has been no previous detailed theoretical analysis of CBFs used for count thresholding in the open literature (previous papers have only dealt with CBFs used to permit deletions of elements in large data sets). Such an analysis is necessary in order to be able to predict the effectiveness of a CBF solution and the specific CBF parameters to use. For example, m, the number of CBF elements to be used, must be selected such that the resulting false positive probability level is acceptable for the chosen application. In addition, k, the number of hash functions to be used, must be selected to minimize the false positive probability. This type of analysis is provided in this section.
3.1. False Positive Probability
The analysis starts with a derivation of a close approximation for the false positive probability, which is necessary since the exact form given in Equation (2) involves a sum of binomial distributions, which is extremely difficult and time-consuming to compute for large n and m values. For large $N$ and small $p$, it is well known that a binomial distribution with $N$ trials and success probability $p$ can be approximated by a Poisson distribution with mean $Np$. For CBF applications, large n (data set size) and m (CBF size) values satisfy these conditions since $N = kn$ and $p = 1/m$. Thus, the approximate false positive probability can be written as follows.
$$\tilde{p}_{fp} = \left(1 - \sum_{l=0}^{\theta-1} \frac{e^{-kn/m} (kn/m)^{l}}{l!}\right)^{k}$$
The cumulative mass function (cmf) of a Poisson distribution is a regularized incomplete gamma function. Thus, the approximate false positive probability can be written as
$$\tilde{p}_{fp} = \left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta)}\right)^{k}, \tag{3}$$
where the mean of the Poisson distribution used is defined as $\kappa = kn/m$ and
$$\Gamma(\theta, \kappa) = \int_{\kappa}^{\infty} t^{\theta-1} e^{-t} \, dt.$$
As shown in Figure 1, this incomplete Gamma function approximation results in a highly accurate approximation of the exact false positive probability of Equation (2). Note that, for given k and θ, the approximation only depends on the ratio of n to m. Figure 1 shows that the exact and approximate false positive probabilities overlap almost 100%. The exact relative error of the approximation is shown in Figure 2. For the parameters shown, the relative error is less than 0.48% when an optimal number of hash functions is used.
The optimal k values, $k_{opt}$, for which the false positive probabilities are the lowest, are shown using a dashed cyan line in Figure 1. As can be seen in the figure, $k_{opt}$ is definitely not the same as, or even close to, the BF value $(m/n) \ln 2$ of Equation (1), shown as a solid vertical orange line in Figure 1, when θ > 1. Before proposing a systematic method for finding $k_{opt}$ for general values of θ, a rigorous analysis will be presented that shows that only one such value exists.
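The closeness of the Poisson (incomplete gamma) approximation to the exact binomial form can be checked numerically with a short sketch such as the following. The function names and the parameter values n = 10,000, m = 40,000, k = 4, and θ = 2 are assumptions chosen for this example and are not the values used for Figures 1 and 2. For integer θ, the regularized incomplete gamma function equals the Poisson cumulative sum used here.

```python
import math


def binomial_below(trials: int, p: float, theta: int) -> float:
    """P(X < theta) for X ~ Binomial(trials, p), built term by term."""
    term = math.exp(trials * math.log1p(-p))
    total = 0.0
    for l in range(theta):
        total += term
        term *= (trials - l) / (l + 1) * p / (1.0 - p)
    return total


def poisson_below(kappa: float, theta: int) -> float:
    """P(X < theta) for X ~ Poisson(kappa); equals Gamma(theta, kappa) / Gamma(theta)."""
    term = math.exp(-kappa)
    total = 0.0
    for l in range(theta):
        total += term
        term *= kappa / (l + 1)
    return total


def exact_false_positive(n: int, m: int, k: int, theta: int) -> float:
    return (1.0 - binomial_below(k * n, 1.0 / m, theta)) ** k


def approx_false_positive(n: int, m: int, k: int, theta: int) -> float:
    return (1.0 - poisson_below(k * n / m, theta)) ** k


n, m, k, theta = 10_000, 40_000, 4, 2
exact = exact_false_positive(n, m, k, theta)
approx = approx_false_positive(n, m, k, theta)
relative_error = abs(approx - exact) / exact
print(exact, approx, relative_error)
```

Because the approximation depends on n and m only through the ratio n/m, calling `approx_false_positive` with n = 10 and m = 40 gives the same value as with n = 10,000 and m = 40,000.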
3.2. Uniqueness of Optimal Number of Hash Functions
A sequence of lemmas is used to prove that there exists a unique number of hash functions for which the false positive probability is minimized. To follow this proof process, the reader is advised to refer to the plots in Figure 3 and Figure 4 when reading the following lemmas.
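Since the optimal false positive probability point occurs when its slope is 0, the proof starts by taking the derivative of the approximate false positive probability $\tilde{p}_{fp}$ of Equation (3) with respect to k. To find the shape of the derivative of $\tilde{p}_{fp}$, the logarithm of $\tilde{p}_{fp}$ can be used.
$$\ln \tilde{p}_{fp} = k \ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta)}\right), \qquad \kappa = \frac{kn}{m}. \tag{4}$$
By taking the derivative of Equation (4),
$$\frac{1}{\tilde{p}_{fp}} \frac{\partial \tilde{p}_{fp}}{\partial k} = \ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta)}\right) - \frac{k}{\Gamma(\theta) - \Gamma(\theta, \kappa)} \frac{\partial \Gamma(\theta, \kappa)}{\partial k}. \tag{5}$$
By Leibniz’s rule and the definition of the incomplete gamma function, $\Gamma(\theta, \kappa) = \int_{\kappa}^{\infty} t^{\theta-1} e^{-t} \, dt$,
$$\frac{\partial \Gamma(\theta, \kappa)}{\partial k} = -\frac{n}{m} \kappa^{\theta-1} e^{-\kappa}. \tag{6}$$
Therefore, by applying Equation (6) to the right side of Equation (5) and multiplying both sides of Equation (5) by $\tilde{p}_{fp}$,
$$\frac{\partial \tilde{p}_{fp}}{\partial k} = \tilde{p}_{fp} \left[ \ln\left(1 - \frac{\Gamma(\theta, \kappa)}{\Gamma(\theta)}\right) + \frac{\kappa^{\theta} e^{-\kappa}}{\Gamma(\theta) - \Gamma(\theta, \kappa)} \right].$$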
For , this derivative should be set to 0. Since , the second part must be 0. Then, multiplying this second part by a common factor and denoting this term as , the following equations and lemmas follow.
To determine whether g is a decreasing or increasing function, the derivative of g is needed.
In Equation (7), the first part is greater than 0. Thus, g is a decreasing or increasing function depending on the polarity of . Let and . The and terms are defined in this manner in order to facilitate the examination of the exact conditions under which g is a decreasing or increasing function, and thereby determine the conditions for the changes in slope of the false positive probability function. Then,Examples of the shapes of and are shown in Figure 3. An example of the function is shown in Figure 4.
For a fixed value of θ, is a strictly increasing concave function of κ.
Proof of Lemma 1.
Using the definition of an incomplete Gamma function, the first partial derivative of can be shown to be greater than zero. i.e.,
Then, using the second partial derivative, it can also be verified that is concave when .
Proof of Lemma 2.
Lemma 2 is equivalent to
From , by putting into the function,which is equivalent to
Then, by the Stirling inequality,
The function on the right hand side decreases from to 1 and increases from to ∞. The minimum value of this function is 1, which occurs at some point with . Thus, Therefore, □
, and is a strictly decreasing function for , where .
Proof of Lemma 3.
From L’Hopital’s rule, . This implies that . Therefore, .
As approaches 0 from the right, , and . This implies that . On the other hand, by Lemma 2, . Therefore, by the intermediate value theorem, there exists a point that satisfies . Then, since as , is a straight line, and Lemma 1 states that is a strictly increasing function of , for . Thus, for . □
The function is a strictly increasing function for , where .
Proof of Lemma 4.
From Lemma 2 again, . On the other hand, for all real positive values , , whereas . Thus, . Therefore, by the intermediate value theorem again, there is a point that satisfies . Finally, in the interval of , because in this interval. □
The function is a strictly decreasing function for , and .
Proof of Lemma 5.
By the definition of an incomplete gamma function, , the limit of as approaches positive infinity is 0. This makes the left term of become when approaches positive infinity. The right term becomes 0 by L’Hopital’s rule. Therefore, . From Lemma 1, is strictly concave, and from the proof of Lemma 4, . Thus, for , because . □
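Theorem 1. Given n, m and a count threshold θ, there exists a unique value $k_{opt}$ for which the false positive probability of a CBF has the minimum value. Furthermore, $k_{opt} = \lfloor \kappa_{opt} m / n \rfloor$ or $k_{opt} = \lceil \kappa_{opt} m / n \rceil$, where $\kappa_{opt}$ is the corresponding optimal value of $\kappa = kn/m$.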
Proof of Theorem 1.
In the interval of , due to Lemma 3. Also, because Lemma 5 states that is a strictly decreasing function from to ∞, at which point approaches zero. Then, due to Lemma 4, there is a unique value such that . Finally, using the definition of , the theorem follows. □
3.3. Procedure for Determining the Optimal Number of Hash Functions
Based on the lemmas and theorem of the previous subsection, a general procedure to be used to find the optimal number of hash functions is as follows. Start from and compute the false positive probability using Equation (3). Then increment k by one and recompute the false positive probability. Continue until the false positive probability starts to increase or , whichever comes first. The k value that results in the minimum is .
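The search procedure described above can be sketched as follows. The sketch assumes that the search starts at k = 1 and uses the Poisson-sum form of the approximate false positive probability; the function names and the safety cap `k_max` are hypothetical and stand in for the stopping bound of the procedure. Stopping at the first increase relies on the uniqueness of the minimum established in Section 3.2.

```python
import math


def approx_false_positive(n: int, m: int, k: int, theta: int) -> float:
    """(1 - P(Poisson(kn/m) < theta))^k, the approximate false positive probability."""
    kappa = k * n / m
    term = math.exp(-kappa)
    below = 0.0
    for l in range(theta):
        below += term
        term *= kappa / (l + 1)
    return (1.0 - below) ** k


def find_optimal_k(n: int, m: int, theta: int, k_max: int = 1000):
    """Increase k from 1 until the false positive probability stops decreasing."""
    best_k = 1
    best_p = approx_false_positive(n, m, 1, theta)
    for k in range(2, k_max + 1):
        p = approx_false_positive(n, m, k, theta)
        if p >= best_p:
            break  # the minimum is unique, so the first increase ends the search
        best_k, best_p = k, p
    return best_k, best_p


for theta in (1, 2, 3):
    print(theta, find_optimal_k(n=1000, m=4000, theta=theta))
```

For m/n = 4, this search returns k = 3 for θ = 1, which agrees with rounding $(m/n) \ln 2 \approx 2.77$, and k = 4 for θ = 2.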
The above procedure can be simplified by using precomputed tables or a linear approximation. Table 1 shows a table of precomputed optimal $\kappa_{opt}$ values for count thresholds θ ranging from 1 to 30. This table was created by following the procedure outlined above. Since $\kappa = kn/m$, this table can be used to determine $k_{opt}$ by simply using the relationship shown in Theorem 1; i.e., $k_{opt}$ is either the floor or ceiling of $\kappa_{opt} m / n$.
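The floor-or-ceiling relationship can be applied as in the sketch below, which takes a precomputed optimal Poisson mean as input and evaluates the two integer candidates. The function names and the parameter name `kappa_opt` are hypothetical, and the values of Table 1 are not reproduced here; the usage example instead uses the classical BF case θ = 1, for which Equation (1) implies an optimal mean of ln 2.

```python
import math


def approx_false_positive(n: int, m: int, k: int, theta: int) -> float:
    """(1 - P(Poisson(kn/m) < theta))^k, the approximate false positive probability."""
    kappa = k * n / m
    term = math.exp(-kappa)
    below = 0.0
    for l in range(theta):
        below += term
        term *= kappa / (l + 1)
    return (1.0 - below) ** k


def k_from_kappa(kappa_opt: float, n: int, m: int, theta: int) -> int:
    """Pick the better of floor and ceiling of kappa_opt * m / n (at least 1)."""
    x = kappa_opt * m / n
    candidates = sorted({max(1, math.floor(x)), max(1, math.ceil(x))})
    return min(candidates, key=lambda k: approx_false_positive(n, m, k, theta))


# theta = 1 is the traditional BF, whose optimal Poisson mean is ln 2.
print(k_from_kappa(math.log(2), n=1000, m=4000, theta=1))  # 3
```

Only two evaluations of the false positive probability are needed in this case, instead of a search over all k values.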
For large count thresholds , a straight line approximation can be used to determine , and thereby . Figure 5 shows that the straight line approximationclosely tracks for large values. By plotting the relative errors, as shown in Figure 6, it can be seen that there is a relative error of less than about 2 percent when .
4. Simulation Results
Simulations were conducted to verify the proposed theoretical analysis. A simulation program was written in Java for a general CBF with n data entries, m CBF elements, and k hash functions. The hash functions were created as uniform random distributions between 0 and $m - 1$ using the pseudorandom number generator provided in the java.util.Random package and stored in tables so that they could be reused during hashing. Care was taken to ensure that the hash functions created were orthogonal to each other. This simulator program is freely available for all interested readers.
Figure 7 shows the false positive simulation results obtained with an example set of n, m, and count threshold values. The open triangle, circle, square, and diamond marks refer to the simulation results while the solid curves show the expected false positive probabilities.
Each simulation result, which was the average of 100 simulation runs, was obtained in the following manner. The CBF was initialized by setting all m CBF entries to 0. Then, n data entries were generated randomly. For each data entry, the k hash functions, implemented by looking up table values from precomputed random hashes (as described above), were applied and used to increment the CBF entries corresponding to the hash function outputs. Next, queries were randomly generated and the k hash functions were applied to each of those queries. Each query resulted in a “true” answer if all k CBF elements mapped by the k hash functions were greater than or equal to the count threshold θ. Finally, the number of false positives generated in this manner was counted and divided by the number of queries to produce the false positive probability.
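The simulation steps described above can be imitated at small scale with the following sketch. It is not the Java simulator mentioned in the text: instead of precomputed hash tables, each hash output is modeled directly as a uniformly random location, which is an assumption of this example, and a single run with a fixed seed replaces the average of 100 runs. The function name and the parameter values are likewise hypothetical.

```python
import math
import random


def simulate_false_positive_rate(n: int, m: int, k: int, theta: int,
                                 queries: int, seed: int = 0) -> float:
    """Fill a CBF with n random entries, then query random non-member entries."""
    rng = random.Random(seed)
    counters = [0] * m
    # Each inserted entry increments k uniformly random counters.
    for _ in range(n):
        for _ in range(k):
            counters[rng.randrange(m)] += 1
    # Each query checks k uniformly random counters against the threshold.
    false_positives = 0
    for _ in range(queries):
        if all(counters[rng.randrange(m)] >= theta for _ in range(k)):
            false_positives += 1
    return false_positives / queries


n, m, k, theta = 5000, 20000, 4, 2
simulated = simulate_false_positive_rate(n, m, k, theta, queries=20000)

# Approximate theoretical value: (1 - e^{-kappa} (1 + kappa))^k for theta = 2.
kappa = k * n / m
theory = (1.0 - math.exp(-kappa) * (1.0 + kappa)) ** k
print(simulated, theory)
```

With these parameter values the ratio m/n is four, as in Figure 7, and the simulated rate is expected to fluctuate around the theoretical value from run to run, more noticeably when the false positive probability is very low.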
Figure 7 shows the false positive rate simulation and analysis results, as a function of k, with million, million, and values ranging from 1 to 5. As can be seen from the figure, the simulation results closely map the theoretical results, with slight variations only visible for exceedingly low false probabilities of or smaller. Exceedingly low false probabilities imply rare occurrences of false positives, thus requiring longer simulation runs to obtain accurate results. Even lower false positive probabilities (smaller than ) then result in zero occurrences of false positive events in our simulations. Thus, simulation results were not recorded, since false positive events did not occur, for and through 10 in Figure 7.
5. Conclusions
This paper has investigated the problem of determining the optimal parameter values to be used for counting bloom filters used in applications requiring approximate count thresholds. Rigorous analysis has led to a highly accurate equation for the false positive probability, with relative errors of less than 0.48% given typical parameter values. It has also been proven that there exists a unique number of hash functions for which a minimal false positive probability is obtained. Next, a systematic procedure based on precomputed tables and a linear approximation has been presented for finding the optimal number of hash functions. Finally, realistic simulations modeling the use of a CBF for a count thresholding application have been used to show that the theoretical analysis closely models actual CBF behavior.
Author Contributions
Conceptualization, K.K. and S.L.; methodology, K.K. and Y.J.; software, K.K.; validation, K.K.; formal analysis, K.K.; investigation, K.K. and S.L.; resources, K.K., Y.J. and S.L.; data curation, K.K. and Y.J.; writing—original draft preparation, K.K. and S.L.; writing—review and editing, K.K., Y.L. and S.L.; visualization, K.K.; supervision, Y.L. and S.L.; project administration, S.L.; funding acquisition, S.L.
Funding
This research was funded by Samsung Electronics, Samsung Research Funding and Incubation Center of Samsung Electronics under Project Number SRFC-TB1703-07.
Conflicts of Interest
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
|CBF||Counting bloom filter|
|pmf||Probability mass function|
|cmf||Cumulative mass function|
References
1. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426.
2. Guo, D.; Liu, Y.; Li, X.; Yang, P. False negative problem of counting bloom filter. IEEE Trans. Knowl. Data Eng. 2010, 22, 651–664.
3. Ghosh, M.; Ozer, E.; Ford, S.; Biles, S.; Lee, H.H.S. Way Guard: A segmented counting Bloom filter approach to reducing energy for set-associative caches. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), San Francisco, CA, USA, 19–21 August 2009; pp. 165–170.
4. Yun, J.; Lee, S.; Yoo, S. Dynamic wear leveling for phase-change memories with endurance variations. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 1604–1615.
5. Maggs, B.M.; Sitaraman, R.K. Algorithmic nuggets in content delivery. ACM SIGCOMM Comput. Commun. Rev. 2015, 45, 52–66.
6. Lu, Y.; Montanari, A.; Prabhakar, B.; Dharmapurikar, S.; Kabbani, A. Counter braids: A novel counter architecture for per-flow measurement. ACM SIGMETRICS Perform. Eval. Rev. 2008, 36, 121–132.
7. Bonomi, F.; Mitzenmacher, M.; Panigrahy, R.; Singh, S.; Varghese, G. Beyond bloom filters: From approximate membership checks to approximate state machines. In Proceedings of the ACM SIGCOMM 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Pisa, Italy, 11–15 September 2006; Volume 36, pp. 315–326.
8. Dharmapurikar, S.; Krishnamurthy, P.; Taylor, D.E. Longest prefix matching using bloom filters. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Karlsruhe, Germany, 25–29 August 2003; pp. 201–212.
9. Song, H.; Hao, F.; Kodialam, M.; Lakshman, T. Ipv6 lookups using distributed and load balanced bloom filters for 100gbps core router line cards. In Proceedings of the IEEE INFOCOM 2009, Rio de Janeiro, Brazil, 19–25 April 2009; pp. 2518–2526.
10. Ficara, D.; Di Pietro, A.; Giordano, S.; Procissi, G.; Vitucci, F. Enhancing counting Bloom filters through Huffman-coded multilayer structures. IEEE/ACM Trans. Netw. (TON) 2010, 18, 1977–1987.
11. Fan, L.; Cao, P.; Almeida, J.; Broder, A.Z. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. (TON) 2000, 8, 281–293.
12. Chazelle, B.; Kilian, J.; Rubinfeld, R.; Tal, A. The Bloomier filter: An efficient data structure for static support lookup tables. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, New Orleans, LA, USA, 11–14 January 2004; pp. 30–39.
13. Moraru, I.; Andersen, D.G. Exact pattern matching with feed-forward bloom filters. J. Exp. Algorithm. (JEA) 2012, 17, 3–4.
14. Ho, J.T.L.; Lemieux, G.G. PERG: A scalable FPGA-based pattern-matching engine with consolidated bloomier filters. In Proceedings of the 2008 IEEE International Conference on Field-Programmable Technology, Taipei, Taiwan, 2008; pp. 73–80.
15. Melsted, P.; Pritchard, J.K. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 2011, 12, 333.
16. Sun, C.; Fan, J.; Shi, L.; Liu, B. A novel router-based scheme to mitigate SYN flooding DDoS attacks. IEEE INFOCOM (Student Poster). 2007. Available online: https://pdfs.semanticscholar.org/fdae/7b20d220a1c23f9f6c0f8464574f78ef55c0.pdf (accessed on 11 July 2019).
17. Tarkoma, S.; Rothenberg, C.E.; Lagerspetz, E. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Commun. Surv. Tutor. 2012, 14, 131–155.
18. Broder, A.; Mitzenmacher, M. Network Applications of Bloom Filters: A Survey. Internet Math. 2003, 1, 485–509.
19. Mitzenmacher, M.; Upfal, E. Probability and Computing: Randomized Algorithms and Probabilistic Analysis; Cambridge University Press: Cambridge, UK, 2005.
20. Papoulis, A.; Pillai, S.U. Probability, Random Variables, and Stochastic Processes; Tata McGraw-Hill Education: Pennsylvania Plaza, NY, USA, 2002; pp. 55–57.
21. Abramowitz, M.; Stegun, I. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables; Applied Mathematics Series; National Bureau of Standards: Gaithersburg, MD, USA, 1964; pp. 260–265.
22. Klar, B. Bounds on tail probabilities of discrete distributions. Probab. Eng. Inf. Sci. 2000, 14, 161–171.
Figure 1. Plot of false positive probabilities with to vs. the number of hash functions k. The ratio is set to four and the cyan dashed line shows . The sample points correspond to and the lines correspond to . The two functions overlap almost 100 percent.
Figure 2. Plot of relative error between exact and approximate false positive probabilities with to vs. the number of hash functions k. and are used in this plot, and larger values of n and m result in slightly lower relative errors.
Figure 3. A plot showing and , used in Lemmas 1 and 2, as a function of for .
Figure 4. as a function of for and . , , and .
Figure 5. The functions and as a function of .
Figure 6. Plot of relative error between and vs. .
Figure 7. Plot of false positive theoretical values () and simulation results () where n = 10,000,000, m = 40,000,000, , and vs. k between 1 to 20.
Table 1. The , , and relative error between and values.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).