Article

Content-Based Approach for Improving Bloom Filter Efficiency

by Mohammed Alsuhaibani 1, Rehan Ullah Khan 2, Ali Mustafa Qamar 1,* and Suliman A. Alsuhibany 1

1 Department of Computer Science, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
2 Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7922; https://doi.org/10.3390/app13137922
Submission received: 2 May 2023 / Revised: 26 June 2023 / Accepted: 29 June 2023 / Published: 6 July 2023

Abstract
Bloom filters are a space-efficient, probabilistic data structure used to test whether an element is a member of a set. They are commonly employed in various applications, such as network routers, web browsers, and databases. These filters trade accuracy for space by allowing a fixed probability of incorrectly identifying an element as a member of the set, known as the false positive rate (FPR). However, traditional bloom filters suffer from a high FPR and extensive memory usage, which can lead to incorrect query results and slow performance. This study indicates that a content-based strategy can be a practical solution to these challenges. Specifically, our approach requires less bloom filter storage, consequently decreasing the probability of false positives. The effect of several hash functions on the strategy's performance was also evaluated. Experimental evaluations demonstrated that the proposed strategy can decrease false positives by a substantial margin of up to 79.83%. The use of size-based content bits also contributes significantly to the decrease in false positives, whereas increasing the number of content bits has no considerable impact on the processing time. Moreover, the evidence suggests that applying the proposed approach with as few as six hash functions already yields a more than 50% decrease in false positives.

1. Introduction

Big data, which refers to the increasing speed, diversity, and quantity of data being generated and gathered in today’s digital world, is a commonly used concept within the areas of computer science and data management [1,2,3]. These data are derived from an array of sources, encompassing social media platforms, Internet of Things (IoT) devices, various sensors, and numerous forms of digital documentation. Furthermore, they can offer critical insights and guide decision-making processes across a multitude of scenarios.
Nonetheless, the size and intricacy of big data present considerable difficulties when it comes to management and analysis. Traditional systems designed for data processing and storage may falter under the immense variety and volume of data. Even with distributed systems, the resources and time demanded for processing and analysis could be daunting. Hence, efficient and effective strategies for big data processing and analysis are a dire necessity.
A bloom filter (BF) offers a feasible answer to this issue. As a probabilistic data structure, a BF is capable of confirming whether an element belongs to a specific set. First introduced by Burton Bloom in 1970 [4], the BF has since become a widely used tool in data management and computer science.
Fundamentally, a BF is a condensed data structure that enables efficient membership testing, that is, ascertaining whether a particular element is part of a set [5]. This is achieved by using hash functions to map elements onto a bit array of preset size, setting the corresponding bits to "1" for every element added to the set. When checking for membership, the bloom filter examines the related bits in the array. If all of these bits are set to "1", the filter delivers a "possibly in the set" response. Conversely, if any of the bits is "0", a "definitely not in the set" response is given.
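To make these mechanics concrete, the following is a minimal illustrative sketch of a standard bloom filter in MATLAB. This is our own example rather than the implementation evaluated in this paper; the salted djb2-style hash in probe is a stand-in for any family of k hash functions.

    function demo_bloom_filter()
        m = 500;                     % size of the bit array
        k = 3;                       % number of hash functions
        bf = false(1, m);            % all bits start at 0
        bf = add_item(bf, 'example', k, m);
        disp(query_item(bf, 'example', k, m));   % 1: possibly in the set
        disp(query_item(bf, 'missing', k, m));   % usually 0: definitely not in the set
    end

    function bf = add_item(bf, item, k, m)
        for j = 1:k
            bf(probe(item, j, m)) = true;        % set the bit for each hash
        end
    end

    function tf = query_item(bf, item, k, m)
        tf = true;
        for j = 1:k
            tf = tf && bf(probe(item, j, m));    % any 0 bit means "not in set"
        end
    end

    function pos = probe(item, j, m)
        % j-th hash function: salt the string with j, then fold it into [1, m].
        str = double([item num2str(j)]);
        h = m;
        for i = 1:numel(str)
            h = mod(h * 33 + str(i), m) + 1;
        end
        pos = h;
    end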
One of the key advantages of BFs is their space efficiency, as they can represent a large set of elements using a relatively small amount of memory. This makes them particularly useful for big data applications, where the volume of data can be overwhelming and efficient use of space is crucial. In addition, BFs can be used for other purposes in the context of big data, such as deduplication and data compression [6].
It is important, however, to note that BFs are probabilistic in nature, which means that they may produce false positives (i.e., returning a “possibly in set” result for an element that is not actually in the set) but never produce false negatives. This trade-off is a fundamental property of BFs, and the false positive rate (FPR) can be controlled by adjusting the size of the bit array and the number of hash functions used.
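For reference, under the usual independence assumptions, a filter with m bits, k hash functions, and n inserted elements has an FPR of approximately (1 − e^(−kn/m))^k, which is minimized at roughly k = (m/n)·ln 2 hash functions. A short MATLAB sketch of this calculation (the parameter values here are arbitrary examples):

    m = 500;                                  % bits in the filter
    n = 50;                                   % elements inserted
    k = 3;                                    % hash functions
    fpr = (1 - exp(-k * n / m))^k;            % approx. 0.0174
    k_opt = max(1, round((m / n) * log(2)));  % 7 hash functions minimize the FPR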
In fact, Khan et al. [7] put forward the idea of check bits, which, from an implementation standpoint, involves obtaining the binary representation of the content value to be stored in the BF. A subset of bits from the content value, known as check bits, is selected. These check bits are stored in a distinct array, which references the same location as the BF. The content value is then stored in the BF by utilizing hash functions. Prior to retrieving data from the BF, the reverse sequence of these steps is performed to verify the accuracy of the retrieved data. To mitigate false positives, the check bits play a crucial role in the retrieval process by ensuring that it is not solely dependent on the hash output.
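As an illustration of how such a check-bit scheme could be realized (our own reading of [7], with a toy single hash function and hypothetical variable names, not the authors' code):

    m = 500;
    bf = false(1, m);
    checks = cell(1, m);                % parallel array referencing the same locations

    item = 'hello';
    binary = reshape(dec2bin(double(item), 8)', 1, []);   % binary representation
    check_bits = binary(1:2);           % a small subset of bits, e.g., the first two

    pos = mod(sum(double(item)) * 33, m) + 1;   % toy hash position
    bf(pos) = true;                     % store in the BF
    checks{pos} = check_bits;           % store the check bits alongside

    % Retrieval: the probed bit must be set AND the stored check bits must match,
    % so two values that collide on the hash can still be told apart.
    q = reshape(dec2bin(double(item), 8)', 1, []);
    hit = bf(pos) && strcmp(checks{pos}, q(1:2));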
However, despite the effectiveness of the check-bits proposal and its ability to reduce false positives by almost 50%, it does not utilize the content of the data itself for the check bits, which, as we show later in this paper, is empirically very helpful. Therefore, this paper proposes a content-based approach that reduces FPRs while using a smaller amount of data storage space in the BFs. In order to evaluate the proposed approach, a number of experimental studies are carried out. The results show the proposed approach's effectiveness in improving the accuracy and efficiency of membership testing.
The contributions of the paper can be summarized as follows:
  • We propose a content-based concept that solves the main problems with the traditional BF. These problems are as follows: the BF can yield a considerable number of false positives, leading to incorrect content being returned when queried; the large size of the BF can hamper the speed of querying data; and the BF could consume a substantial amount of memory, which becomes a concern when dealing with large-scale applications or systems where efficient memory utilization is crucial.
  • We provide a comprehensive evaluation of the content-based BF and reduce the false positives by using a smaller amount of data storage space in the BF paradigm.
  • We evaluate the execution time and the role of hash functions in reducing false positives.
  • The experimental evaluation provides valuable insight into the application of content-based BFs.
The remaining sections of this paper are structured as follows: Section 2 provides a comprehensive overview of the background and related works. Section 3 presents the proposed approach in detail. Section 4 is dedicated to explaining the conducted experiments and sharing the outcomes. A discussion on these outcomes takes place in Section 5. The conclusion of this paper can be found in Section 6.

2. Background and Related Work

Different categories of BFs have been organized according to their respective applications. Nevertheless, this research primarily concentrates on presenting the concept of content-based content bits and examining their implementation in network security. Therefore, we will limit our review to specific BF categories, such as standard, compressed, dynamic, generalized, hierarchical, and space-code BFs, paying special attention to their role in network security.
This study employs the standard BF (SBF) as a means to introduce the concept of content-based content bits. In an SBF, a set of elements S is represented using a membership function f : S → {0, 1}. The SBF allows for the encoding of arbitrary functions, enabling the mapping of values to a subset of elements. This approach ensures an efficient utilization of storage space. Moreover, these functions can be dynamically updated without interfering with the function itself.
The compressed BF (ComBF) [8], besides the three primary factors k, m, and n, has an additional component called the transmission size z. This size aims to minimize the data needing transmission over a network, thereby significantly reducing the bandwidth usage, though it incurs increased memory requirements. The SBF, however, is only suitable for static sets of known size. To bypass this limitation, Guo et al. [9] proposed the dynamic BF (DBF), a dynamic set constructed using an active n × m bit matrix. It operates by initially activating a standard BF and subsequently launching a new one as the FPR begins to escalate. The new SBF replaces the older one upon activation, deactivating previously active filters and leaving only the latest one active. A further limitation of the SBF is that its FPR cannot be upper-bounded, potentially opening a window for security risks. Laufer et al. [10] addressed this by introducing the generalized BF (GBF), which imposes an upper limit on the FPR by adjusting the filter bits and resetting them via hash functions. The GBF has been shown to perform strongly for security purposes.
The hierarchical BF (HBF) [11] utilizes a two-level array system for locating and mapping files within a metadata servers group. The first array is used to distribute the metadata server, which significantly reduces memory overhead. Meanwhile, the second one functions as a cache, storing partial distribution information of the metadata server. This method shows an improved efficiency and performance. Kumar et al. [12] proposed the space-code BF (SCBF) for approximating per-flow traffic measurements. Specifically, the SCBF employs multiple SBFs to approximate the representation of a multiset. This allows the SCBF to efficiently answer queries.
It is important to note the application of the previously discussed BFs in the realm of network security, including authentication, tracebacking, detecting node replication, privacy preservation, and addressing SYN flooding [13]. The SBF is relevant to all of the network security areas mentioned previously, which makes it a suitable choice for the content-based approach presented here. However, the ComBF, GBF, HBF, and SCBF are only used for tracebacking, and, similarly, the DBF is used only for detecting node replication. Alsuhibany et al. [13] proposed the use of BFs in analyzing big data security. Specifically, the counting BF was presented as a space- and time-efficient method for analyzing big data security using a smaller dataset. However, this concept was only demonstrated using a small dataset and the counting BF to justify its feasibility.
The security of bloom filters (BFs) has recently been called into question due to the discovery of cryptanalysis attacks that can be used to re-identify sensitive attribute values [14]. These attacks exploit the fundamental principle of hashing elements of sets into bit positions. This principle can be exploited to re-identify sensitive encoded values, irrespective of the specific encoding function being used. To address this issue, Christen et al. [14] developed the hashing BF method as a solution.
Patgiri et al. [1] presented a comprehensive study to explore the potential utilization of BFs in the field of big data research. Their investigation primarily targeted the application of a specialized variant of the BF known as the fingerprint BF. It specifically serves the purpose of fingerprint-matching tasks. Additionally, the authors introduced the multidimensional BF, which exhibits a remarkable proficiency in handling arrays with dimensions extending beyond the conventional 2D and 3D domains.
Reviriego et al. [15] introduced an adaptive one-memory-access BF that minimizes false positives for elements repeated across different queries, achieving an FPR of less than 5% for networking applications while using minimal memory. Another study, by Abdennebi and Kaya [16], surveyed various types of BFs across different domains, focusing in particular on one-hashing BFs. Wu et al. [17] introduced the concept of an elastic BF, which allows for both deletions and expansions without increasing the FPR or decreasing query speed. The approach utilizes elastic fingerprints, which can expand or shrink.
To the best of our understanding, none of the previously mentioned studies have primarily focused on resolving the problem of the FPR. Therefore, this paper presents a new method utilizing content to reduce false positives, which will be discussed in Section 3. It is also worth mentioning that this study furthers the discourse and investigations presented in papers [7,13], serving as a direct sequel to the concepts and research introduced there.

3. Proposed Approach

We propose a content-based solution that takes advantage of content bits to reduce the false positives in BFs. Many solutions use a large BF to reduce the false positives. In contrast, we keep the BF small and reduce the false positives by introducing just a few extra bits (content bits). These bits are drawn from the content value of the data and can therefore differentiate between two content values that produce the same hash output. To make the methodology easy to follow, we proceed as follows:
The problem with the traditional BF is two-fold:
  • First, a significant number of false positives, leading to incorrect content being returned when queried.
  • Second, the large size of the BF, which raises two issues: slow querying and high memory usage.
To counter the aforementioned problems, we propose an approach founded on content bits. The proposed content-based approach for processing text data is a multistep process that includes several key elements, described next (a MATLAB sketch of the full pipeline follows the list):
  • Load and extract the text data, where the raw text data are loaded from the source and extracted for further processing.
  • Remove the punctuation from the text data. This is performed to remove any unnecessary characters that might affect the processing of text data.
  • Convert the data to lowercase. This ensures that all the text data are in the same case and improves the consistency of processing.
  • Tokenize the text. This involves breaking down the textual data into individual words or tokens for further analysis.
  • Remove the stop words. Stop words are commonly used words such as “and” or “the” that do not contain much semantic value and can be removed to reduce the size of the text data and improve the efficiency of processing.
  • Remove words having two or fewer characters. This step is to make sure that the processed data contain only meaningful words.
  • Normalize the words. This step is to make sure that the words are in their base form and to remove any variations that might affect the processing of text data. For example, if the word “running” is present in the text data it will be normalized to its base form “run”.
  • Generate processed sentences (content). This step involves creating meaningful sentences from the processed text data that can be used for further analysis.
  • Prior to storing the content in a BF, acquire its binary representation, i.e., encode the data as a series of binary digits (zeroes and ones).
  • Take a few bits of the content (2 or 3), which are called content bits, and store them in a separate array, which points to the same location as the BF.
  • Generate hash values for the content. The hash values are generated using the following function:

    function hash = string2hash(str, maxSize)
        % Map each row of characters in str to a hash value in [1, maxSize].
        str = double(str);                        % character codes
        hash = maxSize * ones(size(str, 1), 1);   % one initial hash per row
        for i = 1:size(str, 2)
            % djb2-style update: multiply by 33 and add the next character.
            hash = mod(hash * 33 + str(:, i), maxSize) + 1;
        end
    end
  • Store the content value in the BF.
  • Before retrieval, the BF and the content bits are checked against the same content. This reverse process ensures that, even if the same hash function output is obtained for different content, the content bits still provide uniqueness, thus reducing the dependency on the hash functions alone and reducing the chance of false positives by approximately 50%.
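Putting these steps together, the following is a compact MATLAB sketch of the pipeline (our own illustrative rendering; the abbreviated stop-word list and the "ing"-stripping rule are simplified stand-ins for a full stop-word dictionary and stemmer):

    function [content, content_bits] = prepare_content(text, n_bits, offset)
        % Steps 1-8: clean, tokenize, filter, and normalize the text.
        text = lower(regexprep(text, '[^\w\s]', ''));      % erase punctuation
        tokens = strsplit(strtrim(text));
        stop_words = {'and', 'the', 'is', 'of', 'to', 'in', 'were'};
        keep = ~ismember(tokens, stop_words) & cellfun(@length, tokens) > 2;
        tokens = tokens(keep);
        tokens = regexprep(tokens, 'ing$', '');            % crude normalization
        content = strjoin(tokens, ' ');
        % Steps 9-10: binary representation, then a few content bits.
        binary = reshape(dec2bin(double(content), 8)', 1, []);
        content_bits = binary(offset + 1 : offset + n_bits);
    end

Called as prepare_content('The runners were running in the park!', 2, 1), this returns the processed content 'runners runn park' together with the content bits '11' (the second and third bits of the leading character 'r' = 01110010).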
The overall steps of the proposed methodology are also shown and summarized in Figure 1. Furthermore, the pseudocode of the algorithm is given in Algorithm 1.
Algorithm 1 Proposed method pseudocode

1: procedure Main
2:     Text ← Read Input
3:     Processed_Text ← Content_Analysis(Text)
4:     Convert Processed_Text to binary form
5:     Choose an appropriate number of hash functions and determine the offset of content bits
6:     Set acceptable_results to false
7:     while acceptable_results are not obtained do
8:         content_bits_position ← finalize_content_bits_position()
9:         number_of_content_bits ← select_number_of_content_bits()
10:        size_of_bloom_filter ← select_size_of_bloom_filter()
11:    end while
12: end procedure
13:
14: procedure finalize_content_bits_position
15:     Implement logic to determine the final position or placement of the content bits
16:     return the selected position (left, middle, or right)
17: end procedure
18:
19: procedure select_number_of_content_bits
20:     Implement logic to select the number of content bits
21:     return the selected number of content bits
22: end procedure
23:
24: procedure select_size_of_bloom_filter
25:     Implement logic to select the size of the bloom filter
26:     return the selected size of the bloom filter
27: end procedure
28:
29: procedure Content_Analysis(var)
30:     Erase punctuation
31:     Convert the text data to lowercase
32:     Tokenize the text
33:     Remove a list of stop words
34:     Remove words with two or fewer characters
35:     Normalize the words
36:     Generate processed sentences (content)
37:     return processed sentences
38: end procedure

4. Experiments and Results

In this section, we present the results of our experiments on the effect of content bits and size-based content bits, along with an analysis of the time taken and the impact of different hash functions on performance. Furthermore, the impact of varying the number of hash functions on the occurrence of false positives is highlighted.

4.1. Effect of Content Bits

Here, we describe the influence of content bits on the quantity of false positives.
First, we discuss the parameter values, which follow the settings in [7] and are shown in Table 1. The content bits are taken from the binary form of the content; starting from the offset, the position is decremented by 1 as more content bits are required. The offset for the content bits is therefore kept at 1, and it only comes into play as more content bits are required for analyzing the number of false positives. Furthermore, the size of the BF is kept at 500, and only three hash functions are used. Experiments in which the number of hash functions is varied are presented later.
Figure 2 shows the reduction in false positives as more and more content bits are utilized. The content bits are shown along the x-axis and the false positives appear along the y-axis. In total, 1061 false positives are observed without using any content bits. Employing just two content bits results in a reduction of 54.38% (484 vs. 1061) in the false positives. Similarly, a 65.03% (371 vs. 1061) reduction in the false positives is seen when using three content bits. Moreover, a 69.84% (320 vs. 1061) reduction is observed when using four bits. This reduction trend continues as more content bits are added. The maximum reduction of 79.83% (214 vs. 1061) is observed once seven content bits are employed.
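As a quick sanity check of these percentages, each reduction is computed relative to the 1061-count baseline as (1061 − x)/1061; a one-line MATLAB verification:

    baseline = 1061;
    fp = [484, 371, 320, 214];                      % false positives at 2, 3, 4, and 7 content bits
    reduction = (baseline - fp) / baseline * 100    % 54.38  65.03  69.84  79.83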

4.2. Impact of Size-Based Content Bits

In this subsection, we show the impact of size-based content bits on the number of false positives. The experimental parameters used in this set of experiments are identical to those specified in Table 1.
The impact is depicted in Figure 3. It is important to mention that the size experiments involve only the processed form of the text, due to empty spaces in the non-processed text. The number of false positives with standard content bits is 473. Using a size-based approach with two content bits decreases the false positives by just one, a minimal decrease of 0.21%. Nevertheless, the difference increases slightly (a 2.59% reduction with size-based bits) once the content bits are increased to three. The maximum difference of 5.2% is observed once the content bits are increased to five. In conclusion, it is evident that the utilization of size-based content bits effectively reduces the occurrence of false positives.

4.3. Time Analysis

In this subsection, we discuss the trend observed for the extra time required by the content bits. Figure 4 shows the variation in the time (in seconds) as more content bits are utilized. The required time without employing any content bits is just 0.445 s. At least double this time is required if any content bits are used, as shown by the blue bars. Interestingly, there is no significant increase in processing time when the content bits are increased from two to seven. Moreover, the time required by seven bits is slightly less (by 0.03 s) than that required by six bits.

4.4. Hash Functions’ Variation

Here, we discuss the impact of using different numbers of hash functions on the number of false positives. The parameters of this set of experiments are the same as described in Table 1.
The impact is shown in Figure 5. We varied the number of hash functions from 1 to 25. A very interesting behavior is observed in the frequency of false positives. In the beginning, the false positives dropped from 544 to 408 as the hash functions were increased from one to two, representing a decrease of 25%. However, only a 14.7% decrease in false positives is seen when three hash functions are employed. This difference is further reduced to 13.2% and 6.3% as the number of hash functions is increased to four and five, respectively. In short, a more than 50% decrease in the number of false positives is observed by using just six hash functions.
A very minimal decrease (no more than five) in the false positives is seen as the hash functions are increased beyond 11. This analysis shows that increasing the number of hash functions beyond a certain point is not very beneficial in reducing the false positives. Moreover, using too many hash functions would introduce a significant computational overhead.

5. Discussion

The results of experimental studies demonstrated the effectiveness of our content-based approach in improving the accuracy and efficiency of membership testing in bloom filters. More specifically, we are able to effectively decrease the occurrence of false positives by more than 79% while using just seven content bits.
Interestingly, a steady and gradual decrease in false positives is observed as the content bits are increased from two to seven. Furthermore, the use of size-based content bits proved helpful in further reducing false positives; enhancing this reduction in FPRs would be worthwhile future work.
Regarding the time required by the content bits, only a minimal increase is observed as content bits are added; the increase remains below one second as the content bits are varied from two to seven. Although there may be a direct correlation between the number of content bits and the required time, no substantial increase was observed, so investigating this relationship further could be of interest.
Lastly, a gradual decrease in false positives is observed as the number of hash functions is increased. This decrease becomes less pronounced once more than eight hash functions are used. Thus, although the FPR falls as the number of hash functions grows, increasing the number of hash functions beyond a certain point is not very beneficial in reducing the FPR.
It is worth noting that the proposed approach holds significant potential for practical applications in various domains, such as network routers, web browsers, and databases. As these applications require efficient and accurate membership testing, the proposed approach could potentially be integrated into these systems, resulting in a significant improvement in their performance.
Furthermore, this research contributes to the field of BFs and opens up new opportunities for future research. For instance, further investigation could be carried out on different variations of the proposed approach or on the integration of the proposed approach into other types of data structures.

6. Conclusions and Outlook

This research has presented a comprehensive examination of the challenges associated with traditional BFs and proposed a novel, content-based approach to addressing these issues. The utilization of a smaller amount of data storage space in the filter has been identified as the key factor in reducing the FPR, resulting in a significant improvement in the accuracy of membership testing. Additionally, the evaluation of different hash functions has played a crucial role in optimizing the performance of our approach. The experimental results obtained from the evaluation of our proposed approach demonstrate its effectiveness and superiority in comparison to traditional BFs in terms of the accuracy and efficiency of membership testing.
In summary, this research has successfully proposed a novel, content-based approach that addresses the challenges of traditional BFs and demonstrates its effectiveness through experimental evaluation. The proposed approach holds promise for practical applications in various domains and opens up new opportunities for future research in the field of BFs.

Author Contributions

Conceptualization, R.U.K.; methodology, M.A., R.U.K. and A.M.Q.; software, R.U.K.; validation, M.A., R.U.K., A.M.Q. and S.A.A.; formal analysis, M.A., R.U.K., A.M.Q. and S.A.A.; investigation, M.A., R.U.K., A.M.Q. and S.A.A.; resources, M.A., R.U.K., A.M.Q. and S.A.A.; data curation, R.U.K.; writing—original draft preparation, M.A., R.U.K., A.M.Q. and S.A.A.; writing—review and editing, M.A., R.U.K. and A.M.Q.; visualization, M.A., R.U.K. and A.M.Q.; supervision, S.A.A.; project administration, S.A.A.; funding acquisition, S.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The researchers would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Patgiri, R.; Nayak, S.; Borgohain, S.K. Role of Bloom Filter in Big Data Research: A Survey. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 655–661.
  2. Hua, W.; Gao, Y.; Lyu, M.; Xie, P. Research on Bloom filter: A survey. J. Comput. Appl. 2022, 42, 1729.
  3. Alsuhaibani, M.; Bollegala, D. Fine-Tuning Word Embeddings for Hierarchical Representation of Data Using a Corpus and a Knowledge Base for Various Machine Learning Applications. Comput. Math. Methods Med. 2021, 2021, 9761163.
  4. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426.
  5. Gomez-Barrero, M.; Rathgeb, C.; Li, G.; Ramachandra, R.; Galbally, J.; Busch, C. Multi-biometric template protection based on bloom filters. Inf. Fusion 2018, 42, 37–50.
  6. Podder, S.; Mukherjee, S. A bloom filter-based data deduplication for big data. In Proceedings of the International Conference on Data and Information Systems, Singapore, 20–22 July 2018; pp. 161–168.
  7. Khan, R.U.; Qamar, A.M.; Alsuhibany, S.A.; Alsuhaibani, M. The impact of check bits on the performance of bloom filter. CMC-Comput. Mater. Contin. 2022, 73, 6037–6046.
  8. Mitzenmacher, M. Compressed bloom filters. In Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing, Newport, RI, USA, 26–29 August 2001; pp. 144–150.
  9. Guo, D.; Wu, J.; Chen, H.; Luo, X. Theory and network applications of dynamic bloom filters. In Proceedings of the IEEE INFOCOM 2006, 25th IEEE International Conference on Computer Communications, Barcelona, Spain, 23–29 April 2006; pp. 1–12.
  10. Laufer, R.P.; Velloso, P.B.; Duarte, O. Generalized bloom filters. In Electrical Engineering Program, COPPE/UFRJ, Tech. Rep. GTA-05-43; Grupo de Teleinformatica e Automacao (GTA)—Computer Networking and Automation Group: Rio de Janeiro, Brazil, 2005.
  11. Zhu, Y.; Jiang, H.; Wang, J. Hierarchical bloom filter arrays (HBA): A novel, scalable metadata management system for large cluster-based storage. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, San Diego, CA, USA, 20–23 September 2004; pp. 165–174.
  12. Kumar, A.; Xu, J.; Li, L.; Wang, J. Space-code bloom filter for efficient traffic flow measurement. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, Miami Beach, FL, USA, 27–29 October 2003; pp. 167–172.
  13. Alsuhibany, S.A.; Alsuhaibani, M.; Khan, R.U.; Qamar, A.M. Performance Analysis of Bloom Filter for Big Data Analytics. Comput. Intell. Neurosci. 2022, 2022, 2414605.
  14. Christen, P.; Schnell, R.; Vatsalan, D.; Ranbaduge, T. Efficient cryptanalysis of bloom filters for privacy-preserving record linkage. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Jeju, Republic of Korea, 23–26 May 2017; pp. 628–640.
  15. Reviriego, P.; Sánchez-Macián, A.; Rottenstreich, O.; Larrabeiti, D. Adaptive One Memory Access Bloom Filters. IEEE Trans. Netw. Serv. Manag. 2022, 19, 848–859.
  16. Abdennebi, A.; Kaya, K. A Bloom Filter Survey: Variants for Different Domain Applications. arXiv 2021, arXiv:2106.12189.
  17. Wu, Y.; He, J.; Yan, S.; Wu, J.; Yang, T.; Ruas, O.; Zhang, G.; Cui, B. Elastic Bloom Filter: Deletable and Expandable Filter Using Elastic Fingerprints. IEEE Trans. Comput. 2021, 71, 984–991.
Figure 1. Overall summary of the proposed content-based methodology steps.
Figure 2. Role of the content bits in reducing the false positives.
Figure 3. Role of the size-based content bits in the reduction in false positives.
Figure 4. Variation in time with respect to increasing content bits.
Figure 5. Effect of using more hash functions on the false positives.
Table 1. Parameters employed for content bits' generation based on the text's content.

Parameters            Values
Content bit offset    1
BF size               500
Hash functions        3