A Bloom Filter for High Dimensional Vectors

Chunyan Shuai 1, Hengcheng Yang 2, Xin Ouyang 3,* and Zeweiyi Gong 2
1 Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650221, China; earth0806@kmust.edu.cn or earth0806@sina.com
2 Faculty of Electric Power Engineering, Kunming University of Science and Technology, Kunming 650221, China; yanghengcheng112@163.com (H.Y.); earth_0806@163.com (Z.G.)
3 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650221, China
* Correspondence: oyx@kmust.edu.cn or kmoyx@hotmail.com; Tel.: +86-871-6530-0096


Introduction
With the development of sensor, communication and storage technology, big data with high dimensions have brought new challenges to the retrieval and storage of data. In high-dimensional spaces, exact search methods, such as kd-tree approaches [1] and Q-gram [2], are only suitable for small sets of vectors because of their very large computational costs, so there is a compromise between efficiency and accuracy. The Bloom filter (BF) [3] and its variants [4,5], a type of space-efficient, constant-query-delay randomized data structure, have been applied broadly to represent big sets and answer membership queries [4,5], including IP address lookup [6,7], routing-table lookup [8], hardware string matching [9,10], cooperative Web caching [11], intrusion detection [12-15], and so on.
The standard BF [3] consists of a bit (or counter) array and multiple string hash functions, and stores all elements of a given set into the bit array through these hash functions. During the past 30 years, different variants have optimized and extended the BF from different perspectives, making BFs suitable for different circumstances and requirements [4]. Their performance is comprehensively discussed in References [16,17], including memory cost, set size, false positive probability (FPP) and the number of hash functions. Reference [18] analyzes the conditions under which a paradox occurs in the BF and demonstrates that it highly depends on the prior probability that a given element belongs to the represented set. Other variants, such as the Complement BF [19], extend the structure in further directions. In LSH-based BFs (LshBFs), the projected elements aggregate around the mean; for example, approximately 68.5% of elements are projected to locations between the negative and positive variance under a Gaussian distribution. Through LSH mapping, the LshBFs transform high-dimensional vectors into real numbers, which avoids the dimension disaster but brings computing and space overheads, and the aggregation of neighbors causes high FPPs. Thus, LshBFs are only suitable for approximate membership query (AMQ), not exact membership query.
Whatever the types of the elements may be, the BF and the variants in categories 1 and 2 above view every element of a set as a flat one-dimensional string. By iteratively computing over the characters of each input, they yield a random hash value and project it into a fixed array. For elements whose dimensions are all numerical, such as pictures, the P-stable-function-based LshBFs [29-32] yield high FPPs for membership query. To answer membership lookups over a large set of high-dimensional vectors, this paper modifies the string hash functions of the BF and implements a new BF structure, denoted HDBF. The experiments demonstrate that HDBF matches the data-discretization performance of CBF and can efficiently deal with vectors in high-dimensional numerical spaces. The main contributions of this paper are as follows.
1. The modified hash functions can effectively discretize vectors with high numerical dimensions, uniformly and randomly. Based on the modified hash functions, HDBF extends the Bloom filter to represent and query numerical vectors in high-dimensional spaces.
2. Compared with CBF, HDBF is more efficient in dealing with numerical dimensions, and can replace CBF in high-dimensional numerical spaces.
3. HDBF outperforms CBF in false positive probability, query delay and memory cost, especially in high-dimensional numerical spaces.
4. Different from parallel BFs (PBFs), HDBF does not suffer the dimension disaster; moreover, it has lower memory and query overheads than PBFs.

Bloom Filter
Definition 1. A standard BF [3] applies an array of m bits, initially all set to 0, and k independent hash functions h_i to represent a set S = {a_1, ..., a_n} of n elements, as shown in Figure 1a. If an element a_j is mapped into the BF by h_i, the corresponding bit h_i(a_j)%m is set to 1. Given a query q, by the k hash mappings h_i(q)%m, the BF answers whether q is a member of S with a false positive probability (FPP). CBF [20] replaces the bit array with a 4-bit counter array to support element deletion, as shown in Figure 1b.
Information 2017, 8, x

A BF consists of hash functions and the bit array. This paper takes sax_hash, a classical string hash function [33] used by the BF and most of its descendants, as an example to illustrate the working mechanism of string hash functions, as shown in Figure 1 and Algorithm 1. Given a three-dimensional vector x with numerical dimensions 123, 213 and 89, when the vector is input to sax_hash, it is regarded as the string key = "123, 213, 89". By taking every character's ASCII code {"49", "50", "51", "44", ...} and applying bitwise operations, sax_hash obtains a hash value in the range [0, 2^31); the iteration is shown in Figure 1 and lines 3 and 4 of Algorithm 1. Other string hashes, such as RSHash and APHash, operate similarly. Through k different string hash computations and the MOD mapping h_i(x)%m, the input is stored in the bit (or counter) array.

High Dimensional Bloom Filter (HDBF)
The BF assumes that all elements of a set can be randomly and uniformly scattered into a range of integers. The string hash functions should therefore satisfy: (1) different vectors are projected to different values by the same hash function; (2) the same vector is projected to different values by different hash functions; and (3) the avalanche effect [34]: the change of a single character brings a big change in the hash value.
The sax_hash regards the input parameter "key" as a one-dimensional string, and every character in the "key" is computed. In fact, the characters that sax_hash operates on are ASCII codes, which are a series of integers. This implies that the input can be modified to an integer array; thus, by modifying the input, we can extend the BF into high-dimensional spaces. The modified sax_hash, called HDsax_hash, differs only in replacing the input string (char *key) with an integer array (int *key), as shown in Algorithm 2. The HDsax_hash regards the input as a three-dimensional integer vector. By bitwise operations on the integers 123, 213 and 89, as shown in lines 3 and 4 of Algorithm 2 and Figure 2, HDsax_hash computes a hash value of the three-dimensional vector.

Similar modifications are applied to the other string hash functions, yielding the high-dimensional integer hash (HDIH) function family. Based on the HDIH family and a counter array, a new BF structure, denoted HDBF, is constructed to store and query vectors with high numerical dimensions in a large set.

Definition 2. A high-dimensional integer BF (HDBF) applies an array of m counters, initially all set to 0, and k independent HDIH functions HD_h_i to represent a set S = {V_1, V_2, ..., V_n} of n vectors, where any vector V_j = (v_j1, ..., v_jd) has d numerical dimensions, v_jl ∈ I, as shown in Figure 2. If vector V_j is mapped into the HDBF by HD_h_i, the corresponding counter HD_h_i(V_j)%m is increased by 1. Given a query q, if all k counters HD_h_i(q)%m are greater than 0, the HDBF regards q as a member of S with a FPP; if not, the query is certainly not in the set S.

Performances
Due to their identical data processing, counter array and data type, CBF [20] and HDBF have the same performance: each maps a vector into an integer in the range [0, 2^31), randomly and uniformly. After n vectors are mapped into a counter array of size m by k HDIH functions, the probability that any one counter is still 0 is

p_0 = (1 − 1/m)^(kn) ≈ e^(−kn/m). (3)

If a false positive occurs, all the corresponding counters must be at least 1, so the false positive probability (FPP) is

f_HDBF = (1 − e^(−kn/m))^k. (4)

From Equation (4), the memory required by HDBF is

m = −kn / ln(1 − f_HDBF^(1/k)). (5)

Let the upper limit of the FPP of the HDBF be f_0. For fixed m and k, from Equation (4), the maximum number of vectors the HDBF can represent is n_0, with

n_0 = −(m/k) ln(1 − f_0^(1/k)). (6)

In terms of Equation (4), let g(k) = k ln(1 − e^(−kn/m)); then f_HDBF = e^(g(k)). To find the minimum of f_HDBF, g(k) is differentiated with respect to k:

dg(k)/dk = ln(1 − e^(−kn/m)) + (kn/m) · e^(−kn/m) / (1 − e^(−kn/m)). (7)

Setting dg(k)/dk = 0 yields the number of hash functions that minimizes the FPP:

k = (m/n) ln 2. (8)

Since HDBF needs only k hash computations for a query/deletion/insertion, the query time complexity is

O(k). (9)

Dataset and Settings
Since there are no benchmark datasets for BFs, Color [35], Sift and Gist [36], which are used in most experiments, are adopted to compare and test the performance of the different variants. The Color dataset includes 70 K vectors with 32 dimensions; since all dimension values are less than 1, we expanded them into integers. Sift and Gist contain 100 K vectors with 128 and 300 dimensions, respectively, and their dimension values are positive integers. The 10 K query vectors are all different from the samples. The experiments ran on a computer with an Intel Xeon E5-2603 v3 and 16 GB RAM. The schemes compared are CBF [20], PBF-HT and PBF-BF [28], in which every counter of the arrays takes up 4 bits.

Distribution and Entropy
Since distribution and entropy reflect the discrete state of the data, to check whether HDBF can scatter high-dimensional vectors into different integers randomly and uniformly, this section first compares the distribution and entropy of HDBF with those of CBF on the three datasets. Let v be the value of a counter after n vectors are projected by k hash functions, and let p = v/kn be the selection frequency of that counter. The entropy of the counter array is defined as

E = −Σ_{i=1..m} p_i log p_i.

Given m = 25n, Figure 3 shows the distributions of CBF and HDBF after the different datasets are projected into the counter arrays by 6 different hash functions. Figure 3a-c shows the distributions of CBF, where the maximum counter values are 6, 6 and 12 on Color, Sift and Gist, respectively. Figure 3e-g shows the distributions of HDBF, in which the maximum values are 6, 6 and 7, respectively. This illustrates that the HDIH functions possess almost the same discretization ability as the string hashes of CBF.
Figure 4a-c displays the increase of the entropies of HDBF and CBF as the number of samples grows, for d = 32, 128 and 300. For fixed n, Figure 4d,e shows how the entropies change as the dimension increases. In Figure 4a,d, HDBF and CBF have similar entropies in low-dimensional spaces (d ≤ 32). As the dimension increases (Figure 4b,c,e,g), the entropies of HDBF become slightly larger than those of CBF, most visibly for Sift and Gist, where d > 32. This means that HDBF is superior to CBF in data discretization, especially in high-dimensional spaces.

FPP
Figure 5 displays how the FPP changes as k increases, for different memory costs and fixed samples. From Equation (8) in Section 4, for fixed m and n there is an optimal number of hash functions: the FPPs first decrease to a minimum, then increase as k grows. CBF and HDBF show the same tendency.

From Equation (4), for fixed k = 6 and fixed memory cost, the FPP increases with sample growth, even reaching 1, as Figure 6 shows. Conversely, for fixed k and n, the FPP decreases as memory increases, as shown in Figure 7. From Figures 5-7 we can see that the FPPs of CBF and HDBF have almost the same values and similar tendencies, and both closely meet the false positive probability requirements.

The above discussion shows that HDBF can discretize high-dimensional data randomly and uniformly, and can therefore substitute for CBF when dealing with vectors with numerical high dimensions. The following sections continue to compare HDBF with other BF-based schemes.


Memory Costs and Latency
Let the average FPP ∈ [0.0001, 0.0005], m = 25n and k = 6. Figure 8 compares the memory use of PBF-BF, PBF-HT, CBF and HDBF on the three datasets. For fixed FPPs, the memory costs of PBF-BF and PBF-HT grow linearly with the number of samples and with the dimensions, in line with the discussion in Section 2. According to Equation (4), as n grows, m must grow to keep the FPP constant, so the memory usage of HDBF and CBF increases with the number of samples (Figure 8a-c) but is not affected by the dimensions (Figure 8d-e). Once the dimensions are greater than 1, the memory costs of PBF-BF and PBF-HT are far higher than those of CBF and HDBF, as shown in Figure 8d-f.

Under 10 K query vectors, the average initiation and query times of CBF and HDBF are less than those of PBF-HT and PBF-BF, as shown in Figures 9 and 10. Since all schemes need to split vectors and project all dimensions into the corresponding arrays, the initiation time keeps increasing with the samples and dimensions; however, CBF and HDBF grow far more slowly than PBF-BF and PBF-HT, as shown in Figure 9. Compared with PBF-BF and PBF-HT, CBF and HDBF only require dividing the dimensions and computing the hash values, so their query times increase slightly with the dimensions (Figure 10d-e) but are constant as the cardinality increases (Figure 10a-c), which is consistent with Equation (9). Since PBF-HT and PBF-BF contain multiple BFs, a query stops as soon as any one BF returns 0, so their query time fluctuates slightly as the dimensions increase.


Conclusions
With the development of computer technology, data dimensions and sizes increase quickly, and tools and methods for dealing with high-dimensional data are urgently required. Although several BF variants offer data structures for high-dimensional data, they suffer from problems such as high temporal and spatial costs. In this paper, we proposed a new hash family, called HDIH, to map vectors with high dimensions. Based on the HDIH family and a counter array, a new Bloom filter structure, denoted HDBF, was built to represent and query vectors with high numerical dimensions in a large set. HDBF regards the elements of a set as vectors rather than strings. By iteratively operating on the dimensions of the input vectors, HDBF translates the vectors into a series of integers, randomly and uniformly. This paper theoretically discussed the relationships among the false positive probability, memory cost and number of hash functions of HDBF. The experiments showed that the distribution of HDBF is almost the same as that of CBF, and that the entropy of HDBF in high-dimensional spaces is slightly larger than that of CBF. This means that HDBF has a better data-discretization ability than CBF and can replace CBF for dealing with high-dimensional vectors randomly and uniformly. Compared with PBF-BF and PBF-HT, HDBF has lower memory and query overheads, and its memory cost and query time are not affected by the dimensions. Therefore, HDBF, as a substitute for CBF, is suitable for representing and querying numerical vectors in a high-dimensional space.


Figure 1. Structure of high-dimensional vector Bloom filter.

Figure 2. Structure of high dimension vector Bloom filter.


Figure 3. The distributions of CBF and HDBF for different samples.

Figure 4. Entropies of CBF and HDBF for different samples.

Figure 5. FPPs of the CBF and HDBF for different k and memory cost.


Figure 6. FPPs of the CBF and HDBF for different samples.

Figure 7. FPPs of the CBF and HDBF for different memory costs.
