Associative Memories to Accelerate Approximate Nearest Neighbor Search

Nearest neighbor search is a very active field in machine learning, as it appears in many applications, including classification and object retrieval. In its canonical version, the complexity of the search is linear in both the dimension and the cardinality of the collection of vectors the search is performed in. Recently, many works have focused on reducing the dimension of vectors using quantization techniques or hashing, while providing an approximate result. In this paper we focus instead on tackling the cardinality of the collection of vectors. Namely, we introduce a technique that partitions the collection of vectors and stores each part in its own associative memory. When a query vector is given to the system, associative memories are polled to identify which one contains the closest match. Then an exhaustive search is conducted only on the part of the vectors stored in the selected associative memory. We study the effectiveness of the system when the stored messages are generated from i.i.d. uniform $\pm$1 random variables or from sparse i.i.d. 0-1 random variables. We also conduct experiments on both synthetic and real data and show it is possible to achieve interesting trade-offs between complexity and accuracy.


Introduction
Nearest neighbor search is a fundamental problem in computer science that consists of finding the data point, in a previously given collection, that is the closest to some query point according to a specific metric or similarity measure. When points are distributed uniformly and independently in the search space, and when the space is much larger than the cardinality of the set of data points, it is well known that one cannot expect to solve this problem with a complexity lower than linear in the cardinality and dimensionality of the data points. This complexity is achieved by an exhaustive search in which the distances between all data points and the query are computed. Nearest neighbor search is found in a wide set of applications, including object retrieval, classification, computational statistics, pattern recognition, vision, and data mining. In many of these domains, collections and dimensions are large, leading to unreasonable computation time when using exhaustive search. For this reason, many recent works have focused on approximate nearest neighbor search, where complexities can be greatly reduced at the cost of a nonzero error probability in retrieving the closest match. Most methods act either on the cardinality [15,14,5] or on the dimensionality [10,1,9] of the data points. Reducing cardinality often requires the ability to partition the space in order to identify interesting regions that are likely to contain the nearest neighbor, thus reducing the number of distances to compute, whereas reducing dimensionality is often achieved through quantization techniques, and thus approximate distance computation. In this note we present and investigate an alternative approach that was introduced in [20] in the context of sparse data. Its key idea is to partition the patterns into equal-sized classes and to compute the overlap of an entire class with the given query vector using associative memories based on neural networks. If this overlap for a given class is above a certain threshold, we
decide that the given vector is similar to one of the vectors in the considered class and we perform exhaustive search on this class only. Obviously, this is a probabilistic method that comes with a certain probability of failure. We will impose two restrictions on the relation between the dimension and the size of the database, or equivalently, between the dimension and the number of vectors in each class: on the one hand, we want the size of the database to be so large that the algorithm is efficient compared to other methods (primarily exhaustive search); on the other hand, we need it to be small enough for the error probability to converge to 0 as the dimension (and possibly the size of the database) converges to infinity. More precisely, in this note we will consider two principally different scenarios: one where the input patterns are sparse, i.e. consist of 0s and 1s with a majority of 0s, and another where the data are "unbiased". The latter situation is best modeled by choosing the coordinates of all the input vectors as independent and identically distributed (i.i.d. for short) $\pm$1-random variables with equal probabilities for +1 and −1. Even though the principal method is similar, both cases have their individual difficulties. As a matter of fact, similar problems occur when analyzing the Hopfield model [8] of neural networks for unbiased patterns (see [13]) or for sparse, biased patterns (see e.g. [6] or [12]). We will analyze the situation of sparse patterns in Section 3, while Section 4 is devoted to the investigation of the situation where the patterns are unbiased and dense. In Section 5 we will support our findings with simulations on classical approximate nearest neighbor search benchmarks.
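As a concrete illustration of the pipeline just described, the following minimal sketch (Python/numpy; the function names and parameter values are ours, chosen for illustration) summarizes each class by the sum of the outer products of its members, so that the class score $x_0^\top M_i x_0 = \sum_{\mu \in X_i} \langle x_0, x^\mu\rangle^2$ is obtained with one matrix-vector product per memory:

```python
import numpy as np

def build_memories(patterns, q):
    """Split n patterns into q equal-sized classes; each class X_i is
    summarized by the d x d memory M_i = sum_{mu in X_i} x^mu (x^mu)^T."""
    n, d = patterns.shape
    k = n // q
    classes = [patterns[i * k:(i + 1) * k] for i in range(q)]
    memories = [X.T @ X for X in classes]
    return classes, memories

def search(query, classes, memories):
    """Poll every memory with the score x0^T M_i x0 (the squared overlaps of
    the whole class with the query), then run exhaustive search only inside
    the best-scoring class."""
    scores = [query @ M @ query for M in memories]
    best = int(np.argmax(scores))
    X = classes[best]
    return X[np.argmax(X @ query)]   # exhaustive search, restricted to one class

# illustrative run: dense unbiased +-1 patterns, query equal to a stored one
rng = np.random.default_rng(0)
d, n, q = 128, 256, 8
patterns = rng.choice([-1.0, 1.0], size=(n, d))
classes, memories = build_memories(patterns, q)
result = search(patterns[3], classes, memories)
```

With these proportions (class size $k = 32$, well below $d^2$) the correct class wins with overwhelming probability; the analysis of the following sections makes this precise.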

Related work
Part of the literature focuses on using tree-based methods to partition the space [14], resulting in difficulties when facing high-dimensional spaces [15]. When facing real-valued vectors, many techniques propose binary encodings of vectors [5,7,19], leading to very efficient reductions in time complexity. As a matter of fact, search in Hamming space can be performed very efficiently, but it is well known that premature discretization of data usually leads to a significant loss in precision. In order to achieve better performance, quantization techniques based on Product Quantization (PQ) and Locality Sensitive Hashing (LSH) can be used. In PQ [10], vectors are split into subvectors. Each subvector space is quantized independently from the others. Then a vector is quantized by concatenating the quantized versions of its subvectors. Optimized versions of PQ have been proposed by performing joint optimization [4,17]. In LSH, multiple hash functions are defined, resulting in storing each point index multiple times [18,1]. In [20], the authors propose to split data points into multiple parts, each one stored in its own sparse associative memory. The ability of these memories to store a large number of messages results in very efficient reductions in time complexity when data points are sparse binary vectors. In [3], the authors combine this method with PQ, extending it to real-valued vectors. In [9], the authors propose to use sums of vectors instead of associative memories and discuss optimization strategies (including online scenarios). The methods we analyze in this note are in the same vein.

Search Problems with sparse patterns
In this section we treat the situation where we try to find the pattern that is closest to some input pattern and all the patterns are binary and sparse. To be more precise, assume that we are given $n$ sparse vectors or patterns $x^\mu$, $\mu = 1, \ldots, n$, with $x^\mu \in \{0,1\}^d$ for some (large) $d$. Assume that the $(x^\mu)$ are random and i.i.d. and have i.i.d. coordinates such that $\mathbb{P}[x^\mu_i = 1] = \frac{c}{d} = 1 - \mathbb{P}[x^\mu_i = 0]$ for all $i = 1, \ldots, d$ and all $\mu = 1, \ldots, n$. To describe a sparse situation we will also assume that $c$ may depend on $d$ but that $c/d \to 0$ as $c$ and $d$ become large. Actually, we will even need that this convergence is fast enough. Moreover, we will always assume that there is an $x^\mu$ such that the query pattern $x^0$ has a macroscopically large overlap with $x^\mu$, hence that $\sum_{l=1}^d x^0_l x^\mu_l$ is of order $c$. We will actually begin with the situation where $x^0 = x^\mu$ for some index $\mu$. The situation where $x^0$ is a perturbed version of $x^\mu$ is then similar. The algorithm proposed by Gripon et al. in [20,3] partitions the patterns into equal-sized classes $X_1, \ldots, X_q$ with $|X_i| = k$ for all $i = 1, \ldots, q$; thus $n = kq$. Afterwards one computes the score for each of these classes and takes the class $X_i$ with the highest score. Then, on this class only, one performs exhaustive search. We will analyze under which conditions on the parameters this algorithm works well and when it is effective. To this end, we will show the following theorem.
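In the sparse case, polling a memory is particularly cheap: since $x^0$ has only about $c$ nonzero coordinates, the score only reads the submatrix of the class memory indexed by the query's support. A small sketch under the i.i.d. model of this section (parameter values are illustrative, not prescribed by the analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, k, q = 256, 12, 16, 16        # sparse regime: c/d is small
n = k * q

# i.i.d. patterns with P(x_i = 1) = c/d
patterns = (rng.random((n, d)) < c / d).astype(np.int64)
classes = [patterns[i * k:(i + 1) * k] for i in range(q)]
memories = [X.T @ X for X in classes]  # d x d co-activation counts per class

x0 = patterns[0]                       # query equal to a pattern of class 0
support = np.flatnonzero(x0)           # about c active coordinates

# each poll costs O(c^2) instead of O(d^2): sum the support x support block
scores = [M[np.ix_(support, support)].sum() for M in memories]
best = int(np.argmax(scores))
```

The correct class's score is at least the squared support size (it contains the query's own contribution), which with these proportions dominates the accidental overlaps accumulated by the other classes.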
Theorem 3.1. Assume that the query pattern is equal to one of the patterns in the database. The algorithm described above works asymptotically correctly, i.e. with an error probability that converges to 0, and is efficient, i.e. requires fewer computations than exhaustive search, if $d \ll k \ll d^2$ and $q \ll e^{\frac{1}{32}\frac{d^2}{k}}$.

Proof. To facilitate notation assume that $x^0 = x^1$ (which is of course unknown to the algorithm), that $x^1 \in X_1$, that $x^1_i = 1$ for the first $\tilde c$ indices $i$, and that $x^1_i = 0$ for the others. Note that for any $\varepsilon > 0$, with high probability for $c$ and $d$ large, we may assume that $(1-\varepsilon)c \le \tilde c \le (1+\varepsilon)c$. Then for any $i$ we have $s(X_i, x^0) = \sum_{\mu: x^\mu \in X_i} \sum_{l,m=1}^{\tilde c} x^\mu_l x^\mu_m$, and in particular $s(X_1, x^0) \ge \sum_{l,m=1}^{\tilde c} x^1_l x^1_m = \tilde c^2$.
We now want to show that for a certain range of the parameters $n$, $k$, and $d$ this algorithm works reliably, i.e. we want to show that $\mathbb{P}(\exists i \neq 1: s(X_i, x^0) \ge s(X_1, x^0)) \to 0$. Since $\tilde c$ and $c$ are close together when divided by $d$, we will replace $\tilde c$ by $c$ everywhere without changing the proof. Trivially, $\mathbb{P}(\exists i \neq 1: s(X_i, x^0) \ge s(X_1, x^0)) \le q\, \mathbb{P}(s(X_2, x^0) \ge s(X_1, x^0))$, and we just need to bound the probability on the right-hand side. Taking the $X_\mu$ to be i.i.d. $B(c, \frac{c}{d})$-distributed random variables, we may rewrite the probability on the right-hand side accordingly and center the variables. Now, as $c \ll d$, the term $\frac{c^4}{d^2}$ is negligible with respect to $c^2$, so the bound holds up to terms of vanishing order. Let us treat the last term first: for $t > 0$ the corresponding bound converges if we take $\varepsilon_d \to 0$ for $d \to \infty$. This basically always works if $k \le d^{15}$.
The other summand is slightly more complicated: on the right-hand side we can take any fixed $\mu$.
To estimate the expectations we apply a version of Lindeberg's replacement trick ([11] or [2]). To this end, define $S_c$ as the corresponding sum in which the original summands are replaced by i.i.d. standard Gaussian random variables $(\xi_i)$. Then we obtain by Taylor expansion an estimate in which $\zeta_1$ and $\zeta_2$ are random variables lying within the intervals $[T_k, T_k + \sigma_k]$ and $[T_k, T_k + \xi_k]$, respectively (possibly with the right and left boundaries interchanged, if $\sigma_k$ or $\xi_k$ are negative). First of all, observe that for each $k$ the random variable $T_k$ is independent of $\xi_k$ and $\sigma_k$, and that $\xi_k$ and $\sigma_k$ have matching first and second moments; the corresponding first-order terms therefore cancel. For the second term on the right-hand side, observe that $\mathbb{E}[\xi^\nu e^{t\xi^2}] < \infty$ for $\nu = 0, 1, 2, 3$ for $t$ small enough, and that the corresponding expectation is uniformly bounded in the number of summands $c$ for $\nu = 0, 1, 2, 3$ if $t$ is small enough. Moreover, for $t$ smaller than $1/2$ we have by Gaussian integration $\mathbb{E}[e^{t\xi^2}] = \frac{1}{\sqrt{1-2t}}$ (1). In a similar fashion we can bound the remaining term, where now we make use of $\mathbb{E}[e^{-t\xi^2}] = \frac{1}{\sqrt{1+2t}}$ instead of (1). Altogether we obtain a bound in which, since we always assume $c \to \infty$, the term $\frac{kt^2}{\sqrt{c}}$ is negligible with respect to $kt^2$. Choosing $t = \frac{d}{8k}$, which in particular ensures that $t$ converges to 0 for large dimensions, so that (1) can be applied, we obtain that the error probability goes to 0 asymptotically whenever $d^2 \gg k$. Hence for $d \ll k \ll d^2$ we obtain that the method works fine if $q \ll e^{\frac{1}{32}\frac{d^2}{k}}$, which is our last condition.
In a very similar fashion we can treat the case of an input pattern that is a corrupted version of one of the vectors in the database.
Corollary 3.2. Assume that the input pattern $x^0$ has a macroscopic overlap with one of the patterns in the database, i.e. $\sum_{l=1}^d x^0_l x^\mu_l = \alpha c$ for an $\alpha \in (0, 1)$, and that $x^0$ has $c$ entries equal to 1 and $d - c$ entries equal to 0. The algorithm described above works asymptotically correctly, i.e. with an error probability that converges to 0, and is efficient if $d \ll k \ll d^2$ and $q \ll e^{\frac{\alpha^2}{32}\frac{d^2}{k}}$.

Proof. The proof is basically a rerun of the proof of Theorem 3.1. Again, without loss of generality, assume that $x^1$ is our target pattern in the database such that $\sum_{l=1}^d x^0_l x^1_l = \alpha c$, that $x^1 \in X_1$, that $x^0_i = 1$ for the first $c$ indices $i$, and that $x^0_i = 0$ for the others. Then $s(X_i, x^0)$ is structurally the same as in the previous proof. Therefore, following the lines of the proof of Theorem 3.1 and using the notation introduced there, we find that the second summand again vanishes as long as $k \le d^{15}$, while for the first summand we obtain exactly the same bound as in the previous proof. Now we choose $t = \frac{\alpha^2 d}{8k}$ to conclude, as in the proof of Theorem 3.1, that the error probability converges to 0 by assumption.

Dense, unbiased patterns
Contrary to the previous section, we will now treat the situation where the patterns are neither sparse nor biased. This is best modeled by choosing $x^\mu \in \{-1, 1\}^d$ for some (large) $d$. We will now assume that the $x^\mu$, $\mu = 1, \ldots, n$, are i.i.d. and have i.i.d. coordinates such that $\mathbb{P}[x^\mu_l = 1] = \mathbb{P}[x^\mu_l = -1] = \frac12$ for all $l = 1, \ldots, d$ and all $\mu = 1, \ldots, n$. Again we will suppose that there is an $x^\mu$ whose overlap with $x^0$ is macroscopically large. To avoid technical problems we start with the situation where $x^0 = x^1$. Again we partition the patterns into equal-sized classes $X_1, \ldots, X_q$ with $|X_i| = k$ for all $i = 1, \ldots, q$, and compute $s(X_i, x^0) = \sum_{\mu: x^\mu \in X_i} \sum_{l,m=1}^{d} x^0_l x^0_m x^\mu_l x^\mu_m$ as the overlap of a class $i$ with the query pattern. We will prove:

Theorem 4.1. Assume that the query pattern is equal to one of the patterns in the database. The algorithm described above works asymptotically correctly and is efficient if (3) holds and $q e^{-C\frac{d^2}{k}} \to 0$ for some $C > 0$.
Proof. Without loss of generality, we may again suppose that $x^0 = x^1$, and now that $x^0_l = x^1_l \equiv 1$ for all $l = 1, \ldots, d$, since otherwise we can flip the spins of the corresponding coordinates in all patterns (recall that the $x^\mu_l$ are all unbiased i.i.d.). Then the above expression simplifies to $s(X_i, x^0) = \sum_{\mu: x^\mu \in X_i} \sum_{l,m=1}^{d} x^\mu_l x^\mu_m$, and in particular $s(X_1, x^0) \ge \sum_{l,m=1}^{d} x^1_l x^1_m = d^2$.
Again we want to know for which parameters $\mathbb{P}(\exists i \neq 1: s(X_i, x^0) \ge s(X_1, x^0)) \to 0$. Trivially, this probability is bounded by $q\, \mathbb{P}(s(X_2, x^0) \ge s(X_1, x^0))$. By the exponential Chebyshev inequality we obtain a bound for any $t > 0$. To calculate the expectations, we introduce a standard normal random variable $y$ that is independent of all other random variables occurring in the computation, and use Gaussian expectation and the Fubini property. Here we used the well-known estimate $\cosh(x) \le e^{\frac{x^2}{2}}$ for all $x$. On the other hand, similarly to the above, again by introducing a standard Gaussian random variable $y$, we arrive at the corresponding bound. To compute the latter we write, for some $\varepsilon_d > 0$ that converges to zero as $d \to \infty$ at a speed to be chosen later, an estimate using the well-known bound $\mathbb{P}[y \ge u] \le e^{-\frac{u^2}{2}}$ for all $u \ge 0$. We first choose $\varepsilon_d = k^{-1/4}\tilde\varepsilon_d$ for some small $\tilde\varepsilon_d$ converging to 0 as $d$ goes to infinity, such that the first factor on the right-hand side converges to 1. Then we choose $k$ and $d$ such that the third term on the right-hand side converges to one. Anticipating our choices of $t$ and $d$ such that $dt$ goes to 0, we obtain the corresponding condition as $d$ and $k$ go to infinity. Now, we always suppose that $d \ll k$. With these choices we obtain a bound, and we distinguish two cases. If $d^4 \ll k^3$ (note that we always have $d \ll k$) we expand the logarithm to obtain $\mathbb{P}(s(X_2, x^0) \ge s(X_1, x^0)) \le e^{-td^2 + 2kd^2t^2 + O(kt^4d^4)}$. Choosing $t = \frac{1}{4k}$ we see that the term $O(kt^4d^4)$ is in fact $o(1)$, and therefore, if $k \ll d^2$ and $q \ll e^{\frac{d^2}{8k}}$, the error probability vanishes. On the other hand, if $d \ll k \le O(d^{4/3})$ we choose $t = k^{-5/4}$. Then, by the same expansion of the logarithm, the leading term will always dominate and the $O$-term is again $o(1)$.
Since $k \le O(d^{4/3})$ the exponent diverges, and we see that the bound vanishes as long as $\log q$ is small compared to it. One can easily check that our first condition is satisfied as well. Again, in a similar way one can treat the case of an input pattern that is a corrupted version of one of the vectors in the database.
Corollary 4.2. Assume that the input pattern $x^0$ has a macroscopic overlap with one of the patterns, say $x^1$, in the database, i.e. $\sum_{l=1}^d x^0_l x^1_l = \alpha d$ for an $\alpha \in (0, 1)$. The algorithm described above works asymptotically correctly and is efficient if (3) holds and $q e^{-C\frac{d^2}{k}} \to 0$ for some $C > 0$.
Proof. The proof follows the lines of the proof of Theorem 4.1, using the condition that $\sum_{l=1}^d x^0_l x^1_l = \alpha d$.

Remark 4.3. One might wonder to what extent taking a higher power than two of $\sum_{l=1}^d x^0_l x^\mu_l$ in the computation of the score function changes the results. So let us assume we take
(2) $s(X_i, x^0) = \sum_{\mu: x^\mu \in X_i} \big(\sum_{l=1}^d x^0_l x^\mu_l\big)^p$ for some even $p > 2$.
Then, of course, in the setting of the proof of Theorem 4.1 we gain in the exponent. However, the probability that $s(X_2, x^0)$ is larger than $s(X_1, x^0)$ is then much more difficult to estimate. As a matter of fact, none of the techniques used in Sections 3 and 4 works, because a higher power cannot be linearized by Gaussian integration on the one hand, and exponentials of powers larger than two of Gaussian random variables are not integrable on the other. A possible strategy could rely on exponential bounds as in Proposition 3.2 in [16].
From here one also learns that in the Hopfield model with $N$ neurons and $p$-spin interaction ($p \ge 3$) the storage capacity grows like $N^{p-1}$. By analogy this could lead to the conjecture that replacing the score function by (2) leads to a class size of $k \ll d^{p}$. However, in this case the computational complexity of our algorithm would also increase.
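For concreteness, a score of the form (2) can be simulated directly (a simulation sketch, not a proof; parameter values and names are ours):

```python
import numpy as np

def score(cls, x0, p=2):
    """Class score sum_{mu in class} <x0, x^mu>^p: p = 2 is the score
    analyzed in Section 4, a larger even p the variant of Remark 4.3."""
    return float(((cls @ x0) ** p).sum())

rng = np.random.default_rng(2)
d, k, q = 128, 32, 8
patterns = rng.choice([-1.0, 1.0], size=(q * k, d))
classes = [patterns[i * k:(i + 1) * k] for i in range(q)]
x0 = patterns[0]

# the correct class contains the term <x0, x0>^p = d^p, while a wrong class
# only accumulates k overlap terms of typical size d^{p/2}
winners = [int(np.argmax([score(X, x0, p) for X in classes])) for p in (2, 4)]
```

With these proportions the gap $d^p$ versus $k\,d^{p/2}$ widens as $p$ grows, in line with the gain in the exponent mentioned above, at the price of a more expensive score computation.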

Experiments
In order to assess the performance of the proposed system for nonasymptotic parameters, we propose several experiments. In Subsection 5.1, we analyze the performance obtained on synthetic data. In Subsection 5.2, we run simulations using standard off-the-shelf real data.
5.1. Synthetic data. We first present results obtained using synthetic data. Unless specified otherwise, each drawn point is obtained using Monte-Carlo simulations with at least 100,000 independent tests.

5.1.1. Sparse patterns. Considering data drawn i.i.d. with $\mathbb{P}[x^\mu_i = 1] = \frac{c}{d} = 1 - \mathbb{P}[x^\mu_i = 0]$, recall that we have four free parameters: $c$, the number of 1s in patterns; $d$, the dimension of patterns; $k$, the number of patterns in each class; and $q$, the number of classes.
Our first two experiments consist of varying k and then q while the other parameters are fixed. We choose the parameters d = 128 and c = 8. Figure 1 depicts the rate at which the highest score is not achieved by the class containing the query vector (we call it the error rate), as a function of k and for q = 10. This function is obviously increasing with k, and presents a steep slope for small values of k, emphasizing the critical choice of k for applications. Figure 2 depicts the probability of error as a function of q, for various choices of k. Interestingly, for reasonable values of k the slope of the curve is not as dramatic as in Figure 1. When adjusting parameters, it thus seems more reasonable to increase the number of classes rather than the number of patterns in each class. This is not a surprising finding, as the complexity of the process depends on q but not on k. A typical design scenario would consist in, given data and its parameters c and d, finding the best trade-off between q and k in order to split the data into the different classes. Figure 3 depicts the error rate as a function of k when n = kq is fixed. To obtain these points we consider c = 8, d = 128, and n = 16384. There are two interesting facts pointed out by this figure. First, we see that the trade-off is not simple, as multiple local minima occur. This is not surprising since, with larger values of k, the decision becomes less precise and less conclusive. As a matter of fact, the last point in the curve only claims the correct answer is somewhere in a population of 8192 possible solutions, whereas the first point claims it is one of 64 possible ones. On the other hand, there is much more memory consumption for the first point, where 256 square matrices of dimension 128 are stored, whereas only 2 of them are used for the last point. Second, the error rate remains of the same order for all tested couples of values k and q. This is an interesting finding, as it emphasizes that the design of a solution is more about complexity vs. precision of the answer (in the sense of obtaining a reduced number of candidates) than it is about error rate. Finally, in order to evaluate both the tightness of the bounds obtained in Section 3 and the speed of convergence, we run an experiment in which c = log₂(d) and q = 2.
We then depict the error rate as a function of d for k = d^1.5, k = d^2, and k = d^2.5. The result is depicted in Figure 4. We also ran the same experiments using co-occurrence rules as initially proposed in [20,3] (instead of adding contributions from distinct messages, we take the maximum). We observed small improvements in every case, even though they are not significant.
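The synthetic sparse-pattern experiments above can be reproduced in spirit with a short Monte-Carlo routine (a sketch under the i.i.d. model of Section 3; the function name and the reduced trial count are our own choices):

```python
import numpy as np

def sparse_error_rate(d, c, k, q, trials=200, seed=0):
    """Estimate the rate at which the class containing the query does not
    achieve the highest score, for sparse i.i.d. patterns (ties are broken
    in favor of the correct class, as np.argmax returns the first maximum)."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(trials):
        patterns = (rng.random((k * q, d)) < c / d).astype(np.int64)
        x0 = patterns[0]                          # query stored in class 0
        s = np.flatnonzero(x0)
        # class score = sum over class members of (overlap with support)^2
        overlaps = [patterns[i*k:(i+1)*k][:, s].sum(axis=1) for i in range(q)]
        totals = np.array([(o ** 2).sum() for o in overlaps])
        if int(np.argmax(totals)) != 0:
            errors += 1
    return errors / trials

# e.g. sparse_error_rate(128, 8, 16, 10) mirrors the setting of Figures 1-3
rate = sparse_error_rate(128, 8, 8, 4, trials=50)
```

The paper's curves use at least 100,000 trials per point; the small trial count here only keeps the sketch fast.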

5.1.2. Dense patterns.
Let us now consider data drawn i.i.d. according to the distribution $\mathbb{P}[x^\mu_l = 1] = \mathbb{P}[x^\mu_l = -1] = \frac12$; recall that we then have three free parameters: $d$, the dimension of patterns; $k$, the number of patterns in each class; and $q$, the number of classes. Again, we begin our experiments by looking at the influence of k (resp. q) on the performance. The obtained results are depicted in Figure 5 (resp. Figure 6). To conduct these experiments, we have chosen d = 64. The global slope resembles that of Figures 1 and 2. Then, we consider the design scenario where the number of samples is known. We plot the error rate depending on the choice of k (and thus the choice of q = n/k). The results are depicted in Figure 7. Finally, we estimate the speed of convergence together with the tightness of the bound obtained in Section 4. To do so, we choose k as a function of d with q = 2. The obtained results are depicted in Figure 8. Again, we observe that the case k = d² appears to be a limit case for which the error rate remains constant.

5.2. Real data. In this section we use real data in order to assess the performance of the obtained methods on authentic scenarios. Since we want to stress the interest of using our proposed method instead of classical exhaustive search, we are mainly interested in looking at the error as a function of the complexity. To do so, we modify our method as follows: we compute the scores of each class and order them from largest to smallest. We then look at the rate at which the nearest neighbor is in one of the first p classes. With p = 1, this method boils down to the previous experiments. Larger values of p allow for more flexible trade-offs between error and performance. We call the obtained ratio the recall@1. Considering n vectors with dimension d, the computational complexity of an exhaustive search is dn (or cn for sparse vectors). On the other hand, the proposed method has a twofold computational cost: first the cost of computing each score, which is d²q (or c²q for sparse vectors), then the
cost of exhaustively looking for the nearest neighbor in the selected p classes, which is pkd (or pkc for sparse vectors). Note that the cost of ordering the q obtained scores is negligible. Several things need to be changed in order to accommodate real data. First, for nonsparse data we center the data and then project each obtained vector onto the hypersphere with radius 1. Then, due to the nonindependence of the stored patterns, we choose an allocation strategy that greedily assigns each vector to a class, rather than using a completely random allocation.
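The complexity counts above translate into a simple relative cost; the following helper (our own naming, using exactly the counts given in the text, and presumably what the relative-complexity axes of the figures report) makes the trade-off explicit:

```python
def relative_complexity(d, n, q, p, k=None):
    """(d^2 q + p k d) / (d n): cost of scoring the q memories plus
    exhaustively searching the p retained classes, relative to a plain
    exhaustive search of cost d n. For sparse vectors, pass c in place
    of d, matching the counts c^2 q and p k c given in the text."""
    if k is None:
        k = n // q          # equal-sized classes, n = k q
    return (d * d * q + p * k * d) / (d * n)

# illustrative SIFT1M-like setting (the values of q and p are arbitrary here)
ratio = relative_complexity(d=128, n=10**6, q=1000, p=10)
```

Increasing p trades complexity for recall, which is exactly the degree of freedom swept along each curve of the real-data figures.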
The allocation strategy we use consists of the following: each class is initialized with a random vector drawn without replacement. Then each remaining vector is assigned to the class that achieves the maximum normalized score. Scores are normalized in two ways: they are divided by the number of items κ currently contained in the class, and multiplied by k − κ in order to favor classes with fewer elements. Note that more standard clustering techniques could be used instead. The interest of this allocation strategy is emphasized in Figure 9, where we use raw MNIST data to illustrate our point. MNIST data consists of 784-dimensional grey-level images representing handwritten digits. There are 60,000 reference images and 10,000 query ones. For various values of k, we compare the performance obtained with our proposed allocation against a completely random one. We also run experiments on a binary database. It consists of Santander customer satisfaction sheets associated with a Kaggle contest. There are 76,000 vectors with dimension 369, containing 33 nonzero values on average. In this first experiment, the vectors stored in the database are the ones also used to query it. The obtained results are depicted in Figure 10.

Figure 10. Recall@1 on the Santander customer satisfaction dataset as a function of the relative complexity of the proposed method with regard to an exhaustive nearest neighbor search, for various values of k. Each curve is obtained by varying the value of p.
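The greedy allocation just described can be sketched as follows (a minimal interpretation of the text; details such as tie-breaking and the exact order of normalization are our own assumptions):

```python
import numpy as np

def greedy_allocate(patterns, q, k, seed=0):
    """Assign n <= q*k vectors to q classes of capacity k: seed each class
    with a random vector (drawn without replacement), then give every
    remaining vector to the class maximizing (x^T M_i x) / kappa_i * (k - kappa_i),
    where M_i is the class memory and kappa_i the current class size."""
    rng = np.random.default_rng(seed)
    n, d = patterns.shape
    order = rng.permutation(n)
    memories = np.zeros((q, d, d))
    members = [[] for _ in range(q)]
    for i in range(q):                      # one random seed vector per class
        x = patterns[order[i]]
        memories[i] += np.outer(x, x)
        members[i].append(int(order[i]))
    for mu in order[q:]:
        x = patterns[mu]
        raw = np.einsum('i,qij,j->q', x, memories, x)   # x^T M_i x per class
        kappa = np.array([len(m) for m in members], dtype=float)
        # full classes are excluded; small classes are favored by (k - kappa)
        norm = np.where(kappa < k, raw / kappa * (k - kappa), -np.inf)
        i = int(np.argmax(norm))
        memories[i] += np.outer(x, x)
        members[i].append(int(mu))
    return members, memories

# illustrative run on random +-1 vectors
rng = np.random.default_rng(3)
members, memories = greedy_allocate(rng.choice([-1.0, 1.0], size=(40, 16)), q=4, k=10)
```

As noted in the text, more standard clustering techniques (e.g. balanced k-means variants) could replace this heuristic; the point is only that correlated real data benefit from a data-dependent split.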
Then, we run experiments on the SIFT1M dataset. This dataset contains 1 million 128-dimensional SIFT descriptors obtained using a Hessian-affine detector, plus 10,000 query ones. The obtained results are depicted in Figure 11. To stress the consistency of the results, we also run experiments on the GIST1M dataset. It contains 1 million 960-dimensional GIST descriptors and 1,000 query ones. The obtained results are depicted in Figure 12.

Figure 12. Recall@1 on the GIST1M dataset as a function of the relative complexity of the proposed method with regard to an exhaustive nearest neighbor search, for various values of k. Each curve is obtained by varying the value of p.

Conclusion
We introduced a technique to perform approximate nearest neighbor search using associative memories to eliminate most of the unpromising candidates. Considering independent and identically distributed binary random variables, we showed there exist asymptotic regimes for which both the error probability tends to zero and the complexity of the process becomes negligible compared to an exhaustive search. We also ran experiments on synthetic and real data to emphasize the interest of the method in realistic application scenarios.
A particular interest of the method is its ability to cope with very high-dimensional vectors, and even to provide better performance as the dimension of data increases. When combined with dimension reduction methods, it raises interesting perspectives on how to perform approximate nearest neighbor search with limited complexity.
There are many open ideas on how to improve the method further, including but not limited to using smart pooling to directly identify the nearest neighbor without needing to perform an exhaustive search, cascading the process using hierarchical partitioning, or improving the allocation strategies.

Figure 1 .
Figure 1. Evolution of the error rate as a function of k. The other parameters are q = 10, d = 128 and c = 8.

Figure 2 .
Figure 2. Evolution of the error rate as a function of q and for various values of k. The other parameters are d = 128 and c = 8.

Figure 3 .
Figure 3. Evolution of the error rate for a fixed number of stored messages n = 16384 as a function of k (recall that q = n/k). The generated messages are such that d = 128 and c = 8.
This figure supports the fact that the obtained bound is tight, as illustrated by the curve corresponding to the case k = d², for which the error rate appears almost constant.

Figure 4 .
Figure 4. Evolution of the error rate as a function of d. The other parameters are q = 2, c = log₂(d) and k = d^α/10 with various values of α.

Figure 5 .Figure 6 .
Figure 5. Evolution of the error rate as a function of k. The other parameters are q = 10 and d = 64.

Figure 7 .
Figure 7. Evolution of the error rate for a fixed total number of samples as a function of k. The other parameters are n = 16384 and d = 64.

Figure 8 .
Figure 8. Evolution of the error rate as a function of d. In this scenario, we choose k = d^α with various values of α and q = 2.

Figure 9 .
Figure 9. Recall@1 on the MNIST dataset as a function of the relative complexity of the proposed method with regard to an exhaustive search, for various values of k and allocation methods. Each curve is obtained by varying the value of p.

Figure 11 .
Figure 11. Recall@1 on the SIFT1M dataset as a function of the relative complexity of the proposed method with regard to an exhaustive nearest neighbor search, for various values of k. Each curve is obtained by varying the value of p.