This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license.

A recently proposed product quantization method is efficient for large-scale approximate nearest neighbor search, but its performance on unstructured vectors is limited. This paper introduces approaches based on residual vector quantization that are appropriate for unstructured vectors. Database vectors are quantized by a residual vector quantizer, and the reproductions are represented by short codes composed of their quantization indices. The Euclidean distance between a query vector and a database vector is approximated by the asymmetric distance between the query and the database vector's reproduction.

Approximate nearest neighbor (ANN) search has been proposed to tackle the curse of dimensionality [

The state-of-the-art approaches such as tree-based methods (e.g., KD-tree [

Recently proposed Hamming embedding methods compress vectors into short codes and approximate the Euclidean distance between two vectors by the Hamming distance between their codes. These methods include Hamming embedding [

The objectives of this paper are comparable to those of [

Our paper is organized as follows: Section 2 presents residual vector quantization, and Section 3 introduces our exhaustive and non-exhaustive search methods based on it. Section 4 evaluates the search performance and compares our approaches with two state-of-the-art methods. Section 5 discusses the results, and Section 6 concludes the paper.

A quantizer q maps a vector x ∈ R^D to a reproduction q(x) = c_i chosen from a finite codebook C = {c_i ∈ R^D}.

The performance of quantizer q is measured by the mean squared error between the input vectors and their reproductions.

Residual vector quantization [ ] quantizes x in several stages. The first-stage quantizer q_1 maps x to q_1(x), and the first-stage residual, e_1 = x − q_1(x), is used for learning the second-stage codebook C_2. The second-stage quantizer q_2 then quantizes e_1 as q_2(e_1).

For an L-stage residual vector quantizer, the reproduction of x is the sum of the stage outputs: x̂ = q_1(x) + q_2(e_1) + … + q_L(e_{L−1}), where e_i denotes the i-th stage residual.
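The staged quantize-and-subtract scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the codebooks, input vector, and function names are toy values chosen by us.

```python
# Minimal sketch of residual vector quantization (RVQ).
# Each stage quantizes the residual left by the previous stage.

def nearest(codebook, x):
    """Index of the codeword with smallest squared Euclidean distance to x."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    return dists.index(min(dists))

def rvq_encode(codebooks, x):
    """Quantize x stage by stage; return one index per stage."""
    indices, residual = [], list(x)
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices

def rvq_decode(codebooks, indices):
    """The reproduction is the sum of the selected stage codewords."""
    x_hat = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, indices):
        x_hat = [a + c for a, c in zip(x_hat, cb[i])]
    return x_hat

# Two stages, each with two 2-D codewords (illustrative values)
C1 = [[0.0, 0.0], [4.0, 4.0]]
C2 = [[0.0, 1.0], [1.0, 0.0]]
x = [4.2, 3.1]
code = rvq_encode([C1, C2], x)      # -> [1, 1]
x_hat = rvq_decode([C1, C2], code)  # -> [5.0, 4.0]
```

Note that the code stores only one index per stage, so the code length grows linearly with the number of stages while the number of distinct reproductions grows exponentially.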

The quantization performance of stage quantizer q_i is determined by its centroids c_{i,j}: the error contributed by centroid c_{i,j} is the sum of squared distances between c_{i,j} and the residual vectors assigned to it.

Consider the optimization problem of finding a vector c_j that minimizes the sum of squared distances to the residual vectors assigned to it.

By differentiating the objective function with respect to c_j and setting the derivative to zero, the optimal c_j is found to be the mean of the residual vectors assigned to it.

With the observations that

In [

Asymmetric distance reduces the quantization noise and improves the search quality [ ]: the database vector x is quantized to its reproduction x̂ = c_{1,u_1} + … + c_{L,u_L}, the sum of the selected stage centroids, while the query vector y is left unquantized, and d(y, x) is approximated by d(y, x̂).

The squared asymmetric distance can then be efficiently estimated by several table lookups:

If we only consider the order of the distances, the term ||y||^2 is the same for every database vector and can be omitted from the computation.
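The lookup-based evaluation can be sketched as follows, using the standard expansion d^2(y, x̂) = ||y||^2 − 2⟨y, x̂⟩ + ||x̂||^2, where the inner product decomposes into one table lookup per stage. The codebooks and values are toy examples of ours, not the paper's:

```python
# Sketch of asymmetric distance computation for RVQ codes.
# Per query, precompute <y, c> for every stage codeword; then each
# database code costs one lookup per stage plus its stored ||x_hat||^2.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tables(codebooks, y):
    """For each stage, a table of inner products <y, c> over its codewords."""
    return [[dot(y, c) for c in cb] for cb in codebooks]

def asym_dist_sq(tables, norm_y_sq, code, norm_xhat_sq):
    """Estimate ||y - x_hat||^2 with one table lookup per stage."""
    inner = sum(tables[s][i] for s, i in enumerate(code))
    return norm_y_sq - 2.0 * inner + norm_xhat_sq

C1 = [[0.0, 0.0], [4.0, 4.0]]
C2 = [[0.0, 1.0], [1.0, 0.0]]
y = [4.0, 3.0]
tables = build_tables([C1, C2], y)
# a database vector encoded as [1, 1] has x_hat = [5.0, 4.0], ||x_hat||^2 = 41
d2 = asym_dist_sq(tables, dot(y, y), [1, 1], 41.0)  # -> 2.0, the exact value
```

Since ||y||^2 is shared by every database vector, dropping it (as noted above) leaves the ranking unchanged.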

Exhaustive search has to scan the quantization codes of all database vectors. In problems such as bag-of-features-based large-scale image retrieval, billions of images are each represented by hundreds of local feature vectors, and scanning the feature vector database is prohibitive even with fast asymmetric distance computation.

In [

Our exhaustive search approach uses only the most accurate reproduction x̂^{(L)} to approximate a database vector, whereas the first L_1 stage quantization outputs generate a rough approximation:

The rough asymmetric distances between database vectors and the target vector are then evaluated by table lookups for coarse search:

The database vectors with large rough distances are pruned, and for the remaining vectors more accurate distances to the target vector are evaluated using their most accurate approximations as in

The total number of possible rough approximations is K^{L_1}, so an inverted file system is used to improve the search performance. Each inverted list corresponds to one possible rough approximation. When a database vector is encoded, its first L_1 indices determine which inverted list it is inserted into; these L_1 indices are then discarded, and only the last L_2 = L − L_1 indices and the vector id are stored in the list. A query vector first evaluates its distances to the K^{L_1} possible rough approximations by

Only L_2 table lookups per vector are involved. The term ||x̂^{(L)}||^2 − ||x̂^{(L_1)}||^2 is pre-calculated and stored in the offline quantization stage. With fast table lookups and the distance update scheme, both the coarse and the fine search are efficient.
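The coarse-then-fine scheme described above can be sketched as follows. This is a simplified illustration under our own assumptions: L_1 = 1, exact distances in the fine step instead of the paper's table lookups, and toy codebooks and data.

```python
# Sketch of an inverted-file RVQ search (IVFRVQ-like):
# the coarse index selects an inverted list; only fine-stage
# indices and the vector id are stored inside the list.

from collections import defaultdict

def nearest(codebook, x):
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))

def build_index(coarse_cb, fine_cbs, database):
    """Encode each vector; its coarse index is implicit in the list it joins."""
    lists = defaultdict(list)
    for vid, x in enumerate(database):
        j = nearest(coarse_cb, x)
        residual = [a - b for a, b in zip(x, coarse_cb[j])]
        code = []
        for cb in fine_cbs:
            i = nearest(cb, residual)
            code.append(i)
            residual = [a - b for a, b in zip(residual, cb[i])]
        lists[j].append((vid, code))
    return lists

def search(coarse_cb, fine_cbs, lists, y, n_probe=1):
    """Coarse step: rank the inverted lists by distance to y.
    Fine step: reconstruct each stored code and refine the distance."""
    ranked = sorted(range(len(coarse_cb)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(y, coarse_cb[j])))
    best_d, best_id = float("inf"), None
    for j in ranked[:n_probe]:
        for vid, code in lists[j]:
            x_hat = list(coarse_cb[j])
            for cb, i in zip(fine_cbs, code):
                x_hat = [a + b for a, b in zip(x_hat, cb[i])]
            d = sum((a - b) ** 2 for a, b in zip(y, x_hat))
            if d < best_d:
                best_d, best_id = d, vid
    return best_id

coarse = [[0.0, 0.0], [10.0, 10.0]]   # one coarse stage (L_1 = 1)
fine = [[[0.0, 0.0], [-1.0, -1.0]]]   # one fine stage
db = [[0.5, 0.5], [10.2, 9.8], [9.0, 9.0]]
index = build_index(coarse, fine, db)
nn = search(coarse, fine, index, [10.2, 9.8])  # -> 1
```

Only the probed lists are scanned, which is the source of the speedup over exhaustive search.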

Three publicly available datasets were used to evaluate the performance of ANN methods: the structured SIFT descriptor dataset [

The SIFT dataset and the GIST dataset each have three subsets: a learning set, a database set, and a query set. The learning set is used for learning the model and evaluating quantization performance; the database and query sets are used for evaluating ANN search performance. For the SIFT dataset, the learning set is extracted from Flickr images [

This section investigates the quantization performance of our approach by evaluating the influence of the parameters on the quantization error.

The performance of our approaches is measured by two metrics: recall@R and the ratio of distance errors (RDE). Recall@R is defined in [ ] as the proportion of query vectors for which the true nearest neighbor is ranked within the first R retrieved vectors. RDE measures the relative error between the distance to the retrieved neighbor and the distance to the true nearest neighbor.
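The two metrics can be computed as sketched below. The RDE formula here, the average relative gap of nearest-neighbor distances, is our reading of "ratio of distance errors"; the data values are made up for illustration.

```python
# Sketch of the two evaluation metrics: recall@R and RDE.

def recall_at_r(true_nn, ranked_lists, R):
    """Fraction of queries whose true nearest neighbor id appears
    among the first R returned ids."""
    hits = sum(1 for t, ranked in zip(true_nn, ranked_lists) if t in ranked[:R])
    return hits / len(true_nn)

def rde(approx_dists, true_dists):
    """Average relative gap between the distance to the returned neighbor
    and the distance to the true nearest neighbor (0 means exact search)."""
    return sum((a - t) / t for a, t in zip(approx_dists, true_dists)) / len(true_dists)

true_nn = [3, 7, 1]                          # true NN id per query
ranked = [[3, 5, 2], [4, 7, 9], [8, 0, 6]]   # ids returned per query
r2 = recall_at_r(true_nn, ranked, 2)         # 2 of 3 queries hit -> 0.666...
err = rde([1.1, 2.0, 1.5], [1.0, 2.0, 1.0])  # -> 0.2
```

Recall@R rewards getting the right neighbor anywhere in the short list, while RDE also credits near-misses whose distance is close to optimal.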

L_1 ∈ {1, 2} and L_2 ∈ {1, 2, 4, 8, 16} are the numbers of stage quantizers used for coarse search and fine search, respectively; the number of inverted lists is K^{L_1}. The code length is L_2 log_2 K bits.

Accuracy is higher with L_1 = 2, but the frequent inverted list accesses reduce the search speed.

In this section we compare our approach with two state-of-the-art methods: spectral hashing (SH) and product quantization (PQ). The performance of product quantization is sensitive to the grouping order of the vector components: natural product quantization groups consecutive components, while structured product quantization groups related components together based on prior knowledge of the vector's structure. Experimental results in [ ] … For our approaches, we set L_1 = 1, L_2 = 8,

The VLAD dataset is used for evaluating the accuracy of ANN methods on unstructured vectors. The performance is measured by mean average precision (mAP) [

The advantage of residual vector quantization is that it quantizes the whole vector in the original space. Product quantization is based on the assumption that the subspaces are statistically mutually independent, so that the original space can be represented by the product of these subspaces. But vectors in real data do not all meet this assumption. Moreover, the vector's structure determines the quantization parameters, which makes product quantization inflexible. In contrast, residual vector quantization processes the whole vector in the original space, and its parameters are not limited by the structure of the vector.

Residual vector quantization can be regarded as a simplified hierarchical k-means (HKM). When generating a new quantization level, HKM performs k-means clustering within each cluster of the previous level, producing a new partition for each such cluster. In contrast, residual vector quantization generates a single global partition and embeds it into every cluster of the previous level. This is similar to the Hamming embedding (HE) method, although HE involves only two levels and uses an orthogonal partition in each cluster. The simplified structure makes it possible to use more quantization levels, each with more centroids, for a fine division of the space. The technique of transforming a tree-like structure into a flat structure has also been used in the ferns classifier [
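A back-of-envelope count makes the storage advantage of the flat structure concrete. The values K = 256 and L = 4 are illustrative choices of ours, not the paper's settings:

```python
# Codebook storage: RVQ keeps one flat codebook of K centroids per stage,
# while HKM keeps a separate child codebook under every internal node.
K, L = 256, 4

rvq_stored = L * K                                  # centroids stored by RVQ
hkm_stored = sum(K ** i for i in range(1, L + 1))   # centroids stored by HKM
cells = K ** L                                      # distinct cells both induce

print(rvq_stored, hkm_stored, cells)
# RVQ: 1,024 centroids; HKM: ~4.3 billion centroids for the same K^L cells
```

Both structures distinguish K^L cells, but RVQ reaches that resolution with storage linear in L, which is what allows more levels and more centroids per level.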

Processing vectors in the original high-dimensional space has negative implications for complexity. Operations such as finding the nearest centroid or generating residual vectors are performed in the high-dimensional space, whereas product quantization processes subvectors in low-dimensional subspaces. The memory usage of the codebook is negligible compared to the memory occupied by the coded database, and the cost of computing the lookup tables is negligible compared to the cost of scanning the database's codes. The drawback is that the computational complexities of the learning and quantization stages of residual vector quantization are a linear factor larger than those of product quantization. Our future work will focus on reducing the complexities of the learning and quantization stages.

We have introduced residual vector quantization for approximate nearest neighbor search. Two efficient search approaches are proposed based on residual vector quantization, and the non-exhaustive search method significantly improves the performance. We evaluate the performance on two structured datasets and one unstructured dataset, and compare our approaches with spectral hashing and product quantization. Our approaches obtain the best results in terms of the trade-off between accuracy, speed, and memory usage. Results on the structured datasets show that our approaches slightly outperform product quantization; for unstructured data, our approaches significantly outperform product quantization.

The authors would like to thank the anonymous reviewers for their valuable comments. This research is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 60903095 and by the Postdoctoral Science Foundation Funded Project of China under Grant No. 20080440941.

Block diagram of two-stage residual vector quantization.

Quantization error associated with

Exhaustive search accuracy. (

RDE for SIFT dataset, exhaustive search method. (

Search accuracy of non-exhaustive search. (

Comparison of search accuracies obtained by spectral hashing, product quantization methods and our approaches. (

Dataset information.

| | SIFT | GIST | VLAD |
|---|---|---|---|
| Dimension of descriptor | 128 | 960 | 128 |
| Size of learning set | 100,000 | 500,000 | 1,491 |
| Size of database set | 1,000,000 | 1,000,000 | 1,491 |
| Size of query set | 10,000 | 1,000 | 500 |

Comparison of RVQ and IVFRVQ on SIFT dataset.

| Method | L_1, L_2 | Search time | Scanned vectors | Accuracy |
|---|---|---|---|---|
| RVQ | | 34 | 1,000,000 | 0.96 |
| IVFRVQ | | 0.65 | 4,261 | 0.56 |
| IVFRVQ | | | | |
| IVFRVQ | | 3.2 | 1,682 | 0.80 |
| IVFRVQ | | 15.1 | 9,692 | 0.96 |

Comparison of RVQ and IVFRVQ on GIST dataset.

| Method | L_1, L_2 | Search time | Scanned vectors | Accuracy |
|---|---|---|---|---|
| RVQ | | 36.1 | 1,000,000 | 0.67 |
| IVFRVQ | | 2.9 | 5,205 | 0.36 |
| IVFRVQ | | | | |
| IVFRVQ | | 5.7 | 2,423 | 0.55 |
| IVFRVQ | | 20.5 | 16,512 | 0.74 |

Comparison with state of the art on VLAD dataset.

| Method | | | |
|---|---|---|---|
| SH | 0.255 | 0.349 | 0.397 |
| PQ | 0.337 | 0.409 | 0.457 |
| RVQ | | | |

Search speed for 64-bit code and different methods (SIFT dataset).

| Method | Search time | Scanned vectors | Accuracy |
|---|---|---|---|
| RVQ | 34 | 1,000,000 | 0.96 |
| IVFRVQ (L_1 = 1, L_2 = 8) | | 33,602 | |
| PQ | 33.7 | 1,000,000 | 0.93 |
| IVFPQ | 3 | 9,102 | 0.87 |
| IVFPQ | 7.3 | 17,621 | 0.93 |
| SH | 35.3 | 1,000,000 | 0.53 |