1. Introduction
Feature representation and similarity measurement are two critical components of content-based image retrieval (CBIR) [1,2]. Conventionally, the similarity between two images is measured by the Euclidean distance between their corresponding features, and a retrieval system ranks candidate images according to this similarity. However, Zhou et al. have shown that the traditional pairwise Euclidean distance is not adequate to capture the intrinsic similarity relationships between images [3].
Many algorithms [4,5,6,7,8,9,10,11,12] have been proposed to model the geometric structure of the intrinsic data manifold. Among these methods, a graph-based affinity learning algorithm called the diffusion process [13] has shown superior ability; it learns the local structure of the data manifold for re-ranking in image retrieval. Nevertheless, diffusion-based re-ranking usually incurs extra computational overhead and latency. Since the online retrieval stage of a CBIR system typically requires real-time efficiency, such overhead is unfavorable. The offline database indexing stage, on the other hand, has a much looser efficiency requirement. This motivates us to embed the geometric structure of the intrinsic data manifold directly into the image feature representations.
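To make the mechanism concrete, below is a minimal NumPy sketch of one common diffusion variant, iterating f ← αSf + (1 − α)y on a symmetrically normalized kNN affinity graph, in the spirit of [13]. The function name, parameter values, and graph construction are illustrative assumptions, not the exact algorithm used in our pipeline.

```python
import numpy as np

def diffusion_rerank(features, query_idx, k=10, alpha=0.85, iters=30):
    """Minimal sketch of graph diffusion for re-ranking (one common variant).

    features : (n, d) array of L2-normalized image descriptors.
    query_idx: index of the query image within `features`.
    Returns diffused ranking scores for all n images.
    """
    n = features.shape[0]
    # Affinity from cosine similarity, truncated to the k nearest neighbors.
    sim = features @ features.T
    np.fill_diagonal(sim, 0.0)
    W = np.zeros_like(sim)
    for i in range(n):
        nn = np.argsort(-sim[i])[:k]
        W[i, nn] = sim[i, nn]
    W = np.maximum(W, W.T)                    # symmetrize the kNN graph
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1) + 1e-12
    S = W / np.sqrt(np.outer(d, d))
    # Iterative diffusion: f <- alpha * S f + (1 - alpha) * y.
    y = np.zeros(n)
    y[query_idx] = 1.0
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f  # higher score = closer on the data manifold
```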
Traditional learning methods usually focus on carefully designed hand-crafted image features. Global features such as HSV color histograms [14] are used directly for image representation, while local features such as SIFT [15] are first aggregated globally and then used for image representation [16]. Although such feature representations have achieved certain benefits, their rigid processing architectures limit further improvement on the image retrieval task and leave little room to embed the learned manifold information into the feature representations. Led by the seminal work of Krizhevsky et al. [17], the potential of deep learning has been widely explored in the computer vision community. Convolutional Neural Networks (CNNs) provide powerful representations for images and have pushed a considerable number of vision tasks to a new state of the art [18], including image retrieval [19].
Most existing works directly take off-the-shelf CNN models pre-trained on classification tasks as feature extractors for image retrieval [20,21]. However, there is a natural gap between image classification and image retrieval: classification focuses on class-level discrimination, whereas retrieval emphasizes instance-level similarity. Directly applying pre-trained CNN models to image retrieval therefore results in limited performance. Moreover, it discards the trainable nature of CNNs, another essential characteristic behind their success. It is thus a natural choice to fine-tune a pre-trained CNN model to learn more powerful representations that fit the requirements of image retrieval [22,23].
Previous works [24,25,26,27,28,29] have demonstrated that Siamese or triplet networks are better suited to ranking-oriented retrieval tasks, as they learn from pairwise or triplet similarity supervision. For network training, the quality of supervision is one of the critical factors affecting the quality of feature learning and information embedding. Unfortunately, dataset collection and label annotation are labor-intensive, so large-scale hand-collected and hand-labeled training data are not a feasible option. Although some researchers have proposed to generate training image pairs or triplets automatically [22,23], a large pre-collected image pool with strict constraints on image types is still required.
To address the above problems, in this paper we propose to select training image pairs automatically, guided by the geometric structure of the intrinsic data manifold, and to embed this structure into the feature learning process with a specially designed Siamese network. On the one hand, Euclidean distance-based similarity is efficient to compute but usually lacks robustness. On the other hand, the local geometry of the intrinsic data manifold is very effective in improving the reliability of similarity measurement but is not easy to acquire. Our approach combines both merits. Under the supervision of manifold learning, the automatic pair selection process equips the selected pairs with manifold information; fine-tuning the pre-trained model on these pairs transfers the local geometry of the intrinsic data manifold into the newly learned feature embedding. Under the new embedding, Euclidean distance-based similarity measurement is not only efficient to compute but also robust.
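As a purely illustrative example, one plausible selection rule (a sketch under our own assumptions, not the exact criterion used later in the paper) treats pairs that are close under the diffused manifold similarity as positives, and pairs that are Euclidean-close but manifold-far as hard negatives:

```python
def select_training_pairs(euclid_sim, manifold_sim,
                          pos_thresh=0.9, neg_manifold=0.3, neg_euclid=0.7):
    """Illustrative pair selection; the rule and thresholds are assumptions.

    euclid_sim, manifold_sim : (n, n) NumPy similarity matrices under the
    current embedding and after diffusion on the data manifold.
    """
    positives, negatives = [], []
    n = euclid_sim.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if manifold_sim[i, j] > pos_thresh:
                positives.append((i, j))   # confidently close on the manifold
            elif euclid_sim[i, j] > neg_euclid and manifold_sim[i, j] < neg_manifold:
                negatives.append((i, j))   # Euclidean-close but manifold-far
    return positives, negatives
```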
An overview of the proposed method is illustrated in Figure 1. The pipeline starts from an old feature embedding and ends with a new one. Taking the previous end point as a new starting point, we can restart the automatic pair selection and network fine-tuning process, iteratively improving the feature embedding to obtain better representations.
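Schematically, the iteration can be written as follows; the callables passed in are placeholders for the components described above, with hypothetical signatures:

```python
def iterative_refinement(model, images, extract, diffuse, select_pairs, finetune, rounds=3):
    """Sketch of the iterative pipeline; the callables are placeholders
    with hypothetical signatures for the components described in the text."""
    for _ in range(rounds):
        feats = extract(model, images)          # features under the current embedding
        e_sim = feats @ feats.T                 # initial (Euclidean/cosine) similarity
        m_sim = diffuse(e_sim)                  # manifold-aware similarity via diffusion
        pos, neg = select_pairs(e_sim, m_sim)   # e.g., the rule sketched above
        model = finetune(model, images, pos, neg)  # Siamese fine-tuning -> new embedding
    return model
```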
Since our goal is to learn a robust feature embedding with pairwise supervision, under the learned embedding the feature representations of a matching image pair should be close in Euclidean space; a similarity embedding loss is adopted to pull such pairs together. For pairs that are dissimilar in both the original Euclidean space and the learned manifold, we should preserve their relationship; a feature consistency preserving loss is used to prevent dramatic changes in the new feature embedding. Experimental results demonstrate the effectiveness of the proposed method: it significantly outperforms the baseline and surpasses or is on par with state-of-the-art methods on three benchmark datasets.
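As a sketch of how such a two-term objective might look (the exact formulations are given in later sections; the forms and the weight below are assumptions), a PyTorch-style implementation could be:

```python
import torch
import torch.nn.functional as F

def similarity_embedding_loss(feat_a, feat_b):
    """Pull the features of a matched pair together (one plausible form)."""
    fa, fb = F.normalize(feat_a, dim=1), F.normalize(feat_b, dim=1)
    return (fa - fb).pow(2).sum(dim=1).mean()

def feature_consistency_loss(feat_new, feat_old):
    """Keep the new embedding close to the old one, preventing dramatic
    change for images whose relationships should be preserved."""
    return (feat_new - feat_old.detach()).pow(2).sum(dim=1).mean()

def total_loss(pair_a, pair_b, feat_new, feat_old, lam=0.1):
    # lam trades off pair attraction against consistency; the value is illustrative.
    return similarity_embedding_loss(pair_a, pair_b) + lam * feature_consistency_loss(feat_new, feat_old)
```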
The rest of this paper is organized as follows: Section 2 reviews closely related work. In Section 3, Section 4 and Section 5, the details of the proposed method are presented. Experiments are discussed in Section 6, followed by conclusions in Section 7.
2. Related Work
Our work is related to deep learning-based image retrieval and to manifold learning for visual re-ranking. In the following, we briefly review these works and point out the differences between our work and theirs.
Deep learning-based image retrieval. Witnessing the great success of deep learning on a variety of vision tasks, some pioneering works began leveraging CNNs for image retrieval [19,21,30]. Most of these works were based on off-the-shelf neuron activations of pre-trained CNNs. Some early works directly used the activations of fully-connected (FC) layers as image representations [31]. Razavian et al. proposed to leverage the activations of an augmented FC6 layer of AlexNet [17] as the image representation; even with such a rough setting, the reported performance outperformed state-of-the-art SIFT-based methods, which demonstrated the powerful representational capability of CNNs. Subsequent researchers realized that representations generated from convolutional layers are more suitable for image retrieval. Ng et al. leveraged VLAD [32] to encode column features at each spatial location of the feature maps [21]. Gong et al. also used VLAD, aggregating feature maps extracted from local image patches across multiple scales [30]. Babenko et al. proposed to aggregate convolutional feature maps into a compact image representation by sum-pooling [20]. Tolias et al. applied max-pooling over multiple carefully designed regions and integrated the resulting vectors into a single feature vector [33]. In addition, some works combined the two types of features: Li et al. fused convolutional features and FC features into a compact image representation [34].
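For reference, here is a minimal PyTorch sketch of the two simplest aggregation schemes mentioned above; the exact pipelines of [20,33] include further steps (e.g., centering priors, regional pooling, PCA-whitening) that are omitted here:

```python
import torch
import torch.nn.functional as F

def sum_pool_descriptor(fmap):
    """Sum-pool a conv feature map of shape (C, H, W) into a C-dim
    descriptor, in the spirit of [20]; whitening steps are omitted."""
    return F.normalize(fmap.sum(dim=(1, 2)), dim=0)

def max_pool_descriptor(fmap):
    """Global max-pooling (MAC-style); [33] additionally max-pools over
    multiple regions and aggregates the regional vectors."""
    return F.normalize(fmap.amax(dim=(1, 2)), dim=0)
```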
While off-the-shelf models have achieved impressive retrieval results, several works have shown that fine-tuning pre-trained CNNs is a promising direction for image retrieval. Babenko et al. re-trained existing pre-trained models on a dataset of buildings and demonstrated the feasibility and effectiveness of fine-tuning [19]. However, this method kept the classification-based network architecture, which limited further performance gains.
Fine-tuning pre-trained models with a retrieval-oriented objective and dataset makes the learned features more suitable for pairwise similarity measurement. Arandjelovic et al. inserted a back-propagatable VLAD-like layer into a pre-trained model and fine-tuned it with a triplet loss [35]; notably, this work used only weak supervision. Radenović et al. [23] went one step further and trained a retrieval-oriented model with a Siamese network in an unsupervised manner. They employed Structure-from-Motion and a SIFT-based Bag-of-Words (BoW) method to group images of the same building together, and the positive and negative training samples were selected automatically from these groups. Gordo et al. applied a similar pipeline, but with a triplet loss and a SIFT matching-based label generation method [22]. Although these works achieved considerable improvements in image retrieval, they all required a large stand-alone image dataset related to the target dataset, which is challenging to collect.
Different from them, our approach automatically selects relevant training image pairs using the learned local geometry of the intrinsic data manifold, and our goal is to embed this manifold information into a new feature embedding. In addition, we design a dedicated loss function for this purpose.
Manifold learning for visual re-ranking. In image retrieval, manifold learning-based methods are usually applied to refine pairwise image similarity, since the original Euclidean distance-based measurement is not reliable enough; such refinement is especially useful for re-ranking. Bai et al. performed visual re-ranking by collecting the intrinsic manifold information of the data with an algorithm called Sparse Contextual Activation (SCA) [36]. Several works demonstrated that the diffusion process is a promising way to re-rank in retrieval tasks. The diffusion process represents the similarity relationships among images as a graph, where vertices denote images and edges between vertices carry their pairwise similarities; the manifold structure is learned and exploited by iteratively diffusing similarity through part of or the whole graph [13]. Many works along this line have shown excellent performance on image retrieval [37,38,39,40,41]. In this work, we leverage manifold learning to refine the initial Euclidean-based similarity measurement. While manifold learning is one of the critical components of the proposed method, we do not pay extra attention to improving it ourselves; we adopt a state-of-the-art method and embed it into our pipeline to select training image pairs as supervision.
There are also a few works that learn embeddings from manifold information for a variety of tasks. Xu et al. iteratively learned a manifold embedding via the iterative manifold embedding (IME) layer [42]. However, that work focuses only on the projection from the initial feature representation to the iterative manifold embedding representation, which limits its ability to handle unseen images. Iscen et al. [43] proposed to train models with image pairs mined with the help of manifold learning. Compared with that work, we have a different objective and do not need a large stand-alone dataset related to the target dataset.