Communication

Fine-Grained Image Retrieval via Object Localization

School of Electronic and Information Engineering, Soochow University, Suzhou 215006, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(10), 2193; https://doi.org/10.3390/electronics12102193
Submission received: 8 April 2023 / Revised: 5 May 2023 / Accepted: 10 May 2023 / Published: 11 May 2023

Abstract

In this paper, a network consisting of an object localization module and a discriminative feature extraction module is designed for fine-grained image retrieval (FGIR). In order to reduce the interference of complex backgrounds, the object localization module is introduced into the network before feature extraction. By selecting the convolutional feature descriptors, the main object is separated from the background, and thus, most of the interference is filtered out. Further, in order to improve the overall performance of the network, a discriminative filter bank is introduced into the network as the local feature detector. Hence, the local discriminative features can be extracted directly from the original feature map. The experimental results based on the CUB-200-2011 and Cars-196 datasets demonstrate that the proposed method can improve the performance of FGIR.

1. Introduction

General image retrieval focuses on retrieving images that contain content similar to the query image. In contrast, the purpose of fine-grained image retrieval (FGIR) is to retrieve images from a database that belong to the same subcategory as the query image [1,2]. Because images of the same subcategory can differ greatly in pose, background, and scale, while images of different subcategories can look very similar, the main challenge of FGIR is the small inter-class variance combined with the large intra-class variance. To meet this challenge, it is important to extract more discriminative features that capture the subtle differences between subcategories.
In recent years, FGIR has attracted great attention due to its wide application in biological research [3,4,5], and it can also be extended to other areas, such as optical measurements [6]. In general image retrieval, shallow features, such as image texture, were traditionally used to describe images. Convolutional neural networks (CNNs), such as VGGNet [7], GoogLeNet [8], and ResNet [9], are now widely used to extract features for image retrieval. Moreover, pre-trained CNN models are used to improve coding efficiency, and fine-tuning CNN models for generic instance image retrieval can achieve better results [10]. However, the above-mentioned methods achieve good performance only on datasets with large variances between categories; hence, they are not suitable for FGIR. It is also common to train CNNs with metric learning loss functions, such as triplet loss [11], N-pair loss [12], CRL loss [13], and Proxy-NCA loss [3]. However, training CNNs with a metric learning loss function alone cannot take full advantage of the local features.
In this paper, the object localization module is used to predict the position of the objects and remove interference from the background. Then, the discriminative filter bank is introduced as the local feature detector to further improve the performance of FGIR. The experimental results demonstrate that the proposed method can significantly improve the performance of FGIR.

2. Method

In this paper, the designed network contains two stages, as shown in Figure 1. The first stage obtains the object regions from the original images. The second stage further extracts features that discriminate the object from other categories.
In the first stage, the original image is input into the CNN based on ResNet50 to obtain the feature map. Then, the attention object location module (AOLM) is used to locate the object regions, which are subsequently cropped from the original image to remove interference from the background.
In the second stage, the cropped image is used as the input, and the designed network extracts the global and local features based on ResNet50. The final global and local vectors for retrieval are obtained with 1 × 1 convolutional filters. The weights of the convolution layers in the first and second stages are shared; therefore, the network is robust to images of different sizes. Although it includes two stages, the designed network is still trained end-to-end.
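To make the two-stage pipeline concrete, the following PyTorch sketch outlines how the shared-backbone forward pass could be organized. It is a minimal illustration, not the authors' implementation: `aolm_locate` and `head` are hypothetical placeholders for the AOLM bounding-box step (Section 2.1) and the global/local feature heads (Section 2.2), and the 448 × 448 crop size is an assumption.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet50

# Shared ResNet-50 backbone used in both stages (in practice, the
# ImageNet-pretrained weights would be loaded here).
backbone = torch.nn.Sequential(*list(resnet50().children())[:-2])  # keep conv feature maps

def two_stage_forward(image, aolm_locate, head):
    # Stage 1: locate the object on the raw image and crop it out.
    feats_raw = backbone(image.unsqueeze(0))             # (1, C, H, W)
    x0, y0, x1, y1 = aolm_locate(feats_raw, image.shape[-2:])
    cropped = TF.resized_crop(image, y0, x0, y1 - y0, x1 - x0, size=[448, 448])

    # Stage 2: extract global and local features from the cropped object
    # with the same (weight-shared) backbone.
    feats_obj = backbone(cropped.unsqueeze(0))            # (1, C, H', W')
    return head(feats_obj)                                # global + local embeddings
```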
The fusion of global and local features is implemented based on Equation (1), as follows:
$$\mathrm{cos\_sim} = (1 - \theta) \times \mathrm{cos\_sim}_g + \theta \times \mathrm{cos\_sim}_p \qquad (1)$$
where $\mathrm{cos\_sim}_g$ denotes the global similarity, $\mathrm{cos\_sim}_p$ denotes the local similarity, and $\theta \in [0, 1]$ is the fusion weight. When $\theta$ is 0, only the global features are used; when $\theta$ is 1, only the local features are used.
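As a concrete illustration of Equation (1), the following sketch fuses the global and local cosine similarities for ranking a gallery. The function name and the random test data are illustrative; θ = 0.8 follows the value used in the experiments (Section 3).

```python
import torch
import torch.nn.functional as F

def fused_similarity(query_g, query_p, gallery_g, gallery_p, theta=0.8):
    """Fuse global and local cosine similarities as in Equation (1).

    query_g / gallery_g : global embeddings, shapes (D,) and (N, D)
    query_p / gallery_p : local embeddings,  shapes (D,) and (N, D)
    theta               : weight of the local similarity (0.8 in the experiments)
    """
    cos_sim_g = F.cosine_similarity(query_g.unsqueeze(0), gallery_g, dim=1)  # (N,)
    cos_sim_p = F.cosine_similarity(query_p.unsqueeze(0), gallery_p, dim=1)  # (N,)
    return (1 - theta) * cos_sim_g + theta * cos_sim_p

# Rank the gallery by fused similarity (illustrative random data).
q_g, q_p = torch.randn(512), torch.randn(512)
g_g, g_p = torch.randn(100, 512), torch.randn(100, 512)
ranking = fused_similarity(q_g, q_p, g_g, g_p).argsort(descending=True)
```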

2.1. Design of the Object Localization Module

Before extracting the discriminative features, it is important to locate objects accurately to remove interference from the background. In this paper, AOLM is realized based on the selective convolutional descriptor aggregation (SCDA) method [14].
First, the activation map is obtained as in Equation (2):
$$A = \sum_{i=0}^{C-1} f_i \qquad (2)$$
where $A \in \mathbb{R}^{W \times H}$ represents the activation map with the size of $W \times H$, $f_i$ is the feature map of channel $i$, and $C$ denotes the number of channels.
If many channels respond strongly to the same regions, the object is likely to be located there.
The object regions can be located according to the threshold $a$, which is calculated with Equation (3):
$$a = \frac{1}{W \times H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} A(x, y) \qquad (3)$$
where $(x, y)$ denotes a position in the activation map.
The mask $M$, which is used to locate the object regions, is obtained as follows:
$$M(x, y) = \begin{cases} 1 & \text{if } A(x, y) > a \\ 0 & \text{otherwise} \end{cases}$$
The advantage of using the SCDA module is that it can accurately locate the object with only a pre-trained ResNet50. Further, good performance can be achieved without introducing new training parameters.
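The following is a minimal PyTorch sketch of the SCDA-style localization described by Equations (2) and (3) and the mask above. The `mask_to_bbox` helper is a simplified assumption: the original SCDA method keeps only the largest connected component of the mask, which is omitted here for brevity.

```python
import torch

def scda_mask(feature_map):
    """Compute the SCDA-style object mask from a conv feature map.

    feature_map : tensor of shape (C, H, W) from the backbone (e.g. ResNet-50 conv5).
    Returns a binary mask of shape (H, W): 1 where the summed activation
    exceeds its spatial mean (Equations (2) and (3)), 0 elsewhere.
    """
    activation = feature_map.sum(dim=0)       # A = sum_i f_i, shape (H, W)
    threshold = activation.mean()             # a = spatial mean of A
    return (activation > threshold).float()   # M(x, y)

def mask_to_bbox(mask):
    """Simplified bounding box from the mask's nonzero extent
    (SCDA additionally keeps only the largest connected component)."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

# AOLM-style combination (see Figure 4): AND the masks from two conv layers
# to filter out background regions that only one layer responds to.
# mask = scda_mask(conv5_2_feats) * scda_mask(conv5_3_feats)
```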

2.2. Design of the Networks for Extracting the Global and Local Features

It is very important to extract local discriminative features for FGIR. The local features describe fine details at different levels. A 1 × 1 convolutional filter is used as a local feature detector after training. In this paper, the original image is input into the convolutional neural network to obtain a feature map of size C × W × H, as shown in Figure 2. Each C × 1 × 1 vector represents a patch of the original image. Since a 1 × 1 convolutional filter can extract the local features of a patch, the information of several parts can be obtained with several 1 × 1 convolutional filters.
In this paper, the network used to extract the global and local features is based on ResNet50, as illustrated in Figure 3. The overall appearance of fine-grained images also provides some discriminative features. The one-dimensional global feature vector Fg is obtained through global max pooling. Then, Fg is fed into a 1 × 1 convolutional layer to obtain the global features. The local features are obtained with KM convolutional kernels of size C × 1 × 1. In Figure 3, the blue modules focus on the global features, the orange modules are used to detect the local features, and the gray modules are used for dimension reduction. Assuming that the dataset has K categories and M discriminative regions per category, the total number of filters is KM, and a KM-dimensional vector is obtained after global max pooling.
To supervise the filters with the softmax loss so that they become discriminative, cross-channel pooling is applied to convert the KM-dimensional vector into a K-dimensional vector. The embedding features for retrieval are obtained by reducing the dimension of the features after global max pooling.
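The sketch below illustrates one possible PyTorch implementation of this local branch: a bank of K × M 1 × 1 filters, global max pooling, cross-channel pooling to K class scores, and a 1 × 1 reduction to the retrieval embedding. The module name, the default layer sizes, and the exact placement of the reduction layer are assumptions for illustration rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class PatchDetector(nn.Module):
    """Sketch of the local branch: K*M 1x1 filters act as part detectors.
    Cross-channel (average) pooling over each group of M responses gives K
    class scores for the softmax supervision, while a 1x1 reduction layer
    produces the local embedding used for retrieval."""

    def __init__(self, in_channels=2048, num_classes=200, parts_per_class=4, embed_dim=512):
        super().__init__()
        self.K, self.M = num_classes, parts_per_class
        self.filter_bank = nn.Conv2d(in_channels, self.K * self.M, kernel_size=1)
        self.reduce = nn.Conv2d(self.K * self.M, embed_dim, kernel_size=1)

    def forward(self, feats):                      # feats: (B, C, H, W)
        responses = self.filter_bank(feats)        # (B, K*M, H, W)
        pooled = responses.amax(dim=(2, 3))        # global max pooling -> (B, K*M)

        # Cross-channel pooling: average the M responses belonging to each class.
        logits = pooled.view(-1, self.K, self.M).mean(dim=2)    # (B, K)

        # Local embedding for retrieval: 1x1 reduction, then global max pooling.
        embedding = self.reduce(responses).amax(dim=(2, 3))     # (B, embed_dim)
        return logits, embedding
```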
Proxy-Anchor loss [17] is used as the metric learning loss function to shorten the distance between feature vectors of the same category and lengthen the distance between feature vectors of different categories. The loss terms are given as follows:
$$L_{\mathrm{raw}} = -\frac{1}{n} \sum_{i=1}^{n} \log P_r^{i}(c_i)$$
$$L_{\mathrm{object}} = -\frac{1}{n} \sum_{i=1}^{n} \log P_o^{i}(c_i)$$
$$L_{\mathrm{part}} = -\frac{1}{n} \sum_{i=1}^{n} \log P_p^{i}(c_i)$$
$$L_{\mathrm{metric}} = L_{\text{Proxy-Anchor}}^{g} + L_{\text{Proxy-Anchor}}^{p}$$
where $n$ denotes the training batch size, $c_i$ is the true label of the $i$th image, and $P_r^{i}(c_i)$, $P_o^{i}(c_i)$, and $P_p^{i}(c_i)$ denote the probability of $c_i$ obtained with the softmax layer for the global features of the original images, the global features of the objects, and the local features of the objects, respectively. $L_{\text{Proxy-Anchor}}^{g}$ denotes the metric learning loss of the global branch, and $L_{\text{Proxy-Anchor}}^{p}$ denotes the metric learning loss of the local branch. The total loss function is obtained as follows:
$$L = L_{\mathrm{raw}} + L_{\mathrm{object}} + L_{\mathrm{part}} + L_{\mathrm{metric}}$$
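A compact sketch of the total loss is given below, combining the three cross-entropy terms with Proxy-Anchor losses [17] on the global and local embeddings. The Proxy-Anchor implementation is a simplified re-implementation of the published formulation; the hyperparameters α = 32 and δ = 0.1 are the defaults from that paper, not values stated here.

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Simplified Proxy-Anchor loss [17] with learnable class proxies (K, D)."""
    x = F.normalize(embeddings, dim=1)            # (B, D)
    p = F.normalize(proxies, dim=1)               # (K, D)
    sim = x @ p.t()                               # cosine similarities, (B, K)

    pos_mask = F.one_hot(labels, p.size(0)).bool()
    pos_term = torch.exp(-alpha * (sim - delta)) * pos_mask
    neg_term = torch.exp(alpha * (sim + delta)) * (~pos_mask)

    with_pos = pos_mask.any(dim=0)                # proxies present in the batch
    loss_pos = torch.log1p(pos_term.sum(dim=0)[with_pos]).sum() / with_pos.sum().clamp(min=1)
    loss_neg = torch.log1p(neg_term.sum(dim=0)).sum() / p.size(0)
    return loss_pos + loss_neg

def total_loss(logits_raw, logits_obj, logits_part, emb_g, emb_p,
               labels, proxies_g, proxies_p):
    """L = L_raw + L_object + L_part + L_metric (cross-entropy + Proxy-Anchor)."""
    l_raw = F.cross_entropy(logits_raw, labels)
    l_object = F.cross_entropy(logits_obj, labels)
    l_part = F.cross_entropy(logits_part, labels)
    l_metric = (proxy_anchor_loss(emb_g, labels, proxies_g)
                + proxy_anchor_loss(emb_p, labels, proxies_p))
    return l_raw + l_object + l_part + l_metric
```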

3. Results and Discussion

In the following, the batch size is set to 30, and a Stochastic Gradient Descent (SGD) optimizer is used in the experiments. θ is set to 0.8 in the experiments. The feature dimension is set to 512 for every dataset.
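For reference, this setup can be captured in a short configuration sketch. The learning rate, momentum, and weight decay are not reported in the paper, so the values below are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Experimental setup from Section 3: batch size 30, SGD optimizer,
# 512-dimensional embeddings, theta = 0.8 in Equation (1).
BATCH_SIZE = 30
EMBED_DIM = 512
THETA = 0.8

def make_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # lr / momentum / weight decay are assumed values, not from the paper.
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
```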
In order to show the effect of the object localization module, the results of object localization are visualized in Figure 4. The original images are shown in Figure 4a. Figure 4b illustrates the object regions located based on the Conv5_2 convolution layer of ResNet50, and Figure 4c shows the object regions located based on the Conv5_3 layer. Figure 4d shows the results of the AOLM module. From Figure 4b,c, we can notice that the located object regions still contain background regions and that the two results do not overlap completely. By combining the two masks with a logical AND, the non-overlapping background regions can be filtered out. Hence, the localization results can be improved.
To test the validity of each module of the network, we conducted ablation experiments with different module combinations on the widely used CUB-200-2011 dataset. Table 1 lists the performance of the different combinations. In Table 1, AOLM denotes the object localization module, P denotes the local branch for extracting discriminative features, and G denotes the global branch. As presented in Table 1, the network with the P + G modules achieves better performance than the network with the P module alone. Therefore, combining global and local features achieves better results than using only global or only local features. Compared with the P + G network, the results are further improved by the AOLM + P + G network, which demonstrates the effectiveness of the AOLM module. The AOLM + G network may ignore local fine features, and the AOLM + P network may ignore global features. Overall, the network with the AOLM + P + G modules achieves the best performance.
To further validate the effectiveness of the proposed method, we conducted experiments using the different methods based on the CUB-200-2011 and Cars-196 datasets for image retrieval. Table 2 lists the performance of the state-of-the-art methods as well as our previous method using the network with three branches (NTB) in terms of Recall@K for quantitative comparison. Figure 5 shows a comparison in terms of the precision of the different methods. Compared with the existing image retrieval methods, the retrieval results can be improved with the proposed method. Thus, we can see that the performance of retrieval can be enhanced after removing the interference from the background. Furthermore, extracting comprehensive fine features of the objects will also be helpful for FGIR.
Furthermore, the proposed method was also validated based on the Cars-196 dataset. In Table 3, we present the performance of different methods in terms of Recall@K on the Cars-196 dataset. In Figure 6, we show the comparison in terms of precision for the different methods. From Table 3 and Figure 6, we can also see that the proposed method can improve the retrieval results when compared with other existing methods.
In fact, the dimension of the embedded feature vectors is also an important factor that may affect image retrieval performance. Hence, we further analyze the impact of the dimension of the embedded feature vectors on the performance of image retrieval in terms of Recall@1 as well as the retrieval time.
Figure 7a,b describe the variation of Recall@1 with the dimension of the embedded feature vectors on the CUB-200-2011 and Cars-196 datasets, respectively. As can be seen in Figure 7, Recall@1 increases as the dimension of the embedded feature vectors increases. However, once the dimension reaches 512, the increase in Recall@1 becomes quite slow. Figure 8a,b show the variation of the retrieval time with the dimension of the embedded feature vectors on the CUB-200-2011 and Cars-196 datasets, respectively. From Figure 8, we can notice that the retrieval time increases significantly as the dimension grows beyond 512. Thus, increasing the dimension beyond 512 yields only a limited performance improvement at the cost of much longer retrieval time. Based on the above analysis, the dimension of the embedded feature vectors is set to 512 for image retrieval.

4. Conclusions

In this paper, the object localization module is used to predict the position of the objects and remove interference from the background for FGIR. To further improve the performance of the network, the discriminative filter bank is introduced as the local feature detector. To validate the effectiveness of the proposed method, experiments were performed on two datasets (CUB-200-2011 and Cars-196). The experimental results demonstrate that the proposed method can improve the performance of FGIR.

Author Contributions

Conceptualization, R.W. and W.Z.; methodology, R.W. and W.Z.; software, R.W.; validation, R.W., W.Z. and J.W.; formal analysis, W.Z. and J.W.; investigation, W.Z.; resources, J.W.; data curation, W.Z.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z.; visualization, R.W.; supervision, W.Z. and J.W.; project administration, W.Z. and J.W.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Suzhou Science and Technology Planning Project (Grant No. SKJY2021044), the Natural Science Foundation of Jiangsu Province, China (Grant No. BK20130324, BK20171249), Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP) (Grant No. 20123201120009), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 12KJB510029).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, X.; Song, Y.; Aodha, O.M.; Wu, J.; Peng, Y.; Tang, J.; Yang, J.; Belongie, S. Fine-grained image analysis with deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8927–8948.
  2. Wei, X.; Xie, C.; Wu, J.; Shen, C. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognit. 2018, 76, 704–714.
  3. Movshovitz-Attias, Y.; Toshev, A.; Leung, T.K.; Ioffe, S.; Singh, S. No Fuss Distance Metric Learning Using Proxies. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 360–368.
  4. Opitz, M.; Waltner, G.; Possegger, H.; Bischof, H. BIER-Boosting Independent Embeddings Robustly. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5189–5198.
  5. Xuan, H.; Souvenir, R.; Pless, R. Deep Randomized Ensembles for Metric Learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 723–734.
  6. Akhmetzianov, A.V.; Kushner, A.G.; Lychagin, V.V. Multiphase Filtration in Anisotropic Porous Media. In Proceedings of the 2018 14th International Conference ‘Stability and Oscillations of Nonlinear Control Systems’ (STAB), Moscow, Russia, 30 May–1 June 2018; pp. 1–2.
  7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
  8. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  10. Babenko, A.; Slesarev, A.; Chigorin, A.; Lempitsky, V. Neural Codes for Image Retrieval. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 584–599.
  11. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
  12. Sohn, K. Improved Deep Metric Learning with Multi-Class N-Pair Loss Objective. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1857–1865.
  13. Zheng, X.; Ji, R.; Sun, X.; Wu, Y.; Huang, F.; Yang, Y. Centralized Ranking Loss with Weakly Supervised Localization for Fine-Grained Object Retrieval. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1226–1233.
  14. Wei, X.-S.; Luo, J.-H.; Wu, J.; Zhou, Z.-H. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans. Image Process. 2017, 26, 2868–2881.
  15. Zheng, X.; Ji, R.; Sun, X.; Zhang, B.; Wu, Y.; Huang, F. Towards Optimal Fine Grained Retrieval via Decorrelated Centralized Loss with Normalize-Scale Layer. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9291–9298.
  16. Zeng, X.; Zhang, Y.; Wang, X.; Chen, K.; Li, D.; Yang, W. Fine-grained image retrieval via piecewise cross entropy loss. Image Vis. Comput. 2020, 93, 103820.
  17. Kim, S.; Kim, D.; Cho, M.; Kwak, S. Proxy Anchor Loss for Deep Metric Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3235–3244.
  18. Cao, G.; Zhu, Y.; Lu, X. Fine-Grained Image Retrieval via Multiple Part-Level Feature Ensemble. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
  19. Wang, R.; Zou, W.; Lin, X.; Wang, J. Learning Discriminative Features for Fine-Grained Image Retrieval. In Proceedings of the IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; pp. 1915–1919.
Figure 1. The overall framework of the network.
Figure 2. The illustration of the patch detector.
Figure 3. Overview of the patch detector.
Figure 4. The result maps of object localization. (a) original images, (b) results of object localization from Conv5_2, (c) results of object localization from Conv5_3, (d) results of object localization with AOLM.
Figure 5. Comparison of precision for different methods on CUB-200-2011 dataset.
Figure 6. Comparison of precision for different methods on Cars-196 dataset.
Figure 7. Impact of different embedding dimensions on Recall@1. (a) CUB-200-2011 dataset, (b) Cars-196 dataset.
Figure 8. Impact of different embedding dimensions on retrieval time. (a) CUB-200-2011 dataset, (b) Cars-196 dataset.
Table 1. Ablation experimental results on CUB-200-2011 dataset.

Model           Recall@1   Recall@2   Recall@4   Recall@8
P               0.7370     0.8293     0.8928     0.9343
P + G           0.7591     0.8385     0.8942     0.9367
AOLM + P + G    0.7762     0.8600     0.9104     0.9441
AOLM + P        0.7608     0.8457     0.9026     0.9392
AOLM + G        0.7227     0.8125     0.8137     0.9178
Table 2. Comparison of Recall@K of different methods on CUB-200-2011 dataset. The best results are written in bold.

Method              Recall@1   Recall@2   Recall@4   Recall@8
SCDA [14]           62.6       74.2       83.2       90.1
CRL-WSL [13]        65.9       76.5       85.3       90.3
DGCRL [15]          67.9       79.1       86.2       91.8
PCE [16]            70.1       79.8       86.9       92.0
Proxy-Anchor [17]   69.9       79.6       86.6       91.4
MPFE [18]           69.3       79.9       87.3       92.1
NTB [19]            72.2       81.1       87.5       92.3
Proposed method     77.6       86.0       91.0       94.4
Table 3. Comparison of Recall@K of different methods on Cars-196 dataset.

Methods            Recall@1   Recall@2   Recall@4   Recall@8
SCDA               58.5       69.8       79.1       86.2
CRL-WSL            63.9       73.7       82.1       89.2
DGCRL              75.9       83.9       89.7       94.0
PCE                86.7       91.7       95.2       97.0
Proxy-Anchor       87.7       92.7       95.5       97.3
MPFE               86.3       91.7       95.0       97.1
NTB                90.4       94.5       96.6       98.0
Proposed method    92.0       95.3       97.1       98.4