Multi-Subject Image Retrieval by Fusing Object and Scene-Level Feature Embeddings
Abstract
:1. Introduction
- We introduce a new problem, multi-subject image retrieval (MSIR), searching for images from a multi-subject database composed of complex and diverse images.
- We propose an image embedding that combines object- and scene-level feature embeddings extracted using BoVW-based models.
- Our method shows better performance in MSIR compared with the existing methods.
2. Related Work
3. Methods
4. Experiments
4.1. Single Subject Image Retrieval
4.2. Multi-Subject Image Retrieval
4.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zheng, L.; Yang, Y.; Tian, Q. SIFT Meets CNN: A Decade Survey of Instance Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1224–1244. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Radenovic, F.; Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 5706–5715. [Google Scholar]
- Gkelios, S.; Sophokleous, A.; Plakias, S.; Boutalis, Y.; Chatzichristofis, S.A. Deep convolutional features for image retrieval. Expert Syst. Appl. 2021, 177, 114940. [Google Scholar] [CrossRef]
- Chen, B.C.; Wu, Z.; Davis, L.S.; Lim, S.N. Efficient Object Embedding for Spliced Image Retrieval. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 14960–14970. [Google Scholar]
- Zhou, P.; Chen, B.C.; Han, X.; Najibi, M.; Shrivastava, A.; Lim, S.N.; Davis, L. Generate, Segment, and Refine: Towards Generic Manipulation Segmentation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13058–13065. [Google Scholar] [CrossRef]
- Huh, M.; Liu, A.; Owens, A.; Efros, A.A. Fighting Fake News: Image Splice Detection via Learned Self-Consistency. In Computer Vision—ECCV 2018; Springer International Publishing: Cham, Switzerland, 2018; Volume 11215, pp. 106–124. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Radenovic, F.; Tolias, G.; Chum, O. Fine-Tuning CNN Image Retrieval with No Human Annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1655–1668. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Gordo, A.; Almazán, J.; Revaud, J.; Larlus, D. End-to-End Learning of Deep Visual Representations for Image Retrieval. Int. J. Comput. Vis. 2017, 124, 237–254. [Google Scholar] [CrossRef] [Green Version]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Csurka, G.; Bray, C.; Dance, C.; Fan, L. Visual categorization with bags of keypoints. In Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV, Prague, Czech Republic, 11–14 May 2004; pp. 1–22. [Google Scholar]
- Sivic, J.; Zisserman, A. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; Volume 2, pp. 1470–1477. [Google Scholar]
- Fei-Fei, L.; Perona, P. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 524–531. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Iakovidou, C.; Anagnostopoulos, N.; Kapoutsis, A.; Boutalis, Y.; Lux, M.; Chatzichristofis, S. Localizing global descriptors for content-based image retrieval. Eurasip J. Adv. Signal Process. 2015, 2015, 80. [Google Scholar] [CrossRef] [Green Version]
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
- Sharif Razavian, A.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 806–813. [Google Scholar]
- Babenko, A.; Slesarev, A.; Chigorin, A.; Lempitsky, V. Neural codes for image retrieval. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 584–599. [Google Scholar]
- Yue-Hei Ng, J.; Yang, F.; Davis, L.S. Exploiting local features from deep networks for image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 53–61. [Google Scholar]
- Tolias, G.; Sicre, R.; Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. arXiv 2015, arXiv:1511.05879. [Google Scholar]
- Kalantidis, Y.; Mellina, C.; Osindero, S. Cross-dimensional weighting for aggregated deep convolutional features. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 685–701. [Google Scholar]
- Gordo, A.; Almazán, J.; Revaud, J.; Larlus, D. Deep image retrieval: Learning global representations for image search. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 241–257. [Google Scholar]
- Chum, O.; Philbin, J.; Sivic, J.; Isard, M.; Zisserman, A. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
- Perd’och, M.; Chum, O.; Matas, J. Efficient representation of local geometry for large scale object retrieval. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 9–16. [Google Scholar]
- Tolias, G.; Jégou, H. Visual query expansion with or without geometry: Refining local descriptors by feature aggregation. Pattern Recognit. 2014, 47, 3466–3476. [Google Scholar] [CrossRef] [Green Version]
- Babenko, A.; Lempitsky, V. Aggregating deep convolutional features for image retrieval. arXiv 2015, arXiv:1510.07493. [Google Scholar]
- Jegou, H.; Douze, M.; Schmid, C. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 304–317. [Google Scholar]
- Nister, D.; Stewenius, H. Scalable recognition with a vocabulary tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 2161–2168. [Google Scholar]
- Arandjelovic, R.; Gronát, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NA, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
- Jun, H.; Ko, B.; Kim, Y.; Kim, I.; Kim, J. Combination of Multiple Global Descriptors for Image Retrieval. arXiv 2019, arXiv:1903.10663. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Lee, S.; Seong, H.; Lee, S.; Kim, E. Correlation Verification for Image Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 5364–5374. [Google Scholar]
- Ouyang, J.; Wu, H.; Wang, M.; Zhou, W.; Li, H. Contextual Similarity Aggregation with Self-Attention for Visual Re-Ranking. Adv. Neural Inf. Process. Syst. 2021, 34, 3135–3148. [Google Scholar]
- Nguyen, X.B.; Bui, D.T.; Duong, C.N.; Bui, T.D.; Luu, K. Clusformer: A Transformer based Clustering Approach to Unsupervised Large-scale Face and Visual Landmark Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 10842–10851. [Google Scholar]
- Chum, O.; Mikulík, A.; Perdoch, M.; Matas, J. Total recall II: Query expansion revisited. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 889–896. [Google Scholar]
- Cao, B.; Araujo, A.; Sim, J. Unifying Deep Local and Global Features for Image Search. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12365, pp. 726–743. [Google Scholar]
- Shen, X.; Lin, Z.; Brandt, J.; Avidan, S.; Wu, Y. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3013–3020. [Google Scholar]
- Bai, S.; Bai, X.; Tian, Q.; Latecki, L.J. Regularized Diffusion Process for Visual Retrieval. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 3967–3973. [Google Scholar]
- Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Learning Rich Features for Image Manipulation Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1053–1061. [Google Scholar]
- Bosch, A.; Zisserman, A.; Munoz, X. Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 712–727. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Van de Weijer, J.; Gevers, T.; Bagdanov, A.D. Boosting color saliency in image feature detection. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 28, 150–156. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Van De Sande, K.; Gevers, T.; Snoek, C. Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1582–1596. [Google Scholar] [CrossRef] [PubMed]
- Zhang, S.; Yang, M.; Cour, T.; Yu, K.; Metaxas, D.N. Query specific fusion for image retrieval. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 660–673. [Google Scholar]
- Deng, C.; Ji, R.; Liu, W.; Tao, D.; Gao, X. Visual reranking through weakly supervised multi-graph learning. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 2–8 December 2013; pp. 2600–2607. [Google Scholar]
- Zheng, L.; Wang, S.; Tian, L.; He, F.; Liu, Z.; Tian, Q. Query-adaptive late fusion for image search and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1741–1750. [Google Scholar]
- Almazán, J.; Ko, B.; Gu, G.; Larlus, D.; Kalantidis, Y. Granularity-Aware Adaptation for Image Retrieval Over Multiple Tasks. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 389–406. [Google Scholar]
- Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Method | COCO | AID | Paris | Oxford | Base |
---|---|---|---|---|---|
DETR | 85.35 | 15.86 | 18.12 | 13.05 | BoVW |
Random | 60.79 | 36.1 | 69.73 | 21.41 | BoVW |
SIFT | 13.41 | 22.18 | 30.58 | 15.46 | BoVW |
ResNet | 56.46 | 54.98 | 65.43 | 23.32 | CNN |
R-MAC | 64.57 | 51.08 | 67.11 | 24.88 | CNN |
R-GeM | 64.39 | 58.99 | 72.15 | 25.27 | CNN |
Method | C+A | C+P | C+P+O | C+A+P+O | Base |
---|---|---|---|---|---|
DETR | 33.77 | 45.59 | 38.89 | 28.78 | BoVW |
Random | 27.62 | 33.39 | 28.95 | 26.54 | BoVW |
SIFT | 5.21 | 6.66 | 6.71 | 6.23 | BoVW |
ResNet | 35.14 | 34.76 | 31.73 | 35.92 | CNN |
R-MAC | 33.88 | 36.42 | 33.56 | 36.26 | CNN |
R-GeM | 36.39 | 36.74 | 33.77 | 38.36 | CNN |
Feature | 37.16 | 50.08 | 43.79 | 30.32 | Mix BoVW |
Codebook | 40.24 | 53.16 | 47.22 | 31.31 | Mix BoVW |
Histogram | 43.54 | 55.42 | 48.49 | 39.55 | Mix BoVW |
Class Embedding | Weighted Histogram | Accuracy | Dataset |
---|---|---|---|
53.25 | COCO | ||
√ | 81.34 | COCO | |
√ | √ | 85.35 | COCO |
√ | 18.12 | Paris | |
√ | √ | 14.11 | Paris |
√ | 45.59 | COCO+Paris | |
√ | √ | 37.43 | COCO+Paris |
Dataset | Multi-Subject Codebook | COCO Codebook | ||||
---|---|---|---|---|---|---|
Feature | Codebook | Histogram | Feature | Codebook | Histogram | |
C+A | 36.24 | 35.37 | 40.97 | 37.16 | 40.24 | 43.54 |
C+P | 47.08 | 45.81 | 50.83 | 50.08 | 53.16 | 55.42 |
C+P+O | 43.95 | 43.66 | 44.98 | 43.79 | 47.22 | 48.49 |
C+P+O+A | 29.37 | 29.68 | 38.23 | 30.32 | 31.31 | 39.55 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ban, C.-G.; Hwang, Y.; Park, D.; Lee, R.; Jang, R.-Y.; Choi, M.-S. Multi-Subject Image Retrieval by Fusing Object and Scene-Level Feature Embeddings. Appl. Sci. 2022, 12, 12705. https://doi.org/10.3390/app122412705
Ban C-G, Hwang Y, Park D, Lee R, Jang R-Y, Choi M-S. Multi-Subject Image Retrieval by Fusing Object and Scene-Level Feature Embeddings. Applied Sciences. 2022; 12(24):12705. https://doi.org/10.3390/app122412705
Chicago/Turabian StyleBan, Chung-Gi, Youngbae Hwang, Dayoung Park, Ryong Lee, Rae-Young Jang, and Myung-Seok Choi. 2022. "Multi-Subject Image Retrieval by Fusing Object and Scene-Level Feature Embeddings" Applied Sciences 12, no. 24: 12705. https://doi.org/10.3390/app122412705
APA StyleBan, C.-G., Hwang, Y., Park, D., Lee, R., Jang, R.-Y., & Choi, M.-S. (2022). Multi-Subject Image Retrieval by Fusing Object and Scene-Level Feature Embeddings. Applied Sciences, 12(24), 12705. https://doi.org/10.3390/app122412705