Feature Fusion and Metric Learning Network for Zero-Shot Sketch-Based Image Retrieval
Abstract
1. Introduction
2. Related Work
2.1. Deep Metric Learning
2.2. Zero-Shot Sketch-Based Image Retrieval
2.3. Feature Fusion
3. Methodology
3.1. Problem Description
3.2. Model Structure
3.2.1. Attention Map Feature Fusion
3.2.2. Attention
3.2.3. Domain Aware Triplet
3.3. Training Approaches
3.3.1. Embedding Learning
3.3.2. Pairwise Training
3.3.3. Objective and Optimization
Algorithm 1. Overall training procedure
Input: training set; batch size N; hyperparameter of the regularizer.
Parameters: model parameters; classification layer.
4. Experiments
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Comparison with the State of the Art
4.5. Qualitative Results
4.6. Ablation Studies
5. Limitation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Ribeiro, L.S.F.; Bui, T.; Collomosse, J.; Ponti, M. Scene designer: A unified model for scene search and synthesis from sketch. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 19–25 June 2021; pp. 2424–2433.
2. Kapoor, R.; Sharma, D.; Gulati, T. State of the art content based image retrieval techniques using deep learning: A survey. Multimed. Tools Appl. 2021, 80, 29561–29583.
3. Yelamarthi, S.K.; Reddy, S.K.; Mishra, A.; Mittal, A. A zero-shot framework for sketch based image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 300–317.
4. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
6. Leal-Taixé, L.; Canton-Ferrer, C.; Schindler, K. Learning by tracking: Siamese CNN for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 33–40.
7. Dey, S.; Riba, P.; Dutta, A.; Llados, J.; Song, Y.Z. Doodle to search: Practical zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2179–2188.
8. Liu, Q.; Xie, L.; Wang, H.; Yuille, A.L. Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3662–3671.
9. Zhang, Z.; Zhang, Y.; Feng, R.; Zhang, T.; Fan, W. Zero-shot sketch-based image retrieval via graph convolution network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12943–12950.
10. Zhu, J.; Xu, X.; Shen, F.; Lee, R.K.W.; Wang, Z.; Shen, H.T. Ocean: A dual learning approach for generalized zero-shot sketch-based image retrieval. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
11. Chaudhuri, U.; Banerjee, B.; Bhattacharya, A.; Datcu, M. CrossATNet-a novel cross-attention based framework for sketch-based image retrieval. Image Vis. Comput. 2020, 104, 104003.
12. Deng, C.; Xu, X.; Wang, H.; Yang, M.; Tao, D. Progressive cross-modal semantic network for zero-shot sketch-based image retrieval. IEEE Trans. Image Process. 2020, 29, 8892–8902.
13. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; PMLR: New York City, NY, USA, 2014; pp. 1188–1196.
14. Liu, L.; Shen, F.; Shen, Y.; Liu, X.; Shao, L. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2862–2871.
15. Shen, Y.; Liu, L.; Shen, F.; Shao, L. Zero-shot sketch-image hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3598–3607.
16. Dutta, A.; Akata, Z. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 5089–5098.
17. Wang, W.; Shi, Y.; Chen, S.; Peng, Q.; Zheng, F.; You, X. Norm-guided Adaptive Visual Embedding for Zero-Shot Sketch-Based Image Retrieval. In Proceedings of the IJCAI, Montreal, QC, Canada, 19–27 August 2021; pp. 1106–1112.
18. Tursun, O.; Denman, S.; Sridharan, S.; Goan, E.; Fookes, C. An efficient framework for zero-shot sketch-based image retrieval. Pattern Recognit. 2022, 126, 108528.
19. Ren, H.; Zheng, Z.; Lu, H. Energy-Guided Feature Fusion for Zero-Shot Sketch-Based Image Retrieval. Neural Process. Lett. 2022, 54, 5711–5720.
20. Zhang, Z.; Zhang, X.; Peng, C.; Xue, X.; Sun, J. Exfuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 269–284.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30.
22. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; PMLR: New York City, NY, USA, 2021; pp. 11863–11874.
23. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
25. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. Supplementary material for ‘ECA-Net: Efficient channel attention for deep convolutional neural networks’. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13–19.
26. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699.
27. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019.
28. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407.
29. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737.
30. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
31. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2.
32. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 2, pp. 1735–1742.
33. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
34. Oh Song, H.; Xiang, Y.; Jegelka, S.; Savarese, S. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2016; pp. 4004–4012.
35. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition; University of Oxford: Oxford, UK, 2015.
36. Li, J.; Ling, Z.; Niu, L.; Zhang, L. Zero-shot sketch-based image retrieval with structure-aware asymmetric disentanglement. Comput. Vis. Image Underst. 2022, 218, 103412.
37. Liu, R.; Yu, Q.; Yu, S.X. Unsupervised sketch to photo synthesis. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 36–52.
38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
39. Zhai, A.; Wu, H.Y. Classification is a strong baseline for deep metric learning. arXiv 2018, arXiv:1811.12649.
40. Wang, Z.; Wang, H.; Yan, J.; Wu, A.; Deng, C. Domain-smoothing network for zero-shot sketch-based image retrieval. arXiv 2021, arXiv:2106.11841.
41. Huang, Z.; Sun, Y.; Han, C.; Gao, C.; Sang, N. Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image Retrieval. arXiv 2021, arXiv:2112.07966.
42. Sangkloy, P.; Burnell, N.; Ham, C.; Hays, J. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Trans. Graph. (TOG) 2016, 35, 1–12.
43. Eitz, M.; Hays, J.; Alexa, M. How do humans sketch objects? ACM Trans. Graph. (TOG) 2012, 31, 1–10.
44. Felix, R.; Reid, I.; Carneiro, G. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 21–37.
45. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
46. Kodirov, E.; Xiang, T.; Gong, S. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3174–3183.
47. Chaudhuri, U.; Chavan, R.; Banerjee, B.; Dutta, A.; Akata, Z. BDA-SketRet: Bi-level domain adaptation for zero-shot SBIR. Neurocomputing 2022, 514, 245–255.

| Task | Methods | Dimension | Sketchy_c100 mAP@all | Sketchy_c100 Prec@100 | Sketchy mAP@200 | Sketchy Prec@200 | Tu-Berlin mAP@all | Tu-Berlin Prec@100 |
|---|---|---|---|---|---|---|---|---|
| SBIR | GN Triplet (2016) [42] | 1024 | 20.4 | 29.6 | - | - | 17.5 | 25.3 |
| SBIR | (2017) [14] | 64 | 17.1 | 23.1 | - | - | 12.9 | 18.9 |
| ZSL | SAE (2017) [46] | 300 | 21.6 | 29.3 | - | - | 16.7 | 22.1 |
| ZSL | FRWGAN (2018) [44] | 512 | 12.7 | 16.9 | - | - | 11.0 | 15.7 |
| ZS-SBIR | Doodle2Search (2019) [7] | 4096 | - | - | 46.1 | 37.0 | 10.9 | - |
| ZS-SBIR | Sake (2019) [8] | 512 | - | - | 49.7 | 59.8 | 47.5 | 59.9 |
| ZS-SBIR | SketchyGCN (2020) [9] | 1024 | 38.2 | 53.8 | - | - | 32.4 | 50.5 |
| ZS-SBIR | OCEAN (2020) [10] | 512 | 46.2 | 59.0 | - | - | 33.3 | 46.7 |
| ZS-SBIR | PCMSN (2020) [12] | 64 | 52.3 | 61.6 | - | - | 42.4 | 51.7 |
| ZS-SBIR | SBTKNet (2021) [18] | 512 | 55.2 | 69.7 | 50.2 | 59.6 | 48.0 | 60.8 |
| ZS-SBIR | DSN (2021) [40] | 512 | 58.1 | 70.0 | - | - | 49.3 | 60.7 |
| ZS-SBIR | NAVE (2021) [17] | 512 | 61.3 | 72.5 | - | - | 48.4 | 59.1 |
| ZS-SBIR | MATHM (2021) [41] | 512 | 62.9 | 73.8 | 48.5 | 58.1 | 46.1 | 59.8 |
| ZS-SBIR | EGFF (2022) [19] | 512 | 62.3 | 75.5 | 51.7 | 61.2 | 46.2 | 60.4 |
| ZS-SBIR | BDA-SketRet (2022) [47] | 64 | - | - | 43.7 | 51.4 | 37.4 | 50.4 |
| ZS-SBIR | FFMLN (ours) | 64 | 55.9 | 67.8 | 46.1 | 56.2 | 44.0 | 54.4 |
| ZS-SBIR | FFMLN (ours) | 512 | 65.6 | 77.0 | 53.6 | 62.4 | 49.3 | 61.9 |

| Methods | Dimension | Quickdraw mAP@all | Quickdraw mAP@200 | Quickdraw P@200 |
|---|---|---|---|---|
| CVAE (2018) [3] | 4096 | 0.30 | - | 0.30 |
| Doodle2Search (2019) [7] | 4096 | 7.52 | 9.01 | 6.75 |
| SBTKNet (2021) [18] | 512 | 11.9 | - | 16.7 |
| FFMLN (ours) | 64 | 26.7 | 29.3 | 39.7 |
| FFMLN (ours) | 512 | 28.8 | 34.5 | 45.1 |

| Loss Function | Sketchy_c100 mAP@all | Sketchy_c100 Prec@100 | Tu-Berlin mAP@all | Tu-Berlin Prec@100 |
|---|---|---|---|---|
| | 61.01 | 75.23 | 44.73 | 59.47 |
| + | 61.96 | 75.55 | 46.57 | 60.39 |
| + + | 64.22 | 76.22 | 47.91 | 60.85 |
| + + | 65.52 | 77.11 | 48.92 | 61.53 |
| FFMLN (ours) | 65.63 | 77.05 | 49.30 | 61.90 |

| Method | Parameters | Convergence | Training Time | Inference Time | mAP@all |
|---|---|---|---|---|---|
| EGFF (2022) [19] | 56.19 M | 10 | 119.8 min | 286 | 46.2 |
| DSN (2021) [40] | 54.36 M | 16 | 346.1 min | - | 49.3 |
| | 290.1 M | 6 | 81 min | 278 | 49.3 |

| Task | Embedding Method | Sketchy_c100 Prec@100 | Tu-Berlin mAP@all |
|---|---|---|---|
| ZS-SBIR | VGG-16 | 59.8 | 36.9 |
| ZS-SBIR | CSE_ResNet-50 | 73.8 | 46.1 |
| ZS-SBIR | EGFF | 75.5 | 47.2 |
| ZS-SBIR | Siamese CNN | 73.1 | 46.5 |
| ZS-SBIR | AMFF (ours) | 77.0 | 49.3 |

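
The Prec@k and mAP columns in the tables above are ranked-retrieval metrics. As a minimal sketch of how such numbers are commonly computed, the following assumes each query yields a gallery ranking encoded as a 0/1 relevance list (1 when the retrieved image shares the query's class). The function names are illustrative and not taken from the paper, and the AP normalization shown (dividing by the number of relevant items found in the ranked list) is one common convention; the authors' evaluation code may differ.

```python
from typing import Optional, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant.

    Assumes the ranked list contains at least k items.
    """
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[int], k: Optional[int] = None) -> float:
    """Average precision over the top-k results (k=None uses the full ranking).

    Precision is accumulated at each rank where a relevant item appears,
    then divided by the number of relevant items found (assumed convention).
    """
    ranked = relevance if k is None else relevance[:k]
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(
    per_query: Sequence[Sequence[int]], k: Optional[int] = None
) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r, k) for r in per_query) / len(per_query)
```

Under these assumptions, mAP@all corresponds to `k=None`, mAP@200 to `k=200`, and the table entries would be these values multiplied by 100.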
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhao, H.; Liu, M.; Li, M. Feature Fusion and Metric Learning Network for Zero-Shot Sketch-Based Image Retrieval. Entropy 2023, 25, 502. https://doi.org/10.3390/e25030502