ReGeNet: Relevance-Guided Generative Network to Evaluate the Adversarial Robustness of Cross-Modal Retrieval Systems
Abstract
1. Introduction
- We reformulate adversarial attacks on cross-modal hashing retrieval, turning the conventional iterative, instance-specific optimization into a relevance-guided generative learning paradigm that produces adversarial examples in a single forward pass (the two regimes are contrasted in the sketch after this list).
- We evaluate the proposed framework on multiple cross-modal hashing retrieval systems using two widely adopted benchmark datasets: NUS-WIDE [29] and FLICKR-25K [28]. The experimental results demonstrate that our approach achieves competitive or superior attack performance while being significantly more efficient.
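To make the contrast above concrete, the following minimal sketch juxtaposes the two attack regimes. `hash_model`, `generator`, and all hyperparameter values are illustrative placeholders under our own assumptions, not the paper's actual interfaces:

```python
import torch

# Hypothetical stand-ins: `hash_model` maps inputs to K real-valued hash
# logits that are binarized with sign() at retrieval time; `generator` is an
# already-trained perturbation network. Names and values are illustrative.

def iterative_attack(hash_model, x, target_code, steps=200,
                     eps=8 / 255, alpha=1 / 255):
    """Instance-specific baseline: `steps` gradient updates per single query."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        out = torch.tanh(hash_model(x + delta))   # differentiable code relaxation
        loss = -(out * target_code).sum()         # pull output toward target code
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()    # signed gradient step
            delta.clamp_(-eps, eps)               # keep perturbation bounded
        delta.grad.zero_()
    return (x + delta).detach()

def generative_attack(generator, x):
    """Generative reformulation: one forward pass per query at test time."""
    with torch.no_grad():
        return generator(x)
```

The efficiency gap reported later (Iter. = 1 for the proposed method versus 200–500 for iterative baselines) follows directly from this structural difference.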
2. Related Work
2.1. Deep Cross-Modal Hamming Retrieval
2.2. Adversarial Attack on Deep Hash Retrieval
3. Background
3.1. Cross-Modal Hash Retrieval System
- Embedding the collected, semantically similar cross-modal data pairs into feature vectors.
- Constructing an optimization loss that enhances both the semantic similarity and the Hamming-space similarity of paired data.
- Hashing all cross-modal data and storing the resulting codes in the database for retrieval (a minimal Hamming-retrieval sketch follows this list).
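As referenced above, here is a minimal sketch of the retrieval stage, assuming codes are stored as {−1, +1} vectors of length K; the function names are ours, not the paper's:

```python
import numpy as np

def hamming_distance(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """For {-1,+1} codes of length K, d_H(q, b) = (K - q.b) / 2."""
    K = query_code.shape[0]
    return (K - db_codes @ query_code) / 2

def retrieve(query_code: np.ndarray, db_codes: np.ndarray, top_k: int = 10):
    """Indices of the `top_k` database items closest to the query in Hamming space."""
    return np.argsort(hamming_distance(query_code, db_codes))[:top_k]

# Toy usage: a 16-bit query against 1000 stored cross-modal codes.
rng = np.random.default_rng(0)
db_codes = rng.choice([-1, 1], size=(1000, 16))
query = rng.choice([-1, 1], size=16)
print(retrieve(query, db_codes, top_k=5))
```

Because ranking depends only on bitwise comparisons, retrieval stays fast even over large databases, which is what makes hash-based cross-modal retrieval attractive at scale.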
3.2. Problem Formulation
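As a reference point, attacks on hashing retrieval are commonly posed as a bounded-perturbation optimization; a standard untargeted form, written in our own notation rather than necessarily the paper's, is:

```latex
% F is the trained hashing function with codes in \{-1,+1\}^K, b = F(x) the
% benign code, \epsilon the perturbation budget (our notation, for reference).
\max_{\|\delta\|_\infty \le \epsilon} \; d_H\!\big(F(x+\delta),\, b\big),
\qquad d_H(b_1, b_2) = \tfrac{1}{2}\big(K - b_1^{\top} b_2\big)
```

A targeted variant instead minimizes $d_H\!\big(F(x+\delta),\, b_t\big)$ for a target code $b_t$ derived from chosen semantic labels.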
4. Framework
4.1. Overview
4.2. Cross-Modal Adversarial Generative Network
4.3. Supervised Adversarial Label Network
4.4. Relevance-Guided Graph Convolution Network
4.5. Adversarial Hash Learning
Algorithm 1: The training process of the relevance-guided generative network.
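Since only the caption of Algorithm 1 is shown here, the following is a speculative sketch of what such a training loop could look like, assembled from the components named in the symbol table (parallel generator and discriminator, relevance-guided graph convolution network, label network, and adversarial / reconstruction / discrimination losses). Every module name, loss weight, and update order below is our assumption, not the paper's Algorithm 1:

```python
import torch
import torch.nn.functional as F

# Speculative training skeleton; the real Algorithm 1 may differ in losses,
# weights, and update order.
def train_step(generator, discriminator, gcn, label_net, hash_model,
               x, labels, opt_g, opt_d, w_rec=50.0, w_adv=1.0):
    # Relevance-guided target code: labels -> GCN -> label network -> sign().
    target_code = torch.sign(label_net(gcn(labels))).detach()

    # Discriminator step: separate benign inputs from generated examples.
    x_adv = generator(x, target_code).detach()
    logits_real, logits_fake = discriminator(x), discriminator(x_adv)
    loss_d = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: match the target code, stay close to x, and fool D.
    x_adv = generator(x, target_code)
    code_relaxed = torch.tanh(hash_model(x_adv))
    loss_hash = -(code_relaxed * target_code).sum(dim=1).mean()  # adversarial loss
    loss_rec = F.mse_loss(x_adv, x)                              # reconstruction loss
    logits_fake = discriminator(x_adv)
    loss_gan = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    loss_g = loss_hash + w_rec * loss_rec + w_adv * loss_gan
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```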
5. Experiment
5.1. Dataset
5.2. Evaluation Metrics
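The results below are reported as MAP (mean average precision) alongside a "Per" column, presumably a perceptibility measure of the perturbation. For reference, a minimal MAP computation under binary relevance, in our own code rather than the paper's:

```python
import numpy as np

def average_precision(relevance: np.ndarray) -> float:
    """AP for one query; `relevance` is 0/1 in ranked retrieval order."""
    if not relevance.any():
        return 0.0
    hits = np.cumsum(relevance)
    ranks = np.arange(1, len(relevance) + 1)
    return float((hits / ranks)[relevance == 1].mean())

def mean_average_precision(per_query_relevance) -> float:
    """MAP: the mean of per-query average precisions."""
    return float(np.mean([average_precision(r) for r in per_query_relevance]))

# Toy usage: two queries with ranked 0/1 relevance judgments.
print(mean_average_precision([np.array([1, 0, 1, 0]), np.array([0, 1, 1, 0])]))
```

For an attack, lower post-attack MAP means the retrieval system returns fewer relevant items, i.e., a stronger attack.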
5.3. Implementation
5.4. Compared Methods
5.5. Results
5.6. Running Time and Imperceptibility
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Aradhya, H.R. Object detection and tracking using deep learning and artificial intelligence for video surveillance applications. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 517–530. [Google Scholar] [CrossRef]
- Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; IEEE: New York, NY, USA, 2018; pp. 67–74. [Google Scholar]
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
- Narasimhan, M.; Rohrbach, A.; Darrell, T. CLIP-It! Language-Guided Video Summarization. In Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 13988–14000. [Google Scholar]
- Cao, Y.; Long, M.; Wang, J.; Yang, Q.; Yu, P.S. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1445–1454. [Google Scholar]
- Jiang, Q.Y.; Li, W.J. Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3232–3240. [Google Scholar]
- Cao, Y.; Liu, B.; Long, M.; Wang, J. Cross-modal hamming hashing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 202–218. [Google Scholar]
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
- Shi, Y.; Wang, S.; Han, Y. Curls & whey: Boosting black-box adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6519–6527. [Google Scholar]
- Guo, C.; Gardner, J.; You, Y.; Wilson, A.G.; Weinberger, K. Simple black-box adversarial attacks. In Proceedings of the International Conference on Machine Learning—PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2484–2493. [Google Scholar]
- Komkov, S.; Petiushko, A. AdvHat: Real-world adversarial attack on ArcFace Face ID system. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 819–826. [Google Scholar]
- Moosavi-Dezfooli, S.M.; Fawzi, A.; Frossard, P. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2574–2582. [Google Scholar]
- Dusmanu, M.; Schonberger, J.L.; Sinha, S.N.; Pollefeys, M. Privacy-preserving image features via adversarial affine subspace embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14267–14277. [Google Scholar]
- Che, Z.; Borji, A.; Zhai, G.; Ling, S.; Guo, G.; Callet, P.L. Adversarial attacks against deep saliency models. arXiv 2019, arXiv:1904.01231. [Google Scholar] [CrossRef]
- Ilyas, A.; Engstrom, L.; Athalye, A.; Lin, J. Black-box adversarial attacks with limited queries and information. In Proceedings of the International Conference on Machine Learning—PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2137–2146. [Google Scholar]
- Bhagoji, A.N.; He, W.; Li, B.; Song, D. Practical black-box attacks on deep neural networks using efficient query mechanisms. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 154–169. [Google Scholar]
- Yang, E.; Liu, T.; Deng, C.; Tao, D. Adversarial examples for hamming space search. IEEE Trans. Cybern. 2018, 50, 1473–1484. [Google Scholar] [CrossRef] [PubMed]
- Tolias, G.; Radenovic, F.; Chum, O. Targeted mismatch adversarial attack: Query with a flower to retrieve the tower. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)—ICCV’19, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Bai, J.; Chen, B.; Li, Y.; Wu, D.; Guo, W.; Xia, S.; Yang, E. Targeted Attack for Deep Hashing based Retrieval. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
- Wang, X.; Zhang, Z.; Wu, B.; Shen, F.; Lu, G. Prototype-supervised adversarial network for targeted attack of deep hashing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16357–16366. [Google Scholar]
- Li, C.; Gao, S.; Deng, C.; Xie, D.; Liu, W. Cross-modal learning with adversarial samples. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Li, C.; Tang, H.; Deng, C.; Zhan, L.; Liu, W. Vulnerability vs. reliability: Disentangled adversarial examples for cross-modal learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 421–429. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Tang, Y.; Pino, J.; Li, X.; Wang, C.; Genzel, D. Improving speech translation by understanding and learning from the auxiliary text translation task. arXiv 2021, arXiv:2107.05782. [Google Scholar] [CrossRef]
- Li, S.; Neupane, A.; Paul, S.; Song, C.; Krishnamurthy, S.; Roy-Chowdhury, A.K.; Swami, A. Stealthy Adversarial Perturbations Against Real-Time Video Classification Systems. In Proceedings of the NDSS’19—Network and Distributed Systems Security (NDSS) Symposium, San Diego, CA, USA, 24–27 February 2019. [Google Scholar]
- Hu, W.; Tan, Y. Generating adversarial malware examples for black-box attacks based on GAN. arXiv 2017, arXiv:1702.05983. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Huiskes, M.J.; Lew, M.S. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, BC, Canada, 30–31 October 2008; pp. 39–43. [Google Scholar]
- Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Fira, Greece, 8–10 July 2009; pp. 1–9. [Google Scholar]
- Liu, X.; Mu, Y.; Zhang, D.; Lang, B.; Li, X. Large-scale unsupervised hashing with shared structure learning. IEEE Trans. Cybern. 2014, 45, 1811–1822. [Google Scholar] [CrossRef] [PubMed]
- Yan, C.; Xie, H.; Yang, D.; Yin, J.; Zhang, Y.; Dai, Q. Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE Trans. Intell. Transp. Syst. 2017, 19, 284–295. [Google Scholar] [CrossRef]
- Su, S.; Zhong, Z.; Zhang, C. Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3027–3035. [Google Scholar]
- Chun, S.; Oh, S.J.; De Rezende, R.S.; Kalantidis, Y.; Larlus, D. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8415–8424. [Google Scholar]
- Bai, J.; Chen, B.; Wu, D.; Zhang, C.; Xia, S. Universal Adversarial Head: Practical Protection against Video Data Leakage. In Proceedings of the ICML’21, Online, 18–24 July 2021. [Google Scholar]
- Li, C.; Gao, S.; Deng, C.; Liu, W.; Huang, H. Adversarial Attack on Deep Cross-Modal Hamming Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2218–2227. [Google Scholar]
- Weiss, Y.; Torralba, A.; Fergus, R. Spectral hashing. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; Volume 21. [Google Scholar]
- Pushpalatha, K.R.; Chaitra, M.; Karegowda, A.G. Color Histogram based Image Retrieval—A Survey. Int. J. Adv. Res. Comput. Sci. 2013, 4, 119. [Google Scholar]
- Zhang, Y.; Jin, R.; Zhou, Z.H. Understanding bag-of-words model: A statistical framework. Int. J. Mach. Learn. Cybern. 2010, 1, 43–52. [Google Scholar] [CrossRef]
- Xu, R.; Li, C.; Yan, J.; Deng, C.; Liu, X. Graph Convolutional Network Hashing for Cross-Modal Retrieval. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; Volume 2019, pp. 982–988. [Google Scholar]
- Wang, J.; Zhang, T.; Sebe, N.; Shen, H.T. A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 769–790. [Google Scholar] [CrossRef] [PubMed]
- Qin, Q.; Huang, L.; Wei, Z.; Xie, K.; Zhang, W. Unsupervised deep multi-similarity hashing with semantic structure for image retrieval. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2852–2865. [Google Scholar] [CrossRef]
- Wang, X.; Shi, Y.; Kitani, K.M. Deep supervised hashing with triplet labels. In Computer Vision—ACCV 2016, Proceedings of the 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 70–84. [Google Scholar]
- Liu, H.; Wang, R.; Shan, S.; Chen, X. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2064–2072. [Google Scholar]
- Han, J.; Dong, X.; Zhang, R.; Chen, D.; Zhang, W.; Yu, N.; Luo, P.; Wang, X. Once a man: Towards multi-target attack via learning multi-target adversarial network once. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5158–5167. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]



| Symbol | Definition |
|---|---|
| * | The cross-modal retrieval function |
| * | The cross-modal data points |
| * | The embedding vectors of data points |
| * | The hash codes of cross-modal data points |
| * | The parallel generative network |
| * | The parallel discriminator network |
| * | The relevance-guided graph convolution network |
| * | The label network |
| K | The length of hash codes |
| * | The adversarial loss function |
| * | The reconstruction loss function |
| * | The discrimination loss function |
| Dataset | Train | Query | Retrieval |
|---|---|---|---|
| FLICKR-25K | 5000 | 1000 | 14,015 |
| NUS-WIDE | 5000 | 2100 | 183,321 |
| Task | Method | Iter. | CMHH (FLICKR-25K) | DCMH (FLICKR-25K) | CMHH (NUS-WIDE) | DCMH (NUS-WIDE) |
|---|---|---|---|---|---|---|
| I→T | Original | – | 71.73% | 64.66% | 57.94% | 49.32% |
| I→T | CMLA | 200 | 56.41% | 51.78% | 33.47% | 32.15% |
| I→T | CMLA | 500 | 55.52% | 51.24% | 32.02% | 31.61% |
| I→T | DACM | 200 | 50.57% | 45.69% | 26.61% | 24.89% |
| I→T | DACM | 500 | 49.76% | 44.55% | 25.31% | 23.90% |
| I→T | Ours | 1 | 50.59% | 46.18% | 27.46% | 24.31% |
| T→I | Original | – | 74.08% | 61.94% | 52.45% | 45.76% |
| T→I | CMLA | 200 | 52.73% | 54.02% | 35.68% | 30.18% |
| T→I | CMLA | 500 | 51.45% | 53.69% | 34.73% | 29.55% |
| T→I | DACM | 200 | 48.28% | 51.57% | 27.45% | 26.31% |
| T→I | DACM | 500 | 47.67% | 50.42% | 26.39% | 25.73% |
| T→I | Ours | 1 | 51.51% | 54.32% | 24.36% | 28.47% |
| Settings | MAP (I→T) | Per. (I→T) | MAP (T→I) | Per. (T→I) |
|---|---|---|---|---|
| Original | 50.59% | 0.025 | 51.51% | 0.011 |
| Generative | 54.75% | 0.031 | 53.92% | 0.016 |
| Generator | 61.61% | 0.023 | 55.60% | 0.012 |
| Task | Method | Iteration | Time (FLICKR-25K) | Per. (FLICKR-25K) | Time (NUS-WIDE) | Per. (NUS-WIDE) |
|---|---|---|---|---|---|---|
| I→T | CMLA | 100 | 0.31 | 0.056 | 0.42 | 0.045 |
| I→T | CMLA | 500 | 1.27 | 0.021 | 1.87 | 0.031 |
| I→T | DACM | 100 | 0.35 | 0.045 | 0.45 | 0.039 |
| I→T | DACM | 500 | 1.31 | 0.039 | 1.96 | 0.025 |
| I→T | Ours | 1 | 0.002 | 0.069 | 0.003 | 0.064 |
| T→I | CMLA | 100 | 0.18 | 0.041 | 0.14 | 0.029 |
| T→I | CMLA | 500 | 0.81 | 0.033 | 0.72 | 0.011 |
| T→I | DACM | 100 | 0.16 | 0.058 | 0.11 | 0.035 |
| T→I | DACM | 500 | 0.95 | 0.025 | 0.77 | 0.014 |
| T→I | Ours | 1 | 0.003 | 0.036 | 0.003 | 0.054 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.