A Deep Hashing Technique for Remote Sensing Image-Sound Retrieval
Abstract
1. Introduction
- A new deep cross-modal triplet-based hashing (DTBH) framework is proposed that leverages the triplet similarity of deep features together with triplet labels, addressing the insufficient use of relative semantic similarity relationships in remote sensing (RS) image-sound similarity learning. To the best of our knowledge, this is the first work to use hash codes for cross-modal RS image-sound retrieval.
- A new triplet selection strategy is developed to select effective triplets, which helps capture the intra-class and inter-class variations for hash code learning.
- The objective function is designed to learn the similarities of deep features and to reduce the information loss between hash-like codes and hash codes.
- Extensive experiments on three RS image-sound datasets show that the proposed DTBH method outperforms other state-of-the-art cross-modal image-sound retrieval methods.
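To make the third contribution concrete, the following is a minimal sketch of a triplet ranking term combined with a quantization penalty between hash-like codes and their binarized hash codes. It is an illustration under stated assumptions, not the paper's Equation (18): the squared Euclidean distance, the `margin` and `quant_weight` parameters, and all function names are hypothetical choices made for this example.

```python
from typing import List, Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sign_codes(hash_like: Sequence[float]) -> List[int]:
    """Binarize hash-like codes to {-1, +1} (zero maps to +1 by convention here)."""
    return [1 if x >= 0 else -1 for x in hash_like]


def triplet_hash_loss(
    anchor: Sequence[float],    # hash-like code of the anchor image
    positive: Sequence[float],  # hash-like code of the positive sound
    negative: Sequence[float],  # hash-like code of the negative sound
    margin: float = 1.0,        # assumed margin, not taken from the paper
    quant_weight: float = 0.1,  # assumed weight, not taken from the paper
) -> float:
    # Ranking term: the positive should be closer to the anchor than the
    # negative by at least `margin`.
    ranking = max(0.0, margin + sq_dist(anchor, positive) - sq_dist(anchor, negative))
    # Quantization term: penalize the gap between each hash-like code and its
    # binarized version, which is the information lost when taking the sign.
    quant = sum(sq_dist(h, sign_codes(h)) for h in (anchor, positive, negative))
    return ranking + quant_weight * quant
```

A well-ordered triplet whose codes are already close to ±1 incurs almost no loss, while swapping the positive and negative samples produces a large ranking penalty.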
2. The Proposed Method
2.1. Problem Definition
2.2. Multimodal Architecture
2.3. Triplet Selection Strategy
2.4. Objective Function
Algorithm 1 Optimization algorithm for learning DTBH.
Input: M triplet units
Output: The parameters W of the DTBH approach
Initialization: Initialize W with the glorot_uniform distribution.
Repeat:
1: Use the triplet selection strategy to select triplet units;
2: Compute MFCC features for the RS sounds using a window of e milliseconds with a shift of f milliseconds;
3: Compute the network outputs by forward propagation;
4: Compute the hash codes by using Equations (1)–(3);
5: Compute ℑ from the quantities above according to Equation (18);
6: Update W by using Adam;
Until: a fixed number of iterations is reached or a stopping criterion is satisfied
Return: W
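Step 2 of Algorithm 1 slices each sound into overlapping windows before MFCC features are computed. The sketch below shows only that framing step, assuming a mono signal stored as a list of samples. The function name `frame_signal` and the example values (16 kHz, 25 ms window, 10 ms shift) are illustrative assumptions; the algorithm leaves the window size e and the shift f as parameters, and the later MFCC stages (FFT, mel filterbank, DCT) are omitted.

```python
from typing import List, Sequence


def frame_signal(
    samples: Sequence[float],
    sample_rate: int,
    window_ms: float,  # plays the role of e in Algorithm 1
    shift_ms: float,   # plays the role of f in Algorithm 1
) -> List[List[float]]:
    """Split a signal into overlapping frames; incomplete trailing frames are dropped."""
    window = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    if window <= 0 or hop <= 0:
        raise ValueError("window and shift must each cover at least one sample")
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(list(samples[start:start + window]))
        start += hop
    return frames


# One second of a dummy signal at an assumed 16 kHz sampling rate.
example_frames = frame_signal(list(range(16000)), 16000, window_ms=25, shift_ms=10)
```

With these assumed values each frame holds 400 samples and consecutive frames start 160 samples apart, so a one-second signal yields 98 full frames.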
3. Experiments
3.1. Dataset and Evaluation Protocols
3.2. Implementation Details
3.3. Evaluation of Different Factors
3.4. Results
3.5. Parameter Discussion
4. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Ma, Y.; Wu, H.; Wang, L.; Huang, B.; Ranjan, R.; Zomaya, A.Y.; Jie, W. Remote Sensing Big Data Computing: Challenges and Opportunities. Future Gener. Comput. Syst. 2015, 51, 47–60.
2. Mandal, D.; Chaudhury, K.N.; Biswas, S. Generalized semantic preserving hashing for n-label cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4076–4084.
3. Scott, G.J.; Klaric, M.N.; Davis, C.H.; Shyu, C.R. Entropy-Balanced Bitmap Tree for Shape-Based Object Retrieval From Large-Scale Satellite Imagery Databases. IEEE Trans. Geosci. Remote Sens. 2011, 49, 1603–1616.
4. Demir, B.; Bruzzone, L. Hashing-Based Scalable Remote Sensing Image Search and Retrieval in Large Archives. IEEE Trans. Geosci. Remote Sens. 2016, 54, 892–904.
5. Peng, L.; Peng, R. Partial Randomness Hashing for Large-Scale Remote Sensing Image Retrieval. IEEE Geosci. Remote Sens. Lett. 2017, 14, 464–468.
6. Li, Y.; Zhang, Y.; Xin, H.; Hu, Z.; Ma, J. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Trans. Geosci. Remote Sens. 2018, 54, 950–965.
7. Yang, Y.; Newsam, S. Geographic image retrieval using local invariant features. IEEE Trans. Geosci. Remote Sens. 2012, 51, 818–832.
8. Aptoula, E. Remote sensing image retrieval with global morphological texture descriptors. IEEE Trans. Geosci. Remote Sens. 2013, 52, 3023–3034.
9. Luo, B.; Aujol, J.F.; Gousseau, Y.; Ladjal, S. Indexing of satellite images with different resolutions by wavelet features. IEEE Trans. Image Process. 2008, 17, 1465–1472.
10. Rosu, R.; Donias, M.; Bombrun, L.; Said, S.; Regniers, O.; Da Costa, J.P. Structure tensor Riemannian statistical models for CBIR and classification of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 55, 248–260.
11. Tobin, K.W.; Bhaduri, B.L.; Bright, E.A.; Cheriyadat, A.; Karnowski, T.P.; Palathingal, P.J.; Potok, T.E.; Price, J.R. Automated Feature Generation in Large-Scale Geospatial Libraries for Content-Based Indexing. Photogramm. Eng. Remote Sens. 2006, 72, 531–540.
12. Shyu, C.R.; Klaric, M.; Scott, G.J.; Barb, A.S.; Davis, C.H.; Palaniappan, K. GeoIRIS: Geospatial Information Retrieval and Indexing System-Content Mining, Semantics Modeling, and Complex Queries. IEEE Trans. Geosci. Remote Sens. 2013, 102, 2564–2567.
13. Ye, D.; Li, Y.; Tao, C.; Xie, X.; Wang, X. Multiple feature hashing learning for large-scale remote sensing image retrieval. ISPRS Int. J. Geo-Inf. 2017, 6, 364.
14. Guo, M.; Yuan, Y.; Lu, X. Deep Cross-Modal Retrieval for Remote Sensing Image and Audio. In Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China, 19–20 August 2018; pp. 1–7.
15. Jiang, Q.Y.; Li, W.J. Deep Cross-Modal Hashing. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3270–3278.
16. Li, K.; Qi, G.J.; Ye, J.; Hua, K.A. Linear subspace ranking hashing for cross-modal retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1825–1838.
17. Tang, J.; Wang, K.; Shao, L. Supervised matrix factorization hashing for cross-modal retrieval. IEEE Trans. Image Process. 2016, 25, 3157–3166.
18. Li, D.; Dimitrova, N.; Li, M.; Sethi, I.K. Multimedia content processing through cross-modal association. In Proceedings of the Eleventh ACM International Conference on Multimedia, Berkeley, CA, USA, 2–8 November 2003; pp. 604–611.
19. Zhang, H.; Zhuang, Y.; Wu, F. Cross-modal correlation learning for clustering on image-audio dataset. In Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 24–29 September 2007; pp. 273–276.
20. Song, Y.; Morency, L.P.; Davis, R. Multimodal human behavior analysis: Learning correlation and interaction across modalities. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, New York, NY, USA, 22–26 October 2012; pp. 27–30.
21. Torfi, A.; Iranmanesh, S.M.; Nasrabadi, N.; Dawson, J. 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition. IEEE Access 2017, 5, 22081–22091.
22. Arandjelović, R.; Zisserman, A. Look, Listen and Learn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 609–617.
23. Nagrani, A.; Albanie, S.; Zisserman, A. Seeing Voices and Hearing Faces: Cross-modal biometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
24. Li, Y.; Zhang, Y.; Huang, X.; Ma, J. Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6521–6536.
25. Salem, T.; Zhai, M.; Workman, S.; Jacobs, N. A multimodal approach to mapping soundscapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2524–2527.
26. Gu, J.; Cai, J.; Joty, S.; Niu, L.; Wang, G. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7181–7189.
27. Lin, K.; Yang, H.F.; Hsiao, J.H.; Chen, C.S. Deep learning of binary hash codes for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 27–35.
28. Xu, T.; Jiao, L.; Emery, W.J. SAR Image Content Retrieval Based on Fuzzy Similarity and Relevance Feedback. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1824–1842.
29. Zhang, L.; Yin, X.; Wang, Z.; Hao, L.; Lin, M. Preliminary Analysis of the Potential and Limitations of MICAP for the Retrieval of Sea Surface Salinity. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2979–2990.
30. Yang, Z.; Li, Z.; Zhu, J.; Preusse, A.; Hu, J.; Feng, G.; Papst, M. High-Resolution Three-Dimensional Displacement Retrieval of Mining Areas From a Single SAR Amplitude Pair Using the SPIKE Algorithm. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3782–3793.
31. Lu, X.; Chen, Y.; Li, X. Siamese Dilated Inception Hashing With Intra-Group Correlation Enhancement for Image Retrieval. IEEE Trans. Neural Netw. Learn. Syst. 2019.
32. Chen, Y.; Lu, X.; Feng, Y. Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, Xi’an, China, 8–11 November 2019; pp. 454–465.
33. Ruzaij, M.F.; Neubert, S.; Stoll, N.; Thurow, K. Hybrid voice controller for intelligent wheelchair and rehabilitation robot using voice recognition and embedded technologies. J. Adv. Comput. Intell. 2016, 20, 615–622.
34. Harwath, D.; Glass, J.R. Learning Word-Like Units from Joint Audio-Visual Analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 506–517.
35. Chapaneri, S.V. Spoken digits recognition using weighted MFCC and improved features for dynamic time warping. Int. J. Comput. Appl. 2012, 40, 6–12.
36. Chahal, A.; Kaur, R.; Baghla, S.; Kaushal, G. Heart Rate Monitoring using Human Speech Features Extraction: A Review. Heart 2017, 4, 444–449.
37. Selvakumari, N.S.; Radha, V. Recent Survey on Feature Extraction Methods for Voice Pathology and Voice Disorder. Int. J. Comput. Math. Sci. 2017, 6, 74–79.
38. Kim, J.; Kumar, N.; Tsiartas, A.; Li, M.; Narayanan, S.S. Automatic intelligibility classification of sentence-level pathological speech. Comput. Speech Lang. 2015, 29, 132–144.
39. Harwath, D.F.; Torralba, A.; Glass, J.R. Unsupervised Learning of Spoken Language with Visual Context. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1858–1866.
40. Zhang, R.; Lin, L.; Zhang, R.; Zuo, W.; Zhang, L. Bit-Scalable Deep Hashing With Regularized Similarity Learning for Image Retrieval and Person Re-Identification. IEEE Trans. Image Process. 2015, 24, 4766–4779.
41. Liu, H.; Wang, R.; Shan, S.; Chen, X. Deep Supervised Hashing for Fast Image Retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2064–2072.
42. Gong, Y.; Lazebnik, S. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 817–824.
43. Zhu, H.; Long, M.; Wang, J.; Cao, Y. Deep hashing network for efficient similarity retrieval. In Proceedings of the AAAI, Phoenix, AZ, USA, 12–17 February 2016; pp. 2415–2421.
44. Lu, X.; Chen, Y.; Li, X. Hierarchical Recurrent Neural Hashing for Image Retrieval with Hierarchical Convolutional Features. IEEE Trans. Image Process. 2018, 27, 106–120.
45. Hyvärinen, A.; Hurri, J.; Hoyer, P.O. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009; Volume 39.
46. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference for Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
47. Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the International Conference on Computer, Information and Telecommunication Systems, Kunming, China, 6–8 August 2016; pp. 1–5.
48. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195.
49. Li, Y.; Chen, Y.; Huang, J. An Approach to Improve Leaf Pigment Content Retrieval by Removing Specular Reflectance Through Polarization Measurements. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2173–2186.
50. Kolassa, J.; Gentine, P.; Prigent, C.; Aires, F.; Alemohammad, S.H. Soil moisture retrieval from AMSR-E and ASCAT microwave observation synergy. Part 2: Product evaluation. Remote Sens. Environ. 2017, 195, 202–217.
51. Zhang, H.; Liu, X.; Yang, S.; Yu, L.I. Retrieval of remote sensing images based on semisupervised deep learning. J. Remote Sens. 2017, 21, 406–414.
52. Imbriaco, R.; Sebastian, C.; Bondarev, E.; De With, P.H.N. Aggregated Deep Local Features for Remote Sensing Image Retrieval. Remote Sens. 2019, 11, 493.
53. Mouha, N.; Raunak, M.S.; Kuhn, D.R.; Kacker, R. Finding bugs in cryptographic hash function implementations. IEEE Trans. Reliab. 2018, 67, 870–884.
54. Guo, M.; Zhou, C.; Liu, J. Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019.
55. Pan, P.; Xu, Z.; Yang, Y.; Wu, F.; Zhuang, Y. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1029–1038.

Definition | Symbol |
---|---|
the triplet units | |
the set of the RS anchor images | |
the set of the RS positive sounds | |
the set of the RS negative sounds | |
the m-th RS anchor image | |
the m-th RS positive sound | |
the m-th RS negative sound | |
the Hamming distance | |
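
The Hamming distance listed in the notation above is what ranks database items against a query once both modalities are mapped to binary codes. The sketch below assumes codes with entries in {-1, +1}; the function names are hypothetical and are not taken from the paper. For such codes of length K, the distance also equals (K - ⟨a, b⟩) / 2, which the second function uses.

```python
from typing import List, Sequence


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def hamming_from_inner_product(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Same distance for codes in {-1, +1}, via d = (K - <a, b>) / 2."""
    k = len(code_a)
    inner = sum(a * b for a, b in zip(code_a, code_b))
    return (k - inner) // 2


def rank_by_hamming(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted from nearest to farthest from the query."""
    return sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))
```

For example, a sound query code can be ranked against a list of image codes with `rank_by_hamming(sound_code, image_codes)`; ties keep their original database order because Python's sort is stable.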

Task | Method | 16 Bits | 32 Bits | 48 Bits | 64 Bits |
---|---|---|---|---|---|
S→I | DTBH-T | 42.29 | 43.25 | 44.52 | 45.36 |
S→I | DTBH-D | 44.40 | 46.65 | 47.71 | 48.88 |
S→I | DTBH-R | 53.53 | 54.61 | 55.58 | 56.91 |
S→I | DTBH-S | 54.43 | 55.28 | 56.37 | 57.76 |
S→I | DTBH | 57.17 | 58.36 | 59.46 | 60.13 |
I→S | DTBH-T | 47.28 | 48.09 | 48.98 | 49.86 |
I→S | DTBH-D | 50.32 | 51.56 | 52.38 | 53.24 |
I→S | DTBH-R | 57.15 | 59.04 | 60.93 | 61.84 |
I→S | DTBH-S | 58.61 | 60.45 | 61.38 | 62.42 |
I→S | DTBH | 60.85 | 62.35 | 63.45 | 64.27 |

Task | Method | 16 Bits | 32 Bits | 48 Bits | 64 Bits |
---|---|---|---|---|---|
S→I | DTBH+AlexNet | 54.69 | 56.41 | 57.16 | 58.51 |
S→I | DTBH+GoogleNet | 55.83 | 57.17 | 57.94 | 59.22 |
S→I | DTBH+VGG16 | 57.17 | 58.36 | 59.46 | 60.13 |
I→S | DTBH+AlexNet | 57.87 | 59.25 | 60.47 | 62.15 |
I→S | DTBH+GoogleNet | 59.44 | 60.83 | 62.27 | 63.36 |
I→S | DTBH+VGG16 | 60.85 | 62.35 | 63.45 | 64.27 |

Task | Method | MAP | Precision@1 | Precision@5 | Precision@10 |
---|---|---|---|---|---|
S→I | SIFT+M | 6.66 | 3.58 | 4.41 | 4.68 |
S→I | DBLP [39] | 19.33 | 17.12 | 17.62 | 16.31 |
S→I | CNN+SPEC [22] | 21.79 | 19.42 | 19.86 | 19.23 |
S→I | DVAN [14] | 32.28 | 32.37 | 33.91 | 34.34 |
S→I | DTBH-D | 48.88 | 58.56 | 54.29 | 50.35 |
S→I | DTBH-R | 56.91 | 65.48 | 61.54 | 57.32 |
S→I | DTBH | 60.13 | 70.26 | 66.63 | 61.73 |

Task | Method | MAP | Precision@1 | Precision@5 | Precision@10 |
---|---|---|---|---|---|
I→S | SIFT+M | 8.55 | 4.56 | 4.65 | 4.56 |
I→S | DBLP [39] | 25.48 | 24.18 | 23.87 | 23.24 |
I→S | CNN+SPEC [22] | 26.25 | 29.5 | 25.52 | 23.65 |
I→S | DVAN [14] | 36.79 | 32.37 | 33.29 | 33.74 |
I→S | DTBH-D | 53.24 | 63.44 | 59.63 | 55.54 |
I→S | DTBH-R | 61.84 | 70.82 | 66.93 | 62.49 |
I→S | DTBH | 64.27 | 73.10 | 69.69 | 65.63 |

Task | Method | MAP | Precision@1 | Precision@5 | Precision@10 |
---|---|---|---|---|---|
S→I | SIFT+M | 26.50 | 34.48 | 24.48 | 23.28 |
S→I | DBLP [39] | 34.87 | 21.63 | 26.78 | 30.94 |
S→I | CNN+SPEC [22] | 35.72 | 17.24 | 27.76 | 31.21 |
S→I | DVAN [14] | 63.88 | 67.24 | 63.34 | 67.07 |
S→I | DTBH-D | 76.53 | 81.36 | 79.28 | 77.82 |
S→I | DTBH-R | 85.46 | 89.86 | 87.07 | 85.38 |
S→I | DTBH | 87.49 | 92.18 | 90.36 | 88.82 |

Task | Method | MAP | Precision@1 | Precision@5 | Precision@10 |
---|---|---|---|---|---|
I→S | SIFT+M | 31.67 | 11.21 | 35.00 | 37.59 |
I→S | DBLP [39] | 44.38 | 56.51 | 52.65 | 49.68 |
I→S | CNN+SPEC [22] | 46.67 | 58.62 | 55.00 | 51.64 |
I→S | DVAN [14] | 71.77 | 75.86 | 73.62 | 72.93 |
I→S | DTBH-D | 81.23 | 88.51 | 86.79 | 84.47 |
I→S | DTBH-R | 89.64 | 95.60 | 93.48 | 92.54 |
I→S | DTBH | 92.45 | 97.41 | 95.63 | 93.78 |

Task | Method | MAP | Precision@1 | Precision@5 | Precision@10 |
---|---|---|---|---|---|
S→I | SIFT+M | 4.85 | 3.66 | 3.60 | 3.54 |
S→I | DBLP [39] | 8.14 | 6.21 | 6.08 | 6.76 |
S→I | CNN+SPEC [22] | 9.96 | 7.13 | 7.00 | 7.44 |
S→I | DVAN [14] | 15.71 | 16.18 | 15.10 | 14.76 |
S→I | DTBH-D | 18.86 | 19.58 | 18.64 | 17.74 |
S→I | DTBH-R | 21.41 | 22.03 | 21.39 | 20.68 |
S→I | DTBH | 22.72 | 23.30 | 22.48 | 21.17 |

Task | Method | MAP | Precision@1 | Precision@5 | Precision@10 |
---|---|---|---|---|---|
I→S | SIFT+M | 5.04 | 6.22 | 5.34 | 4.50 |
I→S | DBLP [39] | 12.70 | 15.32 | 15.21 | 14.22 |
I→S | CNN+SPEC [22] | 13.24 | 16.82 | 16.62 | 15.69 |
I→S | DVAN [14] | 16.29 | 22.49 | 22.56 | 21.7 |
I→S | DTBH-D | 19.39 | 24.82 | 24.29 | 23.65 |
I→S | DTBH-R | 22.52 | 26.81 | 26.28 | 25.49 |
I→S | DTBH | 23.46 | 27.58 | 26.84 | 26.37 |
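
The MAP and Precision@k columns in the result tables are standard ranking metrics. The sketch below shows one common way to compute them from binary relevance lists, where each list marks whether the item at each rank shares the query's label. It is an assumption-laden illustration: the function names are hypothetical, and the exact evaluation protocol used for the tables (for example, how many returned items are scored) is not stated in this text.

```python
from typing import List, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of the precision values taken at each rank that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists: Sequence[Sequence[int]]) -> float:
    """Average of the per-query average precision over all queries."""
    if not relevance_lists:
        raise ValueError("at least one query is required")
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)
```

The functions return fractions in [0, 1]; multiplying by 100 gives percentages on the same scale as the tables.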
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Chen, Y.; Lu, X. A Deep Hashing Technique for Remote Sensing Image-Sound Retrieval. Remote Sens. 2020, 12, 84. https://doi.org/10.3390/rs12010084