A Hybrid Deep Learning and Knowledge Graph Approach for Intelligent Image Indexing and Retrieval
Abstract
1. Introduction
- Hybrid Deep Learning Model: We propose a novel architecture that combines EfficientNet and Vision Transformer (ViT), leveraging their complementary strengths for both local feature extraction and global contextual modeling.
- Knowledge Graph–Based Indexing: We design a multi-level knowledge graph (contextual, conceptual, and raw data layers) to structure multimedia content and enable semantic navigation beyond low-level features.
- Ontology-Guided Query Expansion: We introduce an ontology-driven query expansion mechanism that aligns user queries with semantically related concepts, improving retrieval relevance and user interaction.
- Comprehensive Evaluation: We conduct extensive experiments on three benchmark datasets (TRECVID, Corel, and MSCOCO), demonstrating significant improvements in precision, robustness, and scalability compared to state-of-the-art methods.
2. Related Works
2.1. Content-Based Retrieval
2.2. Semantic-Based Retrieval
2.3. Multimodal Fusion-Based Retrieval
2.4. Deep Learning-Based Retrieval
2.5. Discussion and Justification of the Choice
3. Intelligent and Generic Approach for Multimedia Indexing and Retrieval
3.1. Global View
3.2. Proposed Architecture
3.2.1. Data Preprocessing
3.2.2. Transition from Level 0 to Level 1 of the Knowledge Graph
- EfficientNetB1
- EfficientNetB1/Vision Transformer (ViT)
3.2.3. Weighting of Concepts at the First Level of the Graph
3.2.4. Incorporation of Semantic Relations via Word Embeddings
- cos(Emb(Ci), Emb(Cj)): the cosine similarity between the contextual embedding vectors of the concepts Ci and Cj extracted from BERT. This measure captures the direct semantic proximity between the concepts in their context.
- exp(-dist(Ci, Cj)): an exponential weighting of the ontological distance, taking into account the hierarchical or taxonomic relationships between the concepts.
- |Ci ∩ Cj|/|Ci ∪ Cj|: the co-occurrence rate of the concepts in the corpus, evaluating their common frequency in the processed documents.
- α, β, γ: adjustable hyperparameters according to the specific task and corpus, allowing control over the relative importance of each component of the similarity.
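The four components listed above can be combined into one score. The following is a minimal sketch, not the authors' implementation: it assumes the components are combined as a weighted sum, it reads |Ci ∩ Cj|/|Ci ∪ Cj| as a Jaccard ratio over the sets of documents in which each concept occurs, and it uses short toy vectors in place of BERT embeddings. The function names and the default values of alpha, beta, and gamma are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_similarity(emb_i, emb_j, onto_dist, docs_i, docs_j,
                      alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted sum of the three components (assumed combination rule).

    emb_i, emb_j : embedding vectors of concepts Ci and Cj (toy values here)
    onto_dist    : ontological distance dist(Ci, Cj), assumed non-negative
    docs_i/j     : identifiers of documents in which each concept occurs
    """
    return (alpha * cosine(emb_i, emb_j)
            + beta * math.exp(-onto_dist)
            + gamma * jaccard(docs_i, docs_j))


# Hypothetical example values for two related concepts.
score = hybrid_similarity(
    emb_i=[0.2, 0.9, 0.1],
    emb_j=[0.25, 0.8, 0.0],
    onto_dist=1.0,
    docs_i={"v1", "v2", "v3"},
    docs_j={"v2", "v3", "v4"},
)
print(round(score, 3))
```

With weights that sum to 1 and non-negative embeddings, the score stays in [0, 1], which makes it easy to compare concept pairs; tuning α, β, and γ shifts the balance between contextual, taxonomic, and co-occurrence evidence.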
3.2.5. Attention-Based Weighting for Better Contextualization
3.2.6. Our Ontology, the Result of the Indexing Phase
4. Retrieval Phase
4.1. Retrieval by Textual Query
Algorithm 1: Ontology-Based Query Expansion for Multimedia Retrieval

Input: user text query Q, ontology O, video collection V
Output: ranked list of videos R

Step 1: Query Indexing. Remove stop words and normalize the term weights, where ti are the query terms and |Q| is the number of terms after preprocessing.

Step 2: Concept Matching via Jaccard Similarity. For each concept Ck ∈ O, compute its similarity with Q:

J(Q, Ck) = |DQ ∩ DCk| / |DQ ∪ DCk|

where DQ and DCk are the descriptor sets of Q and Ck, respectively (e.g., Table 2). Select the top-N concepts Ctop = {C1, C2, …, CN} with the highest similarity.

Step 3: Ontological Projection for Refinement. Expand Ctop using the semantic relationships in O (see Section 3.2.5 for projection details).

Step 4: User Selection. The user manually selects the corresponding concepts from the expanded set; the selected concepts form Cfinal.

Step 5: Vector Space Matching with Cosine Similarity. Represent the query and the videos as concept vectors:

- Query vector: req = (PC1(req), PC2(req), …, PCM(req)), where PCj(req) = 1 if Cj ∈ Cfinal, else 0.
- Video vector: Vi = (PC1(Vi), PC2(Vi), …, PCM(Vi)), where PCj(Vi) is the precomputed weight of Cj in video Vi.

Compute the similarity for each Vi ∈ V:

Sim(req, Vi) = Σj PCj(req) · PCj(Vi) / (√(Σj PCj(req)²) · √(Σj PCj(Vi)²))

Notations:

- Vi: the video with index i.
- req: the user's query.
- PCj(Vi): the weight of concept j in video i.
- PCj(req): the weight of concept j in the query.

Step 6: Displaying Results. Videos are shown ranked according to their relevance to the query.

Key Features:

- Ontology Integration: Leverages the ontology (Figure 12) for query expansion and semantic refinement.
- Interactive Refinement: The user selects concepts after projection (Step 4) for personalized results.
- Hybrid Metrics: Combines Jaccard (term-level) and cosine (concept-level) similarities.
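The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the system's actual code: the stop-word list, the descriptor sets (tokenized by hand from the concept descriptions), the relation map, and the video weights are all hypothetical, and the manual selection of Step 4 is simulated by accepting every expanded concept.

```python
import math

STOP_WORDS = {"a", "an", "the", "in", "of", "on"}  # toy stop-word list (assumption)


def preprocess(query):
    """Step 1: lowercase, split on whitespace, drop stop words."""
    return {t for t in query.lower().split() if t not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard similarity between two descriptor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concepts(query_terms, ontology, top_n=2):
    """Step 2: rank concepts by Jaccard similarity and keep the top N."""
    ranked = sorted(ontology, key=lambda c: jaccard(query_terms, ontology[c]),
                    reverse=True)
    return ranked[:top_n]


def expand(concepts, relations):
    """Step 3: add concepts directly related to the matched ones."""
    expanded = set(concepts)
    for c in concepts:
        expanded.update(relations.get(c, ()))
    return expanded


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(final_concepts, video_weights, all_concepts):
    """Step 5: binary query vector against weighted video vectors."""
    query_vec = [1.0 if c in final_concepts else 0.0 for c in all_concepts]
    scores = {
        vid: cosine(query_vec, [w.get(c, 0.0) for c in all_concepts])
        for vid, w in video_weights.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: descriptor sets, relations, and per-video concept weights.
ontology = {
    "Airplane": {"shots", "airplane"},
    "Airplane Flying": {"airplane", "flying", "sky"},
    "Animal": {"shots", "depicting", "animal"},
}
relations = {"Airplane Flying": ["Airplane"]}
video_weights = {
    "v1": {"Airplane": 0.9, "Airplane Flying": 0.8},
    "v2": {"Animal": 0.95},
    "v3": {"Airplane": 0.4, "Animal": 0.5},
}

terms = preprocess("An airplane flying in the sky")
top = match_concepts(terms, ontology)
final = expand(top, relations)  # Step 4 simulated: the user accepts everything
ranking = rank_videos(final, video_weights, list(ontology))
for vid, score in ranking:  # Step 6: display in ranked order
    print(vid, round(score, 3))
```

In this toy run the video that carries both airplane concepts ranks first, and the video with no selected concept receives a score of 0. In an interactive system, `final` would instead come from the user's selection.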
4.2. Image Query Retrieval
4.3. User Interface (e.g., Textual Query)
5. Experimentation
5.1. Precision Values
5.2. Image Query
5.3. Ablation Study
5.4. Other Metrics
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hamroun, M.; Lajmi, S.; Nicolas, H.; Amous, I. VISEN: A video interactive retrieval engine based on semantic network in large video collections. In Proceedings of the 23rd International Database Applications & Engineering Symposium (IDEAS), New York, NY, USA, 10–12 June 2019; pp. 1–10.
- Chen, J.; Mao, J.; Liu, Y.; Zhang, F.; Min, Z.; Ma, S. Towards a better understanding of query reformulation behavior in web search. In Proceedings of the WWW ’21: The Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021.
- Ntirogiannis, K.; Gatos, B.; Pratikakis, I. Binarization of textual content in video frames. In Proceedings of the 2011 International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, 18–21 September 2011; pp. 673–677.
- Christel, M.G.; Hauptmann, A.G. The use and utility of high-level semantic features in video retrieval. In Image and Video Retrieval; Leow, W.K., Lew, M.S., Chua, T.S., Ma, W.Y., Chaisorn, L., Bakker, E.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 134–144.
- Snoek, C.; Worring, M.; Koelma, D.; Smeulders, A. A learned lexicon-driven paradigm for interactive video retrieval. IEEE Trans. Multimed. 2007, 9, 280–292.
- Worring, M.; Snoek, C.; de Rooij, O.; Nguyen, G.; van Balen, R.; Koelma, D. Mediamill: Advanced browsing in news video archives. Lect. Notes Comput. Sci. 2006, 4071, 533–536.
- Vrochidis, S.; Moumtzidou, A.; King, P.; Dimou, A.; Mezaris, V.; Kompatsiaris, I. VERGE: A video interactive retrieval engine. In Proceedings of the 2010 International Workshop on Content Based Multimedia Indexing (CBMI), Grenoble, France, 23–25 June 2010; pp. 1–6.
- Furnas, G.W.; Landauer, T.K.; Gomez, L.M.; Dumais, S.T. The vocabulary problem in human-system communication. Commun. ACM 1987, 30, 964–971.
- Maron, M.E.; Kuhns, J.L. On relevance, probabilistic indexing and information retrieval. J. ACM 1960, 7, 216–244.
- Rocchio, J.J. Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing; Salton, G., Ed.; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971; pp. 313–323.
- Jones, K.S. Automatic Keyword Classification for Information Retrieval. Available online: https://api.semanticscholar.org/CorpusID:62724133 (accessed on 4 September 2025).
- Van Rijsbergen, C.J. A theoretical basis for the use of co-occurrence data in information retrieval. J. Doc. 1977, 33, 106–119.
- Van Rijsbergen, C.J. A non-classical logic for information retrieval. Comput. J. 1986, 29, 481–485.
- Porter, M. Implementing a probabilistic information retrieval system. Inf. Technol. Res. Dev. 1982, 1, 131–156.
- Yu, C.T.; Buckley, C.; Lam, K.; Salton, G. A Generalized Term Dependence Model in Information Retrieval; Technical Report; Cornell University: Ithaca, NY, USA, 1983.
- Harman, D. Relevance Feedback Revisited; Association for Computing Machinery: New York, NY, USA, 1992.
- Statista: Average Number of Search Terms for Online Search Queries in the United States as of January 2020. Available online: https://www.statista.com/statistics/269740/number-of-search-terms-in-internet-research-in-the-us/ (accessed on 4 September 2025).
- Keyword Discovery: Keyword: Query Size by Country. Available online: https://www.keyworddiscovery.com/keyword-stats.html (accessed on 4 September 2025).
- Azad, H.; Deepak, A.; Chakraborty, C.; Abhishek, A.K. Improving query expansion using pseudo-relevant web knowledge for information retrieval. Pattern Recognit. Lett. 2022, 158, 148–156.
- Azad, H.K.; Deepak, A. Query expansion techniques for information retrieval: A survey. Inf. Process. Manag. 2019, 56, 1698–1735.
- Hu, W.M.; Xie, N.H.; Li, L.; Zeng, X.L.; Maybank, S. A survey on visual content-based video indexing and retrieval. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2011, 41, 797–819.
- Etter, D. KB Video Retrieval at TRECVID 2011. 2009. Available online: https://www-nlpir.nist.gov/projects/tvpubs/tv11.papers/kbvr.pdf (accessed on 4 September 2025).
- Ellouze, N.; Lammari, N.; Métais, E.; Ahmed, M.B. CITOM: Approche de construction incrémentale d’une Topic Map multilingue. Data Knowl. Eng. 2010.
- Rossetto, L.; Giangreco, I.; Ta, C.; Schuldt, H. Multimodal video retrieval with the 2017 IMOTION system. In Proceedings of the ICMR ’17: International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, 6 June 2017; pp. 457–460.
- Spolaôr, N.; Lee, H.D.; Takaki, W.S.R.; Ensina, L.A.; Coy, C.S.R.; Wu, F.C. A systematic review on content-based video retrieval. Eng. Appl. Artif. Intell. 2020, 90, 103557.
- Wu, S.; Li, Y.; Zhu, K.; Zhang, G.; Liang, Y.; Ma, K.; Xiao, C.; Zhang, H.; Yang, B.; Chen, W.; et al. SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval. arXiv 2024.
- Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S.; Kakao Brain Large-Scale AI Studio. Coyo-700m: Image-Text Pair Dataset. GitHub. 2022. Available online: https://github.com/kakaobrain/coyo-dataset (accessed on 4 September 2025).
- Chen, W.; Hu, H.; Chen, X.; Verga, P.; Cohen, W. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5558–5570.
- Cheng, X.; Cao, B.; Ye, Q.; Zhu, Z.; Li, H.; Zou, Y. ML-LMCL: Mutual learning and large-margin contrastive learning for improving ASR robustness in spoken language understanding. Find. Assoc. Comput. Linguist. ACL 2023, 2023, 6492–6505.
- Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv 2023, arXiv:2304.15010.
- Goldsack, T.; Zhang, Z.; Lin, C.; Scarton, C. Domain-Driven and Discourse-Guided Scientific Summarisation. In European Conference on Information Retrieval; Springer: Cham, Switzerland, 2023; pp. 361–376.
- Feki, I.; Ba, A.; Alimi, A. New process to identify audio concepts based on binary classifiers encapsulation. Int. J. Comput. Electr. Eng. 2012, 4, 515–518.
- Elleuch, N.; Zarka, M.; Feki, I.; Ba, A.; Alimi, A. Regimvid at Trecvid2010: Semantic Indexing. In Proceedings of the TRECVID 2010 Workshop, Gaithersburg, MD, USA, 15–17 November 2010.
- Elleuch, N.; Ba, A.; Alimi, A. A generic framework for semantic video indexing based on visual concepts/contexts detection. Multimed. Tools Appl. 2014, 74, 1397–1421.
- Smeulders, A.; Worring, M.; Santini, S.; Gupta, A.; Jain, R. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1349–1380.
- Toriah, S.T.M.; Ghalwash, A.Z.; Youssif, A.A.A. Semantic-based video retrieval survey. J. Comput. Commun. 2023, 6, 28–44.
- Sjoberg, M.; Viitaniemi, V.; Koskela, M.; Laaksonen, J. PicSOM Experiments in TRECVID 2009. Available online: https://research.cs.aalto.fi/cbir/papers/trecvid2009.pdf (accessed on 4 September 2025).
- Slimi, J.; Mansouri, S.; Ammar, A.B.; Alimi, A.M. Video exploration tool based on semantic network. In Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, OAIR ’13, Lisbon, Portugal, 15–17 May 2013; pp. 213–214.
- Slimi, J.; Ammar, A.B.; Alimi, A.M. Interactive Video Data Visualization System Based on Semantic Organization. In Proceedings of the 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 17–19 June 2013; pp. 161–166.
- Amato, G.; Bolettieri, P.; Carrara, F.; Falchi, F.; Gennaro, C.; Messina, N.; Vadicamo, L.; Vairo, C. VISIONE at video browser showdown 2023. In MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, 9–12 January 2023, Proceedings, Part I; Springer International Publishing: Cham, Switzerland, 2023; pp. 615–621.
- Fang, H.; Xiong, P.; Xu, L.; Chen, Y. Clip2video: Mastering video-text retrieval via image clip. arXiv 2021, arXiv:2106.11097.
- Messina, N.; Stefanini, M.; Cornia, M.; Baraldi, L.; Falchi, F.; Amato, G.; Cucchiara, R. ALADIN: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. arXiv 2022, arXiv:2207.14757.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763.
- Amato, G.; Bolettieri, P.; Carrara, F.; Debole, F.; Falchi, F.; Gennaro, C.; Vadicamo, L.; Vairo, C. The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval. J. Imaging 2021, 7, 76.
- Amato, G.; Carrara, F.; Falchi, F.; Gennaro, C.; Vadicamo, L. Large-scale instance-level image retrieval. Inf. Process. Manag. 2019, 56, 102100.
- Carrara, F.; Vadicamo, L.; Gennaro, C.; Amato, G. Approximate nearest neighbor search on standard search engines. In Similarity Search and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 214–221.
- Gurrin, C.; Zhou, L.; Healy, G.; Jónsson, B.Þ.; Dang-Nguyen, D.-T.; Lokoč, J.; Tran, M.-T.; Hürst, W.; Rossetto, L.; Schöffmann, K. Introduction to the fifth annual lifelog search challenge, LSC’22. In Proceedings of the ICMR ’22: International Conference on Multimedia Retrieval (ICMR’22), Newark, NJ, USA, 27–30 June 2022; Association for Computing Machinery: New York, NY, USA, 2022.
- Heller, S.; Gsteiger, V.; Bailer, W.; Gurrin, C.; Jónsson, B.Þ.; Lokoč, J.; Leibetseder, A.; Mejzlík, F.; Peška, L.; Rossetto, L.; et al. Interactive video retrieval evaluation at a distance: Comparing sixteen interactive video search systems in a remote setting at the 10th Video Browser Showdown. Int. J. Multimed. Inf. Retr. 2022, 11, 1–18.
- Lokoč, J.; Bailer, W.; Schoeffmann, K.; Muenzer, B.; Awad, G. On influential trends in interactive video retrieval: Video Browser Showdown 2015–2017. IEEE Trans. Multimed. 2018, 20, 3361–3376.
- Lokoč, J.; Vopálková, Z.; Dokoupil, P.; Peška, L. Video search with CLIP and interactive text query reformulation. In MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, 9–12 January 2023, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2023; pp. 628–633.
- Halima, B.H.; Hamroun, M.; Moussa, S.B.; Alimi, A.M. An interactive engine for multilingual video browsing using semantic content. arXiv 2013.
- Zhang, Z.; Li, W.; Gurrin, C.; Smeaton, A.F. Faceted navigation for browsing large video collection. In MultiMedia Modeling; Tian, Q., Sebe, N., Qi, G.J., Huet, B., Hong, R., Liu, X., Eds.; Springer: Cham, Switzerland, 2016; pp. 412–417.
- Galanopoulos, D.; Markatopoulou, F.; Mezaris, V.; Patras, I. Concept language models and event-based concept number selection for zero-example event detection. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval ICMR ’17, New York, NY, USA, 6–9 June 2017; pp. 397–401.
- Janwe, N.; Bhoyar, K. Semantic concept based video retrieval using convolutional neural network. SN Appl. Sci. 2020, 2, 80.
- Amato, F.; Greco, L.; Persia, F.; Poccia, S.R.; De Santo, A. Content-Based Multimedia Retrieval. In Data Management in Pervasive Systems; Colace, F., De Santo, M., Moscato, V., Picariello, A., Schreiber, F.A., Tanca, L., Eds.; Springer: Cham, Switzerland, 2015; pp. 291–310.
- Faudemay, P.; Seyrat, C. Intelligent delivery of personalised video programmes from a video database. In Proceedings of the Database and Expert Systems Applications, 8th International Conference (DEXA ’97), Toulouse, France, 1–2 September 1997; pp. 172–177.
- Meng, L.; Tan, A.H.; Xu, D. Semi-Supervised Heterogeneous Fusion for Multimedia Data Co-Clustering. IEEE Trans. Knowl. Data Eng. 2013, 26, 2293–2306.
- Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 439–448.
- Xu, J.; Huang, F.; Zhang, X.; Wang, S.; Li, C.; Li, Z.; He, Y. Visual-textual sentiment classification with bi-directional multi-level attention networks. Knowl.-Based Syst. 2019, 178, 61–73.
- Xu, J.; Huang, F.; Zhang, X.; Wang, S.; Li, C.; Li, Z.; He, Y. Sentiment analysis of social images via hierarchical deep fusion of content and links. Appl. Soft Comput. 2019, 80, 387–399.
- Huang, F.; Zhang, X.; Zhao, Z.; Xu, J.; Li, Z. Image-text sentiment analysis via deep multimodal attentive fusion. Knowl.-Based Syst. 2019, 167, 26–37.
- Yadav, A.; Vishwakarma, D.K. Sentiment analysis using deep learning architectures: A review. Artif. Intell. Rev. 2019, 53, 4335–4385.
- Xu, N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; pp. 152–154.
- Chen, F.; Ji, R.; Su, J.; Cao, D.; Gao, Y. Predicting microblog sentiments via weakly supervised multimodal deep learning. IEEE Trans. Multimed. 2017, 20, 997–1007.
- Zhao, Z.; Zhu, H.; Xue, Z.; Liu, Z.; Tian, J.; Chua, M.; Liu, M. An image-text consistency driven multimodal sentiment analysis approach for social media. Inf. Process. Manag. 2019, 56, 102097.
- Yu, J.; Jiang, J.; Xia, R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 429–439.
- Liu, A.A.; Shao, Z.; Wong, Y.; Li, J.; Yu-Ting, S.; Kankanhalli, M. LSTM-based multi-label video event detection. Multimed. Tools Appl. 2019, 78, 677–695.
- Shao, Z.; Han, J.; Debattista, K.; Pang, Y. Textual context-aware dense captioning with diverse words. IEEE Trans. Multimed. 2023, 25, 8753–8766.
- Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling up vision-language pretraining for image captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17959–17968.
- Shao, Z.; Han, J.; Marnerides, D.; Debattista, K. Region-object relation-aware dense captioning via transformer. IEEE Trans. Neural Netw. Learn. Syst. 2022, 36, 4184–4195.
- Mahrishi, M.; Morwal, S.; Muzaffar, A.W.; Bhatia, S.; Dadheech, P.; Rahmani, M.K.I. Video Index Point Detection and Extraction Framework Using Custom YoloV4 Darknet Object Detection Model. IEEE Access 2021, 9, 143378–143391.
- Riedl, M.; Biemann, C. TopicTiling: A text segmentation algorithm based on LDA. In Proceedings of the ACL 2012 Student Research Workshop, Jeju Island, Republic of Korea, 9–11 July 2012; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 37–42.
- Uke, N. Segmentation and organization of lecture video based on visual contents. Int. J. e-Educ. e-Bus e-Manag. e-Learn. 2012, 2, 132.
- Podlesnaya, A.; Podlesnyy, S. Deep learning based semantic video indexing and retrieval. In Proceedings of the SAI Intelligent Systems Conference (IntelliSys) 2016, London, UK, 21–22 September 2016; Springer: Cham, Switzerland; pp. 359–372.
- Lu, W.; Sun, H.; Chu, J.; Huang, X.; Yu, J. A novel approach for video text detection and recognition based on a corner response feature map and transferred deep convolutional neural network. IEEE Access 2018, 6, 40198–40211.
- Li, Z.; Liu, X.; Zhang, S. Shot boundary detection based on multilevel difference of colour histograms. In Proceedings of the 2016 First International Conference on Multimedia and Image Processing (ICMIP), Bandar Seri Begawan, Brunei, 1–3 June 2016; pp. 15–22.
- Xu, J.; Song, L.; Xie, R. Shot boundary detection using convolutional neural networks. In Proceedings of the 2016 Visual Communications and Image Processing (VCIP), Chengdu, China, 27–30 November 2016; pp. 1–4.
- Gao, J.; Xu, C. Learning Video Moment Retrieval Without a Single Annotated Video. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1646–1657.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Tang, J.; Wang, K.; Shao, L. Supervised Matrix Factorization Hashing for Cross-Modal Retrieval. IEEE Trans. Image Process. 2016, 25, 3157–3166.
- Tang, J.; Li, Z.; Zhu, X. Supervised deep hashing for scalable face image retrieval. Pattern Recognit. 2018, 75, 25–32.
- Liong, V.E.; Lu, J.; Wang, G.; Moulin, P.; Zhou, J. Deep hashing for compact binary codes learning. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2475–2483.
- Li, W.-J.; Wang, S.; Kang, W.-C. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 1711–1717.
- Do, T.-T.; Doan, A.-D.; Cheung, N.-M. Learning to Hash With Binary Deep Neural Network. In Lecture Notes in Computer Science; Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 219–234.
- Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv 2018, arXiv:1707.05612.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Jin, L.; Li, Z.; Tang, J. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 1838–1851.
- Ding, G.; Guo, Y.; Zhou, J. Collective Matrix Factorization Hashing for Multimodal Data. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2075–2082.
- Zhou, J.; Ding, G.; Guo, Y. Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the SIGIR ’14: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, Gold Coast, QLD, Australia, 6–11 July 2014; pp. 415–424.
- Lin, Z.; Ding, G.; Hu, M.; Wang, J. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3864–3872.
- Masci, J.; Bronstein, M.M.; Bronstein, A.M.; Schmidhuber, J. Multimodal similarity-preserving hashing. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 824–830.
- Jiang, Q.-Y.; Li, W.-J. Deep cross-modal hashing. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3270–3278.
- Kunal, B.; Kaur, K.; Choudhary, C. A Machine learning model for content-based image retrieval. In Proceedings of the 2023 2nd International Conference for Innovation in Technology (INOCON), Bangalore, India, 3–5 March 2023; pp. 1–6.
- Manjunath, B.S.; Ma, W.Y. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 18, 837–842.
- Deng, Y.; Manjunath, B.S. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 800–810.
- Park, B.; Park, H.; Lee, S.M.; Seo, J.B.; Kim, N. Lung segmentation on HRCT and volumetric CT for diffuse interstitial lung disease using deep convolutional neural networks. J. Digit. Imaging 2019, 32, 1019–1026.
- Travis, W.D.; Costabel, U.; Hansell, D.M.; King, T.E., Jr.; Lynch, D.A.; Nicholson, A.G.; Ryerson, C.J.; Ryu, J.H.; Selman, M.; Wells, A.U.; et al. An official American Thoracic Society/European Respiratory Society statement: Update of the international multidisciplinary classification of the idiopathic interstitial pneumonias. Am. J. Respir. Crit. Care Med. 2013, 188, 733–748.
- Kunal, P.; Singh, P.; Hirani, N. A Cohesive relation between cybersecurity and information security. In Proceedings of the 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 7–9 October 2022; pp. 1–6.
- Hwang, H.J.; Seo, J.B.; Lee, S.M.; Kim, E.Y.; Park, B.; Bae, H.J.; Kim, N. Content-based image retrieval of chest CT with convolutional neural network for diffuse interstitial lung disease: Performance assessment in three major idiopathic interstitial pneumonias. Korean J. Radiol. 2021, 22, 281–290.
- Duan, G.; Yang, J.; Yang, Y. Content-based image retrieval research. Phys. Procedia 2011, 22, 471–477.
- Latif, A.; Rasheed, A.; Sajid, U. Content-based image retrieval and feature extraction: A comprehensive review. Math. Probl. Eng. 2019, 2019, 9658350.
- Depeursinge, A.; Vargas, A.; Gaillard, F.; Platon, A.; Geissbuhler, A.; Poletti, P.-A.; Müller, H. Case-based lung image categorization and retrieval for interstitial lung diseases: Clinical workflows. Int. J. CARS 2012, 7, 97–110.
- Raghu, G.; Collard, H.R.; Egan, J.J.; Martinez, F.J.; Behr, J.; Brown, K.K.; Colby, T.V.; Cordier, J.-F.; Flaherty, K.R.; Lasky, J.A.; et al. An official ATS/ERS/JRS/ALAT statement: Idiopathic pulmonary fibrosis: Evidence-based guidelines for diagnosis and management. Am. J. Respir. Crit. Care Med. 2011, 183, 788–824.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Cambridge, MA, USA, 2019; Volume 97, pp. 6105–6114.
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 10096–10106.
- Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254.
- Fadaei, S.; Amirfattahi, R.; Ahmadzadeh, M.R. A new content-based image retrieval system based on optimized integration of DCD, wavelet and curvelet features. IET Image Process. 2017, 11, 89–98.
- Dubey, S.R.; Singh, S.K.; Singh, R.K. Rotation and scale invariant hybrid image descriptor and retrieval. Comput. Electr. Eng. 2015, 46, 288–302.
- Talib, A.; Mahmuddin, M.; Husni, H.; George, L.E. A weighted dominant color descriptor for content-based image retrieval. J. Vis. Commun. Image Represent. 2013, 24, 345–360.
- Jhanwar, N.; Chaudhuri, S.; Seetharaman, G.; Zavidovique, B. Content-based image retrieval using motif co-occurrence matrix. Image Vis. Comput. 2004, 22, 1211–1220.
- Lin, C.-H.; Chen, R.-T.; Chan, Y.-K. A smart content-based image retrieval system based on color and texture feature. Image Vis. Comput. 2009, 27, 658–666.
- ElAlami, M.E. A novel image retrieval model based on the most relevant features. Knowl.-Based Syst. 2011, 24, 23–32.
- Murala, S.; Maheshwari, R.P.; Balasubramanian, R. Local tetra patterns: A new feature descriptor for content-based image retrieval. IEEE Trans. Image Process. 2012, 21, 2874–2886.
- Kundu, M.K.; Chowdhury, M.; Bulo, S.R. A graph-based relevance feedback mechanism in content-based image retrieval. Knowl.-Based Syst. 2015, 73, 254–264.
- Yildizer, E.; Balci, A.M.; Jarada, T.N.; Alhajj, R. Integrating wavelets with clustering and indexing for effective content-based image retrieval. Knowl.-Based Syst. 2012, 31, 55–66.
- Hamroun, M.; Lajmi, S.; Nicolas, H.; Amous, I. ISE: Interactive image search using visual content. In 20th International Conference, ICEIS 2018, Funchal, Madeira, Portugal, 21–24 March 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 253–261.

| Stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Operations | Conv 3 × 3 | MBConv1 k3 × 3 | MBConv6 k3 × 3 | MBConv6 k5 × 5 | MBConv6 k3 × 3 | MBConv6 k5 × 5 | MBConv6 k5 × 5 | MBConv6 k3 × 3 | Conv 1 × 1/Pool/FC |
| Output size | 224 × 224 | 112 × 112 | 112 × 112 | 56 × 56 | 28 × 28 | 14 × 14 | 14 × 14 | 7 × 7 | 7 × 7 |
| #Channels | 32 | 16 | 24 | 40 | 80 | 112 | 192 | 320 | 1280 |
| #Layers | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 1 | 1 |

| Concepts | Description (Descriptor Vector) |
|---|---|
| Actor | One or more television or movie actors or actresses | 
| Adult | Shots showing a person over the age of 18 | 
| Airplane | Shots of an airplane | 
| Airplane Flying | An airplane flying in the sky | 
| Animal | Shots depicting an animal (no humans) | 
| Asian people | People of Asian ethnicity | 

Table: Excerpt from ontology concepts

| Reference (Corel-1K) | Africa | Beach | Building | Bus | Dinosaur | Elephant | Flower | Horse | Mountain | Food | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [109] | 72.4 | 51.15 | 59.55 | 92.35 | 99.9 | 72.7 | 92.25 | 96.6 | 55.75 | 72.35 | 76.5 | 
| [110] | 45.25 | 39.75 | 37.35 | 74.1 | 91.45 | 30.4 | 85.15 | 56.8 | 29.25 | 36.95 | 52.64 | 
| [111] | 68.3 | 54.0 | 56.15 | 88.8 | 99.25 | 65.8 | 89.1 | 80.25 | 52.15 | 73.25 | 72.7 | 
| [112] | 70.3 | 56.1 | 57.1 | 87.6 | 98.7 | 67.5 | 91.4 | 83.4 | 53.6 | 74.1 | 73.98 | 
| [113] | 54.95 | 39.4 | 39.6 | 84.3 | 94.7 | 36.0 | 85.85 | 57.5 | 29.45 | 56.7 | 57.85 | 
| [114] | 49.95 | 71.25 | 30.1 | 79.75 | 92.05 | 59.45 | 99.5 | 82.25 | 54.6 | 20.2 | 63.91 | 
| [115] | 73.05 | 59.35 | 61.1 | 69.15 | 99.15 | 80.1 | 80.15 | 89.1 | 58.0 | 74.5 | 74.36 | 
| [116] | 68.95 | 41.1 | 74.3 | 64.4 | 99.55 | 56.65 | 86.55 | 93.2 | 55.15 | 77.95 | 71.78 | 
| [117] | 59.9 | 50.85 | 50.15 | 94.0 | 97.6 | 46.65 | 87.5 | 76.5 | 35.25 | 56.25 | 65.47 | 
| [118] | 88.5 | 79.5 | 67.65 | 100 | 100 | 93.1 | 100 | 100 | 77.75 | 89.3 | 89.5 | 
| Our system | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 
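
The Average column of the Corel-1K table is consistent with an unweighted mean of the ten per-category precision values. As a minimal sketch (the helper name `row_average` is hypothetical and not taken from the original work), the row for [109] can be checked as follows:

```python
def row_average(scores):
    """Unweighted mean of per-category precision values (in percent)."""
    return sum(scores) / len(scores)


# Per-category precision for reference [109], copied from the table above in
# column order: Africa, Beach, Building, Bus, Dinosaur, Elephant, Flower,
# Horse, Mountain, Food.
ref_109 = [72.4, 51.15, 59.55, 92.35, 99.9, 72.7, 92.25, 96.6, 55.75, 72.35]

# Round to two decimals, matching the precision used in the table.
average_109 = round(row_average(ref_109), 2)
print(average_109)  # 76.5
```

The same check can be applied to any other row by swapping in its ten category values.
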

| ID | Semantic Concepts | Low-Level Features | EfficientNet | ViT | EfficientNet + ViT | EfficientNet + ViT + Knowledge Graph |
|---|---|---|---|---|---|---|
| 1 | Airplane | 0.23 | 0.45 | 0.55 | 0.90 | 0.95 | 
| 2 | Anchorperson | 0.43 | 0.88 | 0.77 | 0.97 | 0.99 | 
| 3 | Basketball | 0.16 | 0.16 | 0.34 | 0.88 | 0.93 | 
| 4 | Bicycling | 0.33 | 0.56 | 0.55 | 0.78 | 0.89 | 
| 5 | Boat_Ship | 0.40 | 0.61 | 0.71 | 0.78 | 0.89 | 
| 6 | Bridges | 0.15 | 0.35 | 0.31 | 0.88 | 0.94 | 
| 7 | Bus | 0.09 | 0.17 | 0.22 | 0.66 | 0.77 | 
| 8 | Car_Racing | 0.03 | 0.03 | 0.03 | 0.55 | 0.75 | 
| 9 | Cheering | 0.02 | 0.11 | 0.11 | 0.76 | 0.79 | 
| 10 | Computers | 0.21 | 0.57 | 0.45 | 0.79 | 0.85 | 
| 11 | Dancing | 0.08 | 0.08 | 0.12 | 0.72 | 0.84 | 
| 12 | Demonstration_Or_Protest | 0.13 | 0.35 | 0.39 | 0.85 | 0.90 | 
| 13 | Explosion_Fire | 0.08 | 0.18 | 0.14 | 0.79 | 0.89 | 
| 14 | Government-Leader | 0.35 | 0.57 | 0.66 | 0.93 | 0.98 |
| 15 | Instrumental_Musician | 0.40 | 0.80 | 0.86 | 0.95 | 0.98 |
| 16 | Kitchen | 0.32 | 0.55 | 0.59 | 0.95 | 0.98 | 
| 17 | Motorcycle | 0.11 | 0.19 | 0.33 | 0.70 | 0.79 | 
| 18 | Office | 0.32 | 0.40 | 0.53 | 0.90 | 0.97 |
| 19 | Old_People | 0.10 | 0.21 | 0.34 | 0.65 | 0.88 | 
| 20 | Press_Conference | 0.09 | 0.23 | 0.29 | 0.55 | 0.78 | 
| 21 | Running | 0.12 | 0.24 | 0.25 | 0.78 | 0.89 | 
| 22 | Telephones | 0.11 | 0.29 | 0.25 | 0.78 | 0.89 | 
| 23 | Throwing | 0.01 | 0.05 | 0.11 | 0.55 | 0.75 | 
| 24 | Flags | 0.23 | 0.33 | 0.29 | 0.77 | 0.79 | 
| 25 | Hill | 0.17 | 0.25 | 0.35 | 0.89 | 0.95 | 
| 26 | Lakes | 0.21 | 0.21 | 0.34 | 0.85 | 0.93 | 
| 27 | Quadruped | 0.33 | 0.55 | 0.65 | 0.94 | 0.98 | 
| 28 | Soldiers | 0.24 | 0.31 | 0.43 | 0.80 | 0.89 | 
| 29 | Studio_With_Anchorperson | 0.40 | 0.77 | 0.88 | 0.92 | 0.97 |
| 30 | Traffic | 0.14 | 0.27 | 0.44 | 0.82 | 0.95 | 

| Concept | Precision | Sensitivity | F1-Score | mAP@0.5 | mAP@[0.5:0.95] |
|---|---|---|---|---|---|
| Airplane | 0.95 | 0.92 | 0.94 | 0.96 | 0.75 | 
| Anchorperson | 0.99 | 0.97 | 0.98 | 0.99 | 0.80 | 
| Basketball | 0.93 | 0.90 | 0.92 | 0.94 | 0.72 | 
| Bicycling | 0.89 | 0.87 | 0.88 | 0.91 | 0.70 | 
| Boat_Ship | 0.89 | 0.85 | 0.87 | 0.90 | 0.68 | 
| Bridges | 0.94 | 0.90 | 0.92 | 0.95 | 0.74 | 
| Bus | 0.77 | 0.75 | 0.76 | 0.80 | 0.60 | 
| Car_Racing | 0.75 | 0.72 | 0.73 | 0.78 | 0.58 | 
| Cheering | 0.79 | 0.76 | 0.77 | 0.81 | 0.63 | 
| Computers | 0.85 | 0.83 | 0.84 | 0.87 | 0.69 | 
| Dancing | 0.84 | 0.81 | 0.83 | 0.86 | 0.67 | 
| Demonstration_Or_Protest | 0.90 | 0.87 | 0.89 | 0.92 | 0.73 | 
| Explosion_Fire | 0.89 | 0.86 | 0.88 | 0.91 | 0.70 | 
| Government-Leader | 0.98 | 0.96 | 0.97 | 0.99 | 0.79 |
| Instrumental_Musician | 0.98 | 0.95 | 0.96 | 0.98 | 0.78 | 
| Kitchen | 0.98 | 0.94 | 0.96 | 0.98 | 0.77 | 
| Motorcycle | 0.79 | 0.76 | 0.77 | 0.82 | 0.64 | 
| Office | 0.97 | 0.94 | 0.95 | 0.98 | 0.76 | 
| Old_People | 0.88 | 0.85 | 0.86 | 0.90 | 0.71 | 
| Press_Conference | 0.78 | 0.75 | 0.76 | 0.80 | 0.62 | 
| Running | 0.89 | 0.86 | 0.88 | 0.91 | 0.70 | 
| Telephones | 0.89 | 0.85 | 0.87 | 0.90 | 0.69 | 
| Throwing | 0.75 | 0.72 | 0.73 | 0.78 | 0.57 | 
| Flags | 0.79 | 0.75 | 0.77 | 0.81 | 0.60 | 
| Hill | 0.95 | 0.91 | 0.93 | 0.96 | 0.75 | 
| Lakes | 0.93 | 0.89 | 0.91 | 0.94 | 0.73 | 
| Quadruped | 0.98 | 0.96 | 0.97 | 0.99 | 0.79 | 
| Soldiers | 0.89 | 0.86 | 0.88 | 0.91 | 0.70 | 
| Studio_With_Anchorperson | 0.97 | 0.94 | 0.95 | 0.98 | 0.76 | 
| Traffic | 0.95 | 0.91 | 0.93 | 0.96 | 0.75 | 
| Avg | 0.95 | 0.91 | 0.93 | 0.96 | 0.74 | 
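
The F1-Score column can be related to the Precision and Sensitivity columns through the harmonic mean, F1 = 2PR/(P + R), treating sensitivity as recall. The sketch below assumes this standard definition, which is not restated alongside the table, and `f1_score` is a hypothetical helper name. Because the table entries are already rounded to two decimals, recomputed values may differ from the published ones in the last digit for some rows.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity) pairs copied from two rows of the table above.
rows = {
    "Anchorperson": (0.99, 0.97),
    "Bus": (0.77, 0.75),
}

for concept, (p, r) in rows.items():
    # Round to two decimals, matching the precision used in the table.
    print(concept, round(f1_score(p, r), 2))
# Anchorperson 0.98
# Bus 0.76
```
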

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hamroun, M.; Sauveron, D. A Hybrid Deep Learning and Knowledge Graph Approach for Intelligent Image Indexing and Retrieval. Appl. Sci. 2025, 15, 10591. https://doi.org/10.3390/app151910591