Article

A Hybrid Deep Learning and Knowledge Graph Approach for Intelligent Image Indexing and Retrieval

by Mohamed Hamroun 1,2,* and Damien Sauveron 1

1 Department of Computer Science, XLIM, UMR CNRS 7252, University of Limoges, Avenue Albert Thomas, 87060 Limoges, France
2 3iL Ingénieurs, 43 Rue de Sainte Anne, 87015 Limoges, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10591; https://doi.org/10.3390/app151910591
Submission received: 12 August 2025 / Revised: 4 September 2025 / Accepted: 15 September 2025 / Published: 30 September 2025
(This article belongs to the Special Issue Application of Deep Learning and Big Data Processing)

Abstract

Technological advancements have enabled users to digitize and store an unlimited number of multimedia documents, including images and videos. However, the heterogeneous nature of multimedia content poses significant challenges for efficient indexing and retrieval. Traditional approaches primarily focus on visual features, often neglecting the semantic context, which limits retrieval efficiency. This paper proposes a hybrid deep learning and knowledge graph approach for intelligent image indexing and retrieval. By integrating deep learning models such as EfficientNet and Vision Transformer (ViT) with structured knowledge graphs, the proposed framework enhances semantic understanding and retrieval performance. The methodology incorporates feature extraction, concept classification, and hierarchical knowledge graph structuring to facilitate effective multimedia retrieval. Experimental results on benchmark datasets, including TRECVID, Corel, and MSCOCO, demonstrate significant improvements in precision and robustness, together with effective query expansion. The findings highlight the potential of combining deep learning with knowledge graphs to bridge the semantic gap and optimize multimedia indexing and retrieval.

1. Introduction

Technological advancements have enabled users to digitize and store an unlimited number of documents, extending beyond texts to include images and videos. The volume of audiovisual information, such as social media data, has surged with the advent of high-speed internet and powerful computers. Simultaneously, increased storage capacity has made video utilization more accessible across various domains, further facilitated by decreasing costs. The significance of a document (image or video) lies primarily in its richness and semantic expression. However, the heterogeneous nature of multimedia documents enhances their semantic expressiveness while also introducing structural ambiguity, as they integrate multiple media types such as text, images, and sound. Beyond the inherent complexity of video data structures, we are witnessing a proliferation of diverse sources, including news broadcasts, teleconsultations, sports programs, films, documentaries, reality shows, and surveillance recordings. Each type of document possesses a distinct structure, raising essential questions: How can we efficiently retrieve relevant images from such collections? How can an image be meaningfully characterized? How should databases be structured to facilitate access? While some initial answers have emerged in recent years, significant challenges remain in automatically indexing the vast and unstructured influx of images and videos in our daily lives.
Information retrieval and indexing systems have traditionally focused on describing purely visual content. The new challenge lies in incorporating an automatic semantic description of content alongside visual features. In real-world scenarios, users often prefer searching for documents using semantic information, such as events or specific concepts, to obtain the most relevant results. However, most current systems do not adequately address this need, as they often prioritize a single type of media. For instance, many video indexing approaches rely solely on visual features, making it difficult to effectively handle semantic information, except for a few specialized applications such as video surveillance and sports broadcasting.
In recent years, content-based visual retrieval systems have made notable progress in identifying visually similar images to a given query or recognizing specific objects within an image. However, these systems still struggle with representing images and interpreting their semantic meaning. The primary challenge in information retrieval lies in the semantic gap—the discrepancy between the low-level composition of visual data and its high-level semantic interpretation. This gap directly impacts search performance, underscoring the growing need for semantic-based query formulation and retrieval methods. Future automated content-based information retrieval systems must bridge this divide between visual content and semantic understanding.
To address these challenges, a deeper comprehension of visual content is required through semantic-based indexing and retrieval. Classification methods play a crucial role in assigning meaning to documents after a significant learning phase. Identifying key characteristics such as color, shape, texture, motion, sound, and text is essential for determining their category. However, understanding the meaning of a concept for classification purposes remains difficult due to its sensitivity to factors like size, resolution, lighting conditions, and capturing context. A concept may represent an action, an object, or a location. For instance, concepts like “prisoner” and “person” exhibit different levels of abstraction, necessitating specialized methods tailored to such distinctions. However, given the vast number of possible concepts and the complexity of extracting high-level semantic information from low-level data, this approach may not be the most suitable.
The main contributions of this work can be summarized as follows:
  • Hybrid Deep Learning Model: We propose a novel architecture that combines EfficientNet and Vision Transformer (ViT), leveraging their complementary strengths for both local feature extraction and global contextual modeling.
  • Knowledge Graph–Based Indexing: We design a multi-level knowledge graph (contextual, conceptual, and raw data layers) to structure multimedia content and enable semantic navigation beyond low-level features.
  • Ontology-Guided Query Expansion: We introduce an ontology-driven query expansion mechanism that aligns user queries with semantically related concepts, improving retrieval relevance and user interaction.
  • Comprehensive Evaluation: We conduct extensive experiments on three benchmark datasets (TRECVID, Corel, and MSCOCO), demonstrating significant improvements in precision, robustness, and scalability compared to state-of-the-art methods.
Together, these contributions bridge the gap between low-level feature extraction and high-level semantic interpretation, providing a generic and scalable framework for intelligent multimedia indexing and retrieval.

2. Related Works

Information retrieval involves locating and extracting pertinent data from large collections of documents or databases. Typically, users formulate a query, which they enter into a search interface, often refining their results with filters or advanced search options. Designing effective queries becomes especially complex when dealing with non-textual media such as images and videos, which can convey multiple meanings and concepts. This challenge underlines the significance of precise query formulation, as a user’s stated query may not always reflect their true information need. Although text-based searches are the most widely used, alternative input formats—such as pictures, videos, or sketches—are also possible. Nonetheless, text queries remain the dominant choice for most users [1,2], usually relying on keyword or phrase matching enhanced by natural language processing to connect queries with relevant content. However, textual input can be restrictive, particularly for certain types of visual media such as television news. In video retrieval, one promising strategy is to transcribe the audio track to determine the subject matter [3], rather than relying solely on the visual component. Conversely, conceptual queries search for information based on underlying ideas or concepts within the content, rather than specific keywords. This approach has been the focus of various research efforts. For instance, the INFORMEDIA system leverages a limited set of high-level concepts to filter textual query results [4]. It also organizes groups of keyframes [5] and uses speech recognition results to map keyframes to geographic locations, integrating these with other visualizations to contextualize query results. Another example is the method proposed in [6], which employs semantic indexing. This system relies on an extensive semantic lexicon, categorized into threads, to facilitate interaction. It defines multiple spaces, including visual similarity, semantic similarity, and semantic thread spaces, supported by browsers to navigate these dimensions. Additionally, the VERGE approach [7] offers functionalities for high-level visual conceptual retrieval and visual retrieval, combining indexing, analysis, and retrieval techniques across textual, visual, and conceptual modalities. Several other published studies have explored and evaluated various methods of information retrieval. These works propose a wide range of approaches, from traditional techniques based on indexing and keyword matching to more recent methods leveraging machine learning, natural language processing, and semantic analysis. Their primary goal is to enhance the relevance of search results, reduce ambiguity in query interpretation, and ultimately provide a more effective and user-friendly information access experience [8,9,10,11,12,13,14,15,16,17,18,19,20].

2.1. Content-Based Retrieval

Content-based video retrieval (CBVR) operates by segmenting videos and extracting low-level descriptors. This segmentation typically includes identifying shot boundaries, selecting keyframes, and dividing scenes. Feature extraction then focuses on deriving attributes from keyframes, objects, and motion, emphasizing low-level video characteristics [21]. By indexing these features, systems can retrieve videos in response to example-based queries—such as an image, frame, or sketch—to find a desired clip. In this regard, Etter [22] enhanced a video search platform by incorporating query expansion techniques that draw on external sources like Wikipedia titles and images. Likewise, Elleuch et al. [3] introduced three automated search modules dedicated to text extraction, visual feature analysis, and audio feature analysis. Another study [23] presented a CBVR approach that captures color, texture, and shape information. Texture is determined using multi-fractal Brownian motion (mbm), color through a semantic color framework, and shape using the level-set technique. These descriptors are then indexed to enable fast and efficient retrieval.
The IMOTION system [24] exemplifies a multimodal CBVR application, offering diverse query modes based on an extensive range of features. It is scalable to large video collections, leveraging the ADAMpro polystore and the Cineast retrieval engine for multi-feature fusion. Reference [25] presents a systematic review of content-based video retrieval and indexing methods, covering segmentation, feature extraction, dimensionality reduction, and machine learning approaches from 2011 to 2018. Key strategies include shot boundary detection (SBD), color descriptors, and dimensionality reduction techniques such as k-means clustering and principal component analysis (PCA). Machine learning algorithms, including k-means, neural networks, and support vector machines (SVM), have been employed for segmentation, classification, and retrieval enhancement. The study underscores the growing importance of deep learning techniques, such as convolutional neural networks (CNNs), for semantic feature extraction and compact video representation. Future research should further explore feature subset selection and regularization-based dimensionality reduction methods. Reference [26] highlights significant progress in multimodal information retrieval (MMIR) within scientific domains. Models such as CLIP and BLIP, trained primarily on generic datasets (e.g., everyday scenes and landscapes), exhibit limitations when applied to scientific data (e.g., graphs, tables, and detailed captions describing principles or results). To address this gap, the authors introduced SciMMIR, a benchmark comprising 530,000 image-text pairs extracted from scientific documents on arXiv. These pairs are annotated with a two-level hierarchy, enabling fine-grained performance evaluations on subsets (e.g., figures for results, illustrations, architectures; tables for results and parameters). Evaluations in “zero-shot” and “fine-tuning” modes, using models like CLIP, BLIP, and BLIP-2 with OCR integration for improved text processing, demonstrated that fine-tuning on domain-specific data significantly enhances performance. Results reveal that figures are easier to process than tables, with OCR playing a critical role in boosting performance for the latter. Historically, MMIR has advanced through small datasets such as MSCOCO and Flickr30k [27,28], followed by models like CLIP and BLIP [29], supported by large datasets like LAION-400M and LAION-5B [30,31,32]. However, these approaches, designed for generic contexts, fall short of meeting the specific demands of scientific domains, emphasizing the need for benchmarks like SciMMIR. Semantic video indexing systems, such as SVI REGIMVid [33,34,35], have been developed to enable semantic access to multimedia archives. SVI REGIMVid represents a generic video indexing approach designed to enhance semantic retrieval capabilities. However, despite significant advancements, traditional CBVR methods often fail to fully meet user needs, demonstrating inherent limitations [36].

2.2. Semantic-Based Retrieval

In recent years, semantic-based video retrieval has garnered significant attention from researchers [37]. Semantic concept detection involves identifying the presence or absence of high-level concepts, such as “bus,” “forest,” or “sky,” within videos. For instance, the authors in [38] proposed combining concept matching between the query and the corpus with content matching, while works in [39,40] introduced systems for navigating and visualizing semantic concepts using classification-based navigation modules. Article [41] presents an enhanced version of the VISIONE system, which provides advanced video search features, including free-text search, spatial object or color-based search, visual and semantic similarity search, and temporal search. Leveraging artificial intelligence and advanced indexing techniques, VISIONE ensures scalability and efficiency while prioritizing a user-friendly interface for non-experts. Unlike traditional video retrieval approaches, such as simple text-based engines or dense vector representation methods (e.g., CLIP, ALADIN) [42,43,44], VISIONE integrates pre-trained models and techniques like Surrogate Text Representations to transform dense features into sparse vectors, optimizing indexing efficiency [45,46,47]. It outperforms conventional tools evaluated in competitions like Video Browser Showdown (VBS) or Lifelog Search Challenge, which often rely on complex interfaces [48,49,50,51], by offering both simplified and advanced modes to cater to beginners and experienced users alike, positioning itself as a scalable and versatile solution for large-scale video search.
Further developments in this domain include methods like the one in [52], which introduced a weighting technique to calculate concept membership in the audiovisual domain for efficient indexing and retrieval, and the graphical interface in [23] for dynamically exploring query result spaces through interconnected media objects. The interactive video browsing system presented in [53] for the Video Browser Showdown 2016 demonstrated efficiency in helping users locate specific clips within large collections under time constraints. Zero-shot techniques [54] have also emerged as effective tools for video retrieval, employing predefined concept detectors and mapping functions to calculate similarity between queries and database concepts, returning ranked results. Similarly, [55] introduced a semantic-based retrieval method using ranked intersection filtering and a foreground-focused concept co-occurrence matrix, leveraging convolutional neural networks to identify ranked concept probabilities from query-relevant keyframes.
Concept detection remains challenging due to the complexity and variability of visual and semantic content [56]. Typically treated as a classification task, binary classifiers are trained to predict the presence or absence of specific concepts in videos or keyframes based on extracted features. While many audiovisual retrieval methods focus on low-level content within specialized frameworks or single modalities, these constraints often limit performance. Combining semantic content indexing with features from various domains offers an innovative approach, as video components complement each other to enhance semantic understanding. However, existing techniques often lack robust user interaction mechanisms. Enriching retrieval systems with insights from past user behavior could significantly improve relevance. Few studies address this, but [57] proposed a semi-automatic, user-focused method that places users at the center of the retrieval process, offering a promising direction for advancing current systems.

2.3. Multimodal Fusion-Based Retrieval

Multimodal fusion involves integrating features from various data sources to predict a target class value, as highlighted in [58], and typically employs one of three strategies: early fusion, late fusion, or intermediate fusion. Early fusion merges input features from different modalities into a single unified feature vector; for instance, Poria et al. [59] demonstrated this by extracting visual and textual features using deep networks and applying a multiple kernel classifier for sentiment classification, though this approach often struggles with temporal synchronization and high-dimensional redundancy. Late fusion, on the other hand, combines outcomes from separate classifiers trained on individual modalities, as Xu et al. [60] introduced with a bidirectional attention model leveraging visual-textual correlations, though it may overlook low-level interactions, an issue Xu et al. [61] addressed using cross-modal relationships with multi-level LSTMs. Intermediate fusion transforms input data into higher-level representations through multiple layers; for example, Huang et al. [62] proposed a multimodal attentive fusion method that sequentially applies visual and semantic attention models before utilizing multimodal attention, though varying fusion depths may lead to overfitting and hinder effective inter-modality modeling. Each fusion strategy has its strengths and limitations, and selecting the most suitable approach depends on the specific characteristics and requirements of the task at hand.

2.4. Deep Learning-Based Retrieval

Recent breakthroughs in deep learning techniques [63] have significantly expanded their use in information retrieval tasks. These approaches show strong potential across multiple modalities, as demonstrated by a range of studies. For example, Xu et al. [64] combined word- and sentence-level attention mechanisms for modeling textual data with CNN-LSTM architectures to extract semantic representations from images. In a similar vein, Chen et al. [65] utilized emoticons as weak supervision signals and employed both standard and dynamic CNN architectures to jointly process textual and image features. Their framework further incorporated a probabilistic graphical model to uncover relationships between predicted labels across modalities. Zhao et al. [66] tested five pre-trained CNNs for image feature extraction and adopted word2vec to represent textual features, using cosine similarity to assess cross-modal consistency before integrating features for classification. By contrast, Yu et al. [67] proposed an entity-level multimodal classification model, using LSTM networks to encode target entities and an attention mechanism to capture surrounding context, with bilinear pooling to represent intermodal interactions.
Liu et al. [68] developed a sequence-to-sequence–inspired framework designed to identify complex events in surveillance footage. Unlike conventional methods that depend on preprocessing stages such as object detection and tracking, their model predicts video content directly and was validated on a new dataset with diverse visual descriptors. Similarly, Z. Shao et al. [69] introduced the Enhanced Transformer Dense Captioner (ETDC), which integrates a Textual Context Module (TCM) into the Transformer decoder to incorporate surrounding textual cues, and proposed a Dynamic Vocabulary Frequency Histogram (DVFH) re-sampling technique to mitigate word frequency imbalance during training.
Hu et al. [70] explored advancements in image captioning through vision-language pre-training (VLP) techniques. They proposed LEMON, a large-scale image captioner, and conducted an empirical investigation into scaling behavior for image captioning. Using the VinVL model as a foundation, they experimented with transformer model sizes ranging from 13 to 675 million parameters to optimize performance. Additionally, Z. Shao et al. [71] developed the Transformer-based Dense Captioner (TDC) to enhance dense image captioning by prioritizing informative regions. Their architecture introduced the region-object correlation score unit (ROCSU), which considers relationships between detected objects, regions, and confidence scores to determine region importance. Reference [72] proposes a framework for video indexing point detection and extraction using a customized YoloV4 Darknet model trained on 6000 images and tested on 50 educational videos totaling 20 h of content. The method relies on a shot boundary detection algorithm combining the structural similarity index (SSIM) and a binary search algorithm, reducing computation time by 21% while achieving 96% accuracy for abrupt transitions. Automatic slide keyword extraction is performed using the Tesseract OCR, associating keywords with timestamps for easy navigation. Performance is evaluated using standard metrics, with precision, recall, and F1 scores ranging from 60% to 80%. Compared to state-of-the-art methods, traditional approaches, such as those by Riedl and Biemann [73] or Uke and Thool [74], use textual segmentation methods or simple OCR, while modern approaches based on convolutional neural networks, such as those by Podlesnaya and Podlesnyy [75] or Lu et al. [76], focus on integrating CNNs with OCR tools for high accuracy. Additionally, advanced shot boundary detection techniques using color histograms or CNNs have been explored to improve video segmentation [77,78]. The contribution of this article stands out for its combination of techniques for efficient and fast automatic indexing, with potential applications in various domains. Reference [79] introduces an innovative approach for video moment retrieval without annotated data, using a video-conditioned phrase generator and a moment locator based on graphical neural networks (GNNs). The method leverages visual concept detectors (objects, actions, scenes) and a pre-trained image-phrase embedding space to align video moments with textual descriptions while transferring knowledge from the image domain to videos. Unlike supervised approaches [80,81] or weakly supervised ones [82,83,84,85], which require moment-phrase or video-text pairs, this method learns in an unsupervised manner. Tested on datasets such as Charades-STA and ActivityNet Captions, it demonstrates comparable or superior performance to weakly supervised models like TGA [83] and SCN [84]. This is attributed to the complementarity between the phrase generator and the moment locator, which efficiently exploit temporal relationships between video clips under the guidance of a pre-trained embedding model [86,87]. Reference [88] proposes a novel method, DSMHN, for multimodal hashing, using 2D CNNs for images and 3D CNNs for videos to capture spatial and spatiotemporal information, respectively. 
This approach jointly exploits intermodal similarity and intramodal labels to generate compact, balanced, and discriminative binary codes while integrating multiple loss functions, such as L1 and contrastive losses, in a unified and flexible framework. DSMHN surpasses the limitations of existing approaches, including non-deep methods like CMFH, LSSH, and SePH, which rely on matrix factorization and semantic correlation maximization but lack effective deep learning capabilities [89,90,91]. Moreover, it improves the performance of deep methods like DCMH and MMNN, which use neural networks for multimodal similarities but do not effectively preserve inter- and intramodality correlations [92,93]. Experiments conducted on datasets like Wiki, MIRFlickr25k, and MSR-VTT-10K show that DSMHN offers significant improvements in mean average precision (mAP) and recall, demonstrating its superiority for image, text, and video retrieval tasks. Reference [94] proposes a content-based image retrieval (CBIR) model, addressing the limitations of traditional systems based on textual annotations, which are often subjective or insufficient. The model extracts visual features, such as color, texture, shape, and spatial location, to analyze and compare images using similarity measures like Euclidean distance, allowing for the retrieval of relevant images based on user queries. The approach combines global features for speed and local features for precision, albeit at increased computational complexity [95,96,97]. The article highlights the challenge of the “semantic gap,” i.e., the difficulty of linking low-level visual features to high-level concepts such as context or meaning [36,98]. To address this, techniques such as supervised and unsupervised learning, as well as natural language processing algorithms, are used to align visual features with semantic concepts [99,100]. Compared to state-of-the-art systems, modern CBIR systems surpass purely textual approaches and adopt mixed methodologies integrating computer vision and text processing [101,102]. The proposed model achieves improved accuracy from 80% to 95% when the database size is increased from 10,000 to 70,000 images, confirming that database size directly impacts system performance [103,104]. These contributions highlight CBIR advancements and their potential applications, notably in medicine, facial recognition, and visual search engines. Lastly, Liu, Ze et al. [105] proposed the Swin Transformer, a novel vision adaptation of the Transformer architecture designed for various computer vision tasks. Addressing challenges like scale variations and pixel-level resolution, they introduced a hierarchical design using shifted windows, enabling efficient computation by confining self-attention to non-overlapping local windows while preserving cross-window connections. These studies showcase the potential of deep learning to address challenges in multimodal information retrieval, dense captioning, and video analysis, driving innovation in the field.

2.5. Discussion and Justification of the Choice

The state of the art in information retrieval emphasizes the importance of effectively integrating diverse data sources and leveraging advanced deep learning techniques. Convolutional Neural Networks (CNNs) remain reference models for feature extraction from images, capturing visual patterns at different levels, but they face limitations with multimodal data, requiring more advanced approaches. EfficientNet offers an optimized solution by balancing depth, width, and resolution, achieving high performance with reduced resource consumption, crucial for real-time applications. Vision Transformers (ViT) introduce a Transformer-based approach to model long-range relationships between visual elements, making them particularly effective for indexing and retrieving information in complex scenarios where contextualization is essential. Knowledge graphs further enhance multimodal information retrieval by structuring contextual and relational knowledge, facilitating better data organization and retrieval in applications where understanding inter-element relationships is key. Combining EfficientNet, ViT, and knowledge graphs provides a robust and scalable foundation for addressing multimodal data challenges while optimizing computational performance across various tasks, from image classification to managing complex relationships.

3. Intelligent and Generic Approach for Multimedia Indexing and Retrieval

3.1. Global View

Through this work, our objective was to propose a generic indexing method capable of indexing different types of data (Figure 1). The innovative aspect of our system lies in several elements: a generic approach, a hybrid model based on deep learning, and a novel combination of machine learning and knowledge graphs. Our system is structured into multiple functional phases, each playing a key role in the overall efficiency of the indexing process.
Firstly, we worked with several heterogeneous data sources. For general-purpose data, we used three datasets: TRECVID, MSCOCO, and Corel. The first step of our methodology involves data preprocessing, which mainly includes resizing, enhancing data quality, and, finally, data augmentation to enrich the available datasets.
Next, we developed several hybrid models to classify the data. An additional layer was integrated to organize the data, facilitating access and management. In the retrieval phase, we leverage the results of the indexing process by incorporating new aspects, such as query expansion. This feature guides the user in reformulating their queries in terms of concepts, improving the relevance of the results. Furthermore, we developed a feedback method based on a Siamese network, which further enriches the data and enhances the efficiency of our models: each new test image whose similarity score against the images of a dataset class exceeds 0.9 is assigned to that class. This mechanism allows the dataset to grow in a consistent manner.
The details of the key modules in each functional phase are presented in the following sections.

3.2. Proposed Architecture

We propose a knowledge graph-based approach to structure multi-level indexes, organized into three hierarchical layers (contextual, conceptual, and raw data) to facilitate their utilization during the retrieval phase. The goal is to organize data with common characteristics and semantic relationships by assigning videos sharing a common concept to the same group. In this knowledge graph, we distinguish between concepts and contexts: concepts represent real-world objects, while contexts, which are more generic, correspond to “super-concepts” linked by generalization relationships. Moving to a higher level of abstraction within the graph enables a more intuitive organization of data by grouping similar concepts under shared contexts.
This structuring provides a three-level hierarchical navigation. Level 0 includes all audiovisual documents associated with the subjects of the corpus. Level 1 groups concepts such as “Actor,” “Boy,” “Girl,” “Face,” etc. Finally, Level 2 corresponds to the most generic or relevant contexts, such as Person, Animal, and Vehicle.
This approach considers various aspects: concept detection and definition, weighting of relationships between concepts, measuring similarity between concepts, context detection and definition, as well as evaluating video similarity and inter-shot relationships. This structuring via the knowledge graph is illustrated in Figure 2, and the following sections detail the different modules integrated into this architecture.

3.2.1. Data Preprocessing

Three datasets are used. The first, TRECVID, provided by the National Institute of Standards and Technology (NIST), includes a development set of 3200 videos and a test set of about 8000 videos, annotated with 130 semantic concepts. The second, MS COCO (Microsoft Common Objects in Context), is a large collection of 328,000 images of everyday objects and humans. The third, Corel, is a large image collection of approximately 68,000 annotated images covering a wide variety of objects and visual categories.
To improve image quality before training, we applied several preprocessing techniques. First, we worked on histogram enhancement, brightness and contrast correction, and applied filters to reduce noise and highlight contours. Then, to balance underrepresented classes, we used data augmentation techniques based on classical geometric transformations (rotation, scaling, translation, etc.). Additionally, we used the SMOTE (Synthetic Minority Over-sampling Technique) method to artificially generate new samples and thus increase data diversity while avoiding overfitting.
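A minimal sketch of the class-balancing step is given below, assuming images have already been resized and flattened (or encoded) into feature vectors, since SMOTE operates on vectors rather than raw images; the function and variable names are illustrative rather than taken from our implementation.

```python
# Minimal sketch of the SMOTE-based class balancing described above.
# SMOTE works on feature vectors, so images are assumed to be flattened or encoded first.
import numpy as np
from imblearn.over_sampling import SMOTE

def balance_dataset(X: np.ndarray, y: np.ndarray, seed: int = 42):
    """Oversample minority classes with SMOTE.

    X: (n_samples, n_features) flattened or encoded images
    y: (n_samples,) integer class labels
    """
    smote = SMOTE(random_state=seed)
    X_res, y_res = smote.fit_resample(X, y)
    return X_res, y_res
```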
This approach includes data processing, model implementation, training, and evaluation using appropriate metrics. These steps ensure the quality of the data before partitioning it into three subsets: one for training, one for validation, and one for testing.
The training process integrates the training and validation data to adjust the model’s parameters. The model consists of two main components: a pre-trained model and a transfer block. Once training is complete, the model is tested on unknown data to evaluate its final performance.
Particular attention was given to data cleaning. This includes identifying and removing outliers, preventing data leakage, and checking the consistency of classes. The reshaped data is divided as follows: the majority for training, a medium portion for validation, and a smaller fraction for the final test.
To improve the model’s robustness and reduce the risk of overfitting, we applied data augmentation techniques. These techniques include horizontal flipping, adjusting light intensity (brightness values are modified in a range of [−10%, +10%] from the average), random cropping (cropped size varies between 70% and 100% of the original image, with width/height ratios between 3:4 and 4:3), as well as pixel normalization to scale them between 0 and 255. Furthermore, to address the issue of imbalances in the data, we incorporated specific weights (positive and negative) into the cross-entropy loss function, as defined in the following equation.
L_{cross\text{-}entropy}(w) = -\, w_{ps}\, y \log\big(f(x)\big) \; - \; w_{ng}\,(1 - y)\log\big(1 - f(x)\big)
Here, y denotes the ground-truth label and f(x) the predicted probability. We want the positive weight (wps) associated with each class, multiplied by its positive frequency (freqps), to equal the negative weight (wng) multiplied by the negative frequency (freqng) of the same class (the “negative frequency” is defined as the number of examples not belonging to a given class). This relationship is expressed in the following formula.
w_{ps} \times freq_{ps} = w_{ng} \times freq_{ng}
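The following sketch illustrates how the per-class weights and the weighted cross-entropy above could be computed; choosing w_ps = freq_ng and w_ng = freq_ps is one simple way to satisfy the constraint, and the helper names are illustrative.

```python
import numpy as np

def class_weights(labels: np.ndarray):
    """Per-class weights satisfying w_ps * freq_ps = w_ng * freq_ng.

    labels: (n_samples, n_classes) multi-hot ground-truth matrix.
    Setting w_ps = freq_ng and w_ng = freq_ps is one simple choice that satisfies the constraint.
    """
    freq_ps = labels.mean(axis=0)     # positive frequency per class
    freq_ng = 1.0 - freq_ps           # negative frequency per class
    return freq_ng, freq_ps           # (w_ps, w_ng)

def weighted_cross_entropy(y_true, y_pred, w_ps, w_ng, eps=1e-7):
    """Weighted binary cross-entropy matching the equation above."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -(w_ps * y_true * np.log(y_pred) + w_ng * (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()
```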

3.2.2. Transition from Level 0 to Level 1 of the Knowledge Graph

This step involves classifying the data into 130 categories. It allows us to transition from level 0, which corresponds to raw data, to level 1, which represents semantic features. To achieve this, we use a hybrid model combining the pre-trained EfficientNet B1 model and a Vision Transformer (ViT) model. This combination leverages the strengths of both architectures: EfficientNet B1 ensures efficient extraction of local features with its optimized design in terms of size and performance, while ViT captures global relationships in the image through its attention mechanism.
  • EfficientNetB1
EfficientNet [79] is a family of models based on a core network presented in Table 1. Its main component is the Mobile Inverted Bottleneck convolution block (MB-Conv), introduced in [25] and illustrated in Figure 3. This family of convolutional networks is developed by starting from a compact and efficient base model, then systematically adjusting its dimensions using a fixed set of scaling coefficients.
EfficientNet (Table 1) is a highly efficient convolutional neural network designed to optimize scaling. This is important because scaling improves the model’s efficiency. EfficientNet uses Neural Architecture Search (NAS) to find an optimal balance between accuracy and FLOPS cost.
The core blocks are presented in Figure 4. Our dataset was divided into three parts: 70% for training, 20% for validation, and 10% for testing in phase 1. For EfficientNet B1, the input size is (224, 224, 3). We used the Adam optimization algorithm [41] to train the model. Training begins with a learning rate of 0.1 × 10⁻², which is then reduced to 0.01 × 10⁻³ with the ReduceLROnPlateau scheduler.
The model was first trained for 15 epochs with all layers of the pre-trained model frozen and half the steps per epoch. A second 15-epoch phase was then conducted to fine-tune the entire model. We also used a decision threshold of 0.5 to improve accuracy and help the model generalize better.
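A possible Keras realization of this two-phase schedule (layer freezing, Adam with the stated learning rates, and ReduceLROnPlateau) is sketched below; the scheduler patience, loss function, and dataset objects are assumptions.

```python
# Illustrative two-phase training schedule: frozen backbone first, then fine-tuning.
import tensorflow as tf

def train_two_phases(model, base_model, train_ds, val_ds, steps_per_epoch):
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.1, patience=2, min_lr=0.01e-3)

    # Phase 1: freeze the pre-trained backbone and train only the added layers.
    base_model.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1e-2),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=15,
              steps_per_epoch=steps_per_epoch // 2, callbacks=[reduce_lr])

    # Phase 2: unfreeze the backbone and fine-tune the whole network.
    base_model.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1e-2),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=15,
              steps_per_epoch=steps_per_epoch, callbacks=[reduce_lr])
    return model
```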
Instead of creating a model from scratch, we use the weights of a pre-trained network to speed up and improve learning. Our method consists of two blocks, as shown in Figure 5, built on top of models pre-trained on ImageNet [88]. To make the most of the pre-trained weights, we added a transfer block with several elements: a zero-padding layer to smooth the feature maps, a convolutional layer with 512 filters and a 3 × 3 stride, a global average pooling (GAP) layer followed by a conditional dropout layer, a fully connected layer with 1024 output nodes, and finally an output classifier with 8 nodes.
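The transfer block described above can be sketched as follows in Keras; the kernel size, padding amount, dropout rate, and activations are assumptions, while the layer sequence and output sizes follow the text.

```python
# Minimal sketch of the transfer block on top of a frozen ImageNet backbone.
import tensorflow as tf
from tensorflow.keras import layers

def build_transfer_model(num_classes: int = 8, input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.EfficientNetB1(
        include_top=False, weights="imagenet", input_shape=input_shape)
    backbone.trainable = False

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs, training=False)
    x = layers.ZeroPadding2D(padding=1)(x)                # smooth the feature maps
    x = layers.Conv2D(512, kernel_size=3, strides=3)(x)   # 512 filters, 3x3 stride
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)                            # dropout rate assumed
    x = layers.Dense(1024, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```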
  • EfficientNet B1/Vision Transformer (ViT)
Figure 5 shows our hybrid architecture. We use EfficientNet B1, from the EfficientNet family [106], which has demonstrated clear superiority over other networks in classification tasks while using fewer parameters. Its compact size makes it well suited to low-resource environments, and it leverages a combination of efficient network design and compound scaling to achieve high accuracy with a modest parameter budget [107].
The second pillar of the ensemble model is based on Bidirectional Encoder representation from Image Transformers (BEiT). Specifically, BEiT employs a pre-training task called masked image modeling (MIM) and is inspired by BERT [108]. MIM utilizes two views for each image: image patches and visual tokens. The image is divided into a grid of patches, which serve as the input representation for the base Transformer, and is also “tokenized” into discrete visual tokens. During pre-training, a certain proportion of image patches are randomly masked, and the corrupted input is fed into the Transformer. The model learns to recover the visual tokens of the original image rather than the raw pixels of the masked patches.
The vector representation of the image is passed to both the EfficientNet and ViT models. In EfficientNet, we remove the last dense layer of the pre-trained model to extract features from the final flattened layer (average pooling). In the ViT model, we obtain the last hidden states, which contain all patches from the last attention layer except for the classification token. These features are then flattened, and an additional dense layer is used to reduce the shape so that the output features have the same size as those extracted from EfficientNet. Finally, both feature sets are merged.
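A hedged sketch of this fusion step is shown below, using timm for EfficientNet B1 and a Hugging Face BEiT checkpoint as the ViT branch; the model identifiers and the projection size are assumptions.

```python
# Sketch of the feature fusion: pooled EfficientNet features and flattened BEiT
# patch states are projected to a common size and concatenated.
import torch
import torch.nn as nn
import timm
from transformers import BeitModel

class HybridEncoder(nn.Module):
    def __init__(self, fused_dim: int = 1280):
        super().__init__()
        # EfficientNet-B1 without its classification head (global pooled features).
        self.cnn = timm.create_model("efficientnet_b1", pretrained=True, num_classes=0)
        self.vit = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
        n_patches = (self.vit.config.image_size // self.vit.config.patch_size) ** 2
        # Dense layer reducing the flattened patch states to the CNN feature size.
        self.proj = nn.Linear(self.vit.config.hidden_size * n_patches, fused_dim)

    def forward(self, pixels_cnn, pixels_vit):
        cnn_feat = self.cnn(pixels_cnn)                            # (B, 1280)
        hidden = self.vit(pixel_values=pixels_vit).last_hidden_state
        patch_states = hidden[:, 1:, :]                            # drop the [CLS] token
        vit_feat = self.proj(patch_states.flatten(start_dim=1))    # (B, fused_dim)
        return torch.cat([cnn_feat, vit_feat], dim=1)              # merged descriptor
```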
During model validation and prediction, a K-nearest neighbors (KNN) classifier is used, as illustrated in Figure 6. KNN is a classification model that relies on density estimation over the nearest neighbors. For each epoch, we iterate over K values from 1 to 30 and select the optimal K according to the F1-macro score; the corresponding weights are saved and predictions are made on the test dataset. This iterative process ensures that the best predictions are obtained for each epoch of the test data.
The complete process of classifying a test image is presented in Figure 7. The test image is fed into the model, which performs encoding and generates a compact feature vector in a lower-dimensional space. In this reduced feature space, the encoded representation of the test image is compared to that of all training samples using appropriate distance measures. Classification is then performed using the KNN algorithm.
The choice of KNN is motivated by its simplicity, interpretability, and suitability for similarity-based retrieval tasks. Unlike more complex classifiers (e.g., SVM or MLP), KNN introduces minimal additional parameters, thereby allowing us to highlight the effectiveness of the learned representations rather than the sophistication of the classifier. Moreover, KNN naturally aligns with the retrieval paradigm, where decisions are based on proximity in the feature space, and it facilitates seamless integration with our knowledge graph and query expansion mechanisms. Preliminary experiments confirmed that replacing KNN with more advanced classifiers yielded only marginal improvements at the cost of higher computational complexity. Thus, KNN provides an efficient and pragmatic choice, ensuring robustness, scalability, and interpretability within our proposed framework.
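The K-selection loop can be sketched as follows with scikit-learn, fitting KNN on the learned embeddings and keeping the K in [1, 30] that maximizes the macro-F1 score on the validation split; variable names are illustrative.

```python
# Sketch of the K selection loop over the fused training embeddings.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def select_best_k(train_emb, train_y, val_emb, val_y, k_range=range(1, 31)):
    best_k, best_f1, best_model = None, -1.0, None
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k).fit(train_emb, train_y)
        f1 = f1_score(val_y, knn.predict(val_emb), average="macro")
        if f1 > best_f1:
            best_k, best_f1, best_model = k, f1, knn
    return best_k, best_f1, best_model
```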

3.2.3. Weighting of Concepts at the First Level of the Graph

Weighting is one of the fundamental functions in information retrieval. It involves assigning weight to each concept in a document to reflect its importance. The goal is to identify the concepts that best represent the content of the document [52]. Different approaches and weighting techniques are proposed in the literature. Traditionally, concept weighting is inspired by the TF-IDF (Term Frequency-Inverse Document Frequency) approach, widely used in information retrieval systems to adjust the importance of a concept within a document collection. However, this approach has some limitations, including its inability to capture semantic relationships between concepts and its sensitivity to frequency imbalances.
Rather than a simple TF-IDF weighting, we propose using BM25, a weighted version that takes into account the frequency saturation of concepts and the length of documents (videos in our case).
BM25(C_i, V_j) = IDF(C_i) \cdot \frac{TF(C_i, V_j) \cdot (K_1 + 1)}{TF(C_i, V_j) + K_1 \cdot \left(1 - b + b \cdot \frac{|V_j|}{avg(|V|)}\right)}
where K1 and b are tuning parameters (typically K1 = 1.2 and b = 0.75), |Vj| is the length of the video (number of shots), avg(|V|) is the average video length in the collection, IDF(Ci) is the inverse document frequency of the concept Ci, and TF(Ci, Vj) is the frequency of the concept Ci:
TF(C_i, V_j) = \frac{Nbs(C_i, V_j)}{n}
where Nbs(Ci, Vj) is the number of shots containing the concept Ci in video Vj, and n is the total number of shots in video Vj.
This method prevents an over-represented concept from overshadowing the importance of others and adjusts the weighting based on the size of the video. Figure 8 shows an example of concept weighting in video shots.
From this formula, an XML file is created to describe the base videos. Each video is associated with a set of concepts (Figure 9). A weight is assigned to each concept to reflect its importance in the video. This file is then used to match the concepts extracted from the user’s query with those in the base videos.
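A minimal sketch of this BM25 concept weighting is given below; the exact IDF variant is not specified here, so a standard BM25 IDF is assumed, and a video is represented as the list of concepts detected in each of its shots.

```python
# Sketch of the BM25 concept weighting over shot-level concept detections.
import math
from collections import Counter

def bm25_weights(videos, k1: float = 1.2, b: float = 0.75):
    """videos: dict {video_id: list of per-shot concept lists},
    e.g. {"v1": [["person", "car"], ["car"]], ...}
    Returns {video_id: {concept: BM25 weight}}.
    """
    n_videos = len(videos)
    avg_len = sum(len(shots) for shots in videos.values()) / n_videos
    # document frequency: in how many videos each concept appears
    df = Counter(c for shots in videos.values()
                 for c in {c for shot in shots for c in shot})

    weights = {}
    for vid, shots in videos.items():
        n_shots = len(shots)
        counts = Counter(c for shot in shots for c in shot)   # Nbs(Ci, Vj)
        weights[vid] = {}
        for c, nbs in counts.items():
            tf = nbs / n_shots
            idf = math.log((n_videos - df[c] + 0.5) / (df[c] + 0.5) + 1.0)  # standard BM25 IDF (assumed)
            weights[vid][c] = idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * n_shots / avg_len))
    return weights
```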

3.2.4. Incorporation of Semantic Relations via Word Embeddings

The TF-IDF/BM25 approach does not account for the semantic relationships between concepts. We improve this weighting by integrating vector representations of concepts based on BERT. This is where Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers), offer a significant advantage. BERT is a pre-trained language model that enables the extraction of more precise contextual representations of words based on their surrounding context in a sentence or document. BERT works by analyzing text in a bidirectional manner, meaning it takes into account the words both before and after a given term to understand its meaning. This allows for the generation of contextual embeddings that consider the dynamics of each word within its sentence, which is particularly useful in classification tasks where the meaning of concepts can heavily depend on the context in which they appear. We propose using BERT to enhance the weighting of concepts by adjusting the representation of each concept Ci based on the global context in which it appears. For each concept Ci, we extract a representation vector W(Ci) using BERT embeddings. This vector is calculated by passing the text containing the concept through the BERT model, which generates a contextual representation of the word. The final weighting of the concept in a video is adjusted based on its similarity with other concepts present in the video, taking their context into account. This can be formulated similarly to the Word2Vec approach, but using the contextual embeddings generated by BERT:
W(C_i, V_j) = BM25(C_i, V_j) \cdot \frac{1}{n} \sum_{C_k \in V_j} Sim\big(BERT(C_i), BERT(C_k)\big)
where Sim(BERT(Ci), BERT(Ck)) is the cosine similarity between the contextual representation vectors of the concepts Ci and Ck generated by BERT, and n is the total number of concepts in the video.
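The adjustment above can be sketched as follows; a sentence-transformers model is used here purely as a convenient BERT-style encoder, and the checkpoint name is an assumption.

```python
# Sketch of the contextual re-weighting: the BM25 weight of each concept is
# rescaled by its mean cosine similarity to the other concepts of the same video.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # BERT-style encoder (checkpoint assumed)

def contextual_weights(video_concepts, bm25_w):
    """video_concepts: list of concept labels of one video.
    bm25_w: dict {concept: BM25 weight} for that video."""
    emb = encoder.encode(video_concepts, normalize_embeddings=True)
    sims = emb @ emb.T                     # pairwise cosine similarities
    return {c: bm25_w[c] * sims[i].mean() for i, c in enumerate(video_concepts)}
```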
Using BERT allows taking into account not only the meaning of concepts in a general context (as Word2Vec does), but also the variations in their meaning depending on the immediate context of their usage. For example, the same word may have different meanings depending on whether it is used in a medical, technological, or everyday context. BERT helps capture these nuances more accurately. The goal is to facilitate the transition between the contextual and conceptual levels. The similarity is defined by the following equation:
Sim(C_i, C_j) = \alpha \cdot \cos\big(Emb(C_i), Emb(C_j)\big) + \beta \cdot \exp\big(-dist(C_i, C_j)\big) + \gamma \cdot \frac{|C_i \cap C_j|}{|C_i \cup C_j|}
where:
  • cos(Emb(Ci), Emb(Cj)): the cosine similarity between the contextual embedding vectors of the concepts Ci and Cj extracted from BERT. This measure captures the direct semantic proximity between the concepts in their context.
  • exp(-dist(Ci, Cj)): an exponential weighting of the ontological distance, taking into account the hierarchical or taxonomic relationships between the concepts.
  • |Ci ∩ Cj|/|Ci ∪ Cj|: the co-occurrence rate of the concepts in the corpus, evaluating their common frequency in the processed documents.
  • α, β, γ: adjustable hyperparameters according to the specific task and corpus, allowing control over the relative importance of each component of the similarity.
This approach allows for a better consideration of the context of concepts, thanks to the distributed representations generated by BERT, and offers a more refined modeling of the semantic relationships between concepts. This ensures a more accurate understanding of complex and contextual relationships, overcoming the limitations of traditional approaches based on static representations.
From this formula, an XML file is created to describe the relationships between the concepts. Each concept has similarities with several other concepts (Figure 10). A similarity weight is assigned to each concept to reflect its degree of similarity to other concepts. This file is then used to establish correspondences between the concepts during the navigation-based search.
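The similarity weights stored in this file can be computed with a sketch like the following, combining the BERT cosine term, the exponential ontological-distance term, and the Jaccard co-occurrence rate; the default α, β, γ values and the input representations are assumptions.

```python
# Illustrative computation of the composite concept similarity defined above.
import numpy as np

def concept_similarity(emb_i, emb_j, onto_dist, docs_i, docs_j,
                       alpha=0.5, beta=0.3, gamma=0.2):
    """emb_i, emb_j: L2-normalized BERT embeddings of the two concepts.
    onto_dist: hierarchical distance between the concepts in the ontology.
    docs_i, docs_j: sets of document ids in which each concept occurs."""
    cos_term = float(np.dot(emb_i, emb_j))            # contextual proximity
    dist_term = np.exp(-onto_dist)                    # ontological closeness
    union = docs_i | docs_j
    jaccard = len(docs_i & docs_j) / len(union) if union else 0.0  # co-occurrence rate
    return alpha * cos_term + beta * dist_term + gamma * jaccard
```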

3.2.5. Attention-Based Weighting for Better Contextualization

Moving to a higher level of abstraction helps us better organize data and accelerate access. We have grouped semantically similar concepts under a common context by proposing a method to extract this context from the concepts. Our notion of “context” is inspired by the work of [94], which uses a knowledge encoding technology called “topic maps.” We adapted this technique to the audiovisual domain by introducing semantic entities called “contexts.” A context is defined as a concept that meets two main criteria:
(1) Maximum total similarity: The context is the concept with the highest sum of similarities with all other concepts in the collection. This allows us to identify a central concept that is highly interconnected with others. Formally, for a concept Ci belonging to a set C of concepts, the total similarity is expressed by:
S(C_i) = \sum_{j=1,\, j \neq i}^{|C|} Sim(C_i, C_j)
where Sim(Ci, Cj) is a similarity function between two concepts Ci and Cj. The context Ccontexte is the concept that maximizes S(Ci):
C_{contexte} = \arg\max_{C_i \in C} S(C_i)
(2) Maximum occurrence frequency: Among the considered concepts, the context is also the one that appears most frequently in the entire audiovisual collection. The occurrence frequency is defined by:
F(C_i) = \frac{Occ(C_i)}{N}
where Occ(Ci) is the number of occurrences of the concept Ci in the collection, and N is the total number of videos. The context is determined by maximizing a combined function of total similarity and occurrence frequency, ensuring that it is both highly connected to other concepts and prevalent in the collection:
C_{contexte} = \arg\max_{C_i \in C} \big( S(C_i) \times F(C_i) \big)
This method allows the creation of a semantic summary, that is, a set of contexts that show the importance of each concept in the collection. It also helps the user easily start navigating through the data.
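A compact sketch of this context-selection rule is given below; the similarity function, occurrence counts, and concept list are assumed to come from the previous steps.

```python
# The context is the concept maximizing total similarity times occurrence frequency.
def select_context(concepts, sim, occ, n_videos):
    """concepts: list of concept labels.
    sim: function sim(ci, cj) -> similarity score.
    occ: dict {concept: number of occurrences in the collection}.
    n_videos: total number of videos N."""
    def score(ci):
        total_sim = sum(sim(ci, cj) for cj in concepts if cj != ci)   # S(Ci)
        freq = occ.get(ci, 0) / n_videos                              # F(Ci)
        return total_sim * freq
    return max(concepts, key=score)
```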
In practice, we created an XML file (Figure 11) to describe the context of each concept. This file contains an attribute called “number of concepts,” which indicates how many concepts are included. Each context element contains several “concept” elements, each with an attribute “weight” that represents the level of similarity with other concepts.

3.2.6. Our Ontology, the Result of the Indexing Phase

After these steps, which constitute our indexing method, we obtained an ontology representing the data at three levels of abstraction: the raw data level, corresponding to data from various sources; the conceptual level, representing classes and contexts; and the super-class level, grouping contexts into more general categories. Classes and super-classes are interconnected through semantic relationships, and each image or video has a weight in each class. The following figure (Figure 12) represents our ontology.

4. Retrieval Phase

4.1. Retrieval by Textual Query

In general, the most natural way for a user to express their query is through text defining the content of the images they are searching for. The system then analyzes the query and translates it into features describing the request. The issue at this stage is how to translate a textual query into concepts. For example: which concepts in the database effectively describe a football match? To solve this problem, we use a query expansion technique, leveraging our ontology to assist the user in formulating queries (Table 2). In this way, we help them express their query through N concepts selected from the existing concepts in our database. Figure 13 explains the process of the query expansion mechanism.
The following algorithm (Algorithm 1) represents the key steps to guide the user in formulating their query in the form of a concept query.
Algorithm 1: Ontology-Based Query Expansion for Multimedia Retrieval
        Input: User text query Q, Ontology O, video collection V
        Output: Ranked list of videos R
Step 1: Query Indexing
    Remove stop words and normalize term weights, where ti are the query terms and |Q| is the number of terms after preprocessing.
Step 2: Concept Matching via Jaccard Similarity
    For each concept Ck ∈ O, compute its similarity with Q:
     Jaccard(D_Q, D_{C_k}) = \frac{|D_Q \cap D_{C_k}|}{|D_Q \cup D_{C_k}|}
    where DQ and DCk are the descriptor sets of Q and Ck, respectively (e.g., Table 2).
    Select the top-N concepts Ctop = {C1, C2, …, CN} with the highest similarity.
Step 3: Ontological Projection for Refinement
    Expand Ctop using semantic relationships in O:
     (See the equation C_{contexte} = \arg\max_{C_i \in C} S(C_i) in Section 3.2.5 for projection details.)
Step 4: User Selection. The user manually selects the relevant concepts Cfinal from the expanded set.
Step 5: Vector Space Matching with Cosine Similarity
    Represent the query and videos as concept vectors:
    Query vector: req⃗ = (PC1(req), PC2(req), …, PCM(req))
    where PCj(req) = 1 if Cj ∈ Cfinal, else 0.
    Video vector: Vi⃗ = (PC1(Vi), PC2(Vi), …, PCM(Vi))
    where PCj(Vi) is the precomputed weight of Cj in video Vi.
    Compute similarity for each Vi ∈ V
     Sim(req, V_i) = \cos(\vec{req}, \vec{V_i}) = \frac{\sum_{j=1}^{n} P_{C_j}(V_i)\, P_{C_j}(req)}{\sqrt{\sum_{k=1}^{n} P_{C_k}(V_i)^2}\; \sqrt{\sum_{k=1}^{n} P_{C_k}(req)^2}}
Notations:
Vi: the video with index i
req: the user’s query
PCj(Vi): the weight of concept j in video Vi
PCj(req): the weight of concept j in the query
Step 6: Displaying Results: Videos are shown ranked according to their relevance to the query.
Key Features:
    Ontology Integration: Leverages ontology (Figure 12) for query expansion and semantic refinement.
    Interactive Refinement: User selects concepts post-projection (Step 4) for personalized results.
    Hybrid Metrics: Combines Jaccard (term-level) and cosine (concept-level) similarities.
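A compact sketch of Algorithm 1 is given below: Jaccard matching of the query descriptors against concept descriptors, followed by cosine ranking of videos in the concept space; the interactive selection of Step 4 is represented by passing the chosen concepts as an argument, and all names are illustrative.

```python
# Sketch of Algorithm 1: Jaccard-based concept expansion + cosine ranking of videos.
import numpy as np

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def expand_query(query_terms, concept_descriptors, top_n=5):
    """concept_descriptors: dict {concept: set of descriptor terms}."""
    scored = sorted(concept_descriptors,
                    key=lambda c: jaccard(set(query_terms), concept_descriptors[c]),
                    reverse=True)
    return scored[:top_n]          # Ctop, to be refined and selected by the user

def rank_videos(selected_concepts, video_weights, all_concepts):
    """video_weights: dict {video_id: {concept: weight}} from the indexing phase."""
    q = np.array([1.0 if c in selected_concepts else 0.0 for c in all_concepts])
    results = []
    for vid, w in video_weights.items():
        v = np.array([w.get(c, 0.0) for c in all_concepts])
        denom = np.linalg.norm(v) * np.linalg.norm(q)
        results.append((vid, float(v @ q / denom) if denom else 0.0))
    return sorted(results, key=lambda x: x[1], reverse=True)
```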

4.2. Image Query Retrieval

First, the input image is normalized to a standard format of 1024 × 1024 to ensure optimal compatibility between the different modules. Then, two parallel processing steps are performed: EfficientNet B0 is used as an encoder to extract discriminative and detailed features from the query image, while YOLOv8 is responsible for locating the relevant concepts in the same image. Once the outputs are obtained, a critical matching step is carried out between the results from YOLOv8 and those from EfficientNet. This matching process aims to align the areas detected by YOLOv8 (represented as regions of interest or bounding boxes) with the features extracted by EfficientNet. This process effectively combines the spatial localization of concepts provided by YOLOv8 with the contextual richness of the features extracted by EfficientNet (Figure 14).
For each detection, a confidence score is calculated to determine the relevance of the results. If the detection score is below a predefined threshold, the detection is rejected. This validation step ensures that only reliable detections consistent between YOLOv8 and EfficientNet are retained. For example, if YOLO identifies a region but it does not correspond to relevant features according to EfficientNet, it is ignored. The validated detections then proceed to a final ranking step to select the most relevant results. These final results are subsequently presented to the user for consultation and visualization.
By integrating a rigorous matching process between the outputs of YOLOv8 and EfficientNet, this architecture enhances the accuracy and reliability of the CBIR system. It ensures precise localization of concepts while leveraging rich contextual features to reduce errors and improve the overall relevance of the results.
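One possible realization of this pipeline is sketched below using the ultralytics YOLOv8 API and a timm EfficientNet B0 encoder; the matching rule shown (cosine similarity between each detected crop’s embedding and a per-concept prototype built at indexing time) is an assumption consistent with the description above, as are the thresholds and model weights.

```python
# Hedged sketch: YOLOv8 proposes concept regions, EfficientNet B0 encodes them,
# and low-confidence or inconsistent detections are discarded before ranking.
import numpy as np
import timm, torch
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")                                   # detection weights assumed
encoder = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0).eval()

def query_by_image(image, class_prototypes, det_thresh=0.5, sim_thresh=0.7):
    """image: HxWx3 numpy array; class_prototypes: dict {concept: prototype embedding}."""
    detections = detector(image)[0]
    kept = []
    for box in detections.boxes:
        conf = float(box.conf)
        if conf < det_thresh:
            continue                                            # reject unreliable detections
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = image[y1:y2, x1:x2]
        with torch.no_grad():
            tensor = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0
            tensor = torch.nn.functional.interpolate(tensor, size=(224, 224))
            feat = encoder(tensor).squeeze(0).numpy()
        concept = detections.names[int(box.cls)]
        proto = class_prototypes.get(concept)
        if proto is None:
            continue
        sim = float(feat @ proto / (np.linalg.norm(feat) * np.linalg.norm(proto)))
        if sim >= sim_thresh:                                   # keep only consistent detections
            kept.append((concept, conf, sim))
    return sorted(kept, key=lambda t: t[1] * t[2], reverse=True)
```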

4.3. User Interface (e.g., Textual Query)

The end user of our system formulates their query, which represents a set of keywords, to search for the appropriate concepts corresponding to their query by performing a comparison between the query terms and the description of each concept (Figure 15). We assume that the user has typed “news” as their query. Our system will search for the appropriate concepts for this term by matching the query terms with the description of each concept. The returned result includes the following concepts: news_studio, reporters, weather.
The user selects one of the concepts returned by the system (Figure 16), which then retrieves the images corresponding to that concept. Additionally, our system offers the possibility of selecting multiple concepts. For instance, suppose the user selects the concept “Reporters”; the corresponding results will be displayed as follows: (the images related to this query are sorted according to their degree of relevance).

5. Experimentation

5.1. Precision Values

The accuracy and loss curves, shown in Figure 17, illustrate the evolution of model performance during training. The accuracy of the EfficientNet and ViT models progressively increases, reaching a final value of 96%. The hybrid model achieves a higher final accuracy of 99%, as shown by the solid line. Simultaneously, the loss curves indicate the corresponding reduction in loss over time. Both the EfficientNet and ViT models exhibit similar convergence behavior, while the hybrid model demonstrates the fastest decrease in loss, achieving the lowest final loss.
The evaluation of the EfficientNet and ViT models and of their combination reveals significant differences in accuracy and convergence. EfficientNet alone achieves a final accuracy of 96%, demonstrating its effectiveness as a CNN optimized for local feature extraction. However, although it converges quickly, its learning reaches a plateau after a certain number of epochs, limiting its ability to capture high-level relationships between visual elements. ViT (Vision Transformer), on the other hand, excels at modeling global spatial relationships between image patches thanks to its self-attention mechanism, but it has a slower learning phase and requires more epochs to reach a final accuracy of 96%. This latency is explained by the fact that ViT does not rely on convolutional filters but on self-attention over patch embeddings, which requires more data and training time to learn robust representations.

The hybrid approach combining EfficientNet and ViT takes advantage of the strengths of both architectures: accuracy reaches 99%, a significant improvement over either model taken individually. This performance is explained by the complementarity of the two networks: EfficientNet ensures efficient extraction of local textures, shapes, and patterns, while ViT captures higher-level relationships between objects in the image. Moreover, the loss curve associated with the hybrid model is the lowest (0.10), indicating better generalization and less variability in learning. Convergence is faster and more stable, reducing the risk of overfitting.
In summary, these results confirm the value of multi-level fusion in image and video search. The combination of techniques from CNNs and Transformers optimizes both the accuracy and robustness of the model, fully exploiting the visual features of the processed images and videos.
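To make the fusion idea concrete, the sketch below shows a simple late-fusion head that concatenates pooled EfficientNet features with ViT embeddings before classification, mirroring the dense/dropout configuration of Figure 4. The feature dimensions (1280 for EfficientNet-B1, 768 for a BEiT/ViT-Base encoder) and the concatenation-based fusion are assumptions for illustration; the exact fusion mechanism used in the paper may differ.

```python
import torch
import torch.nn as nn

class HybridFusionHead(nn.Module):
    """Concatenate CNN (EfficientNet) and Transformer (ViT) embeddings and classify."""

    def __init__(self, cnn_dim=1280, vit_dim=768, num_classes=30, dropout=0.2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(cnn_dim + vit_dim, 1024),  # dense layer with 1024 units, as in Figure 4
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(1024, num_classes),        # 30 semantic concept classes
        )

    def forward(self, cnn_features, vit_features):
        # cnn_features: (batch, cnn_dim) pooled EfficientNet features
        # vit_features: (batch, vit_dim) ViT [CLS] embeddings
        fused = torch.cat([cnn_features, vit_features], dim=1)
        return self.classifier(fused)

# Illustrative forward pass with random embeddings
head = HybridFusionHead()
logits = head(torch.randn(4, 1280), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 30])
```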

5.2. Image Query

In this experiment, every image from each category is used in turn as a query image. The precision of the top 20 returned images is then computed for each query, and the average precision across all queries is reported for each category. The overall performance of our system is compared with that of several state-of-the-art CBIR systems.
The average precision for all image categories in the Corel-1K dataset is reported in Table 3, alongside the results of other CBIR systems for comparison. With an average precision of 100%, our system achieves the highest precision among the state-of-the-art CBIR systems considered. The results are illustrated in Figure 18. These initial findings indicate that the system correctly retrieves all 20 top-ranked images in every class.
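For reference, the precision@20 protocol described above can be computed as in the following sketch; the data layout and helper names are illustrative assumptions.

```python
def precision_at_k(retrieved_labels, query_label, k=20):
    """Fraction of the top-k retrieved images that share the query's category."""
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == query_label) / k

def average_precision_per_category(results, k=20):
    """Mean precision@k over all queries of each category.

    results: list of (query_label, retrieved_labels) pairs, one per query image.
    """
    per_category = {}
    for query_label, retrieved in results:
        per_category.setdefault(query_label, []).append(
            precision_at_k(retrieved, query_label, k))
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}

# Toy example: one 'Horse' query whose top-20 results are all horses
print(average_precision_per_category([("Horse", ["Horse"] * 20)]))  # {'Horse': 1.0}
```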

5.3. Ablation Study

The proposed contribution consists of several layers. We begin with a low-level descriptor phase, followed by a high-level descriptor extraction phase using two classification methods, EfficientNet and ViT. The two models are then combined, and the final phase organizes and enriches the concepts using a knowledge graph model. Figure 19 illustrates the experimental protocol, in which one layer is removed at each iteration to assess its contribution. We also compare the results obtained with traditional machine learning and with deep learning approaches, and we evaluate the importance of the organization layer.
We conducted a progressive evaluation of each component of our system (Table 4). First, the exclusive use of low-level descriptors resulted in relatively modest performance, as evidenced by the scores observed for concepts such as Airplane (0.23) and Desk (0.32). This step provided a baseline for assessing the improvements made in the subsequent phases. Next, we evaluated the high-level descriptors through two distinct experimental phases: one involving machine learning techniques and the other relying on advanced deep learning methods. The results show a clear improvement in performance, particularly with ViT and EfficientNet for concepts like Musician Instrumental (0.8 and 0.86, respectively) and Studio with Host (0.77 and 0.88).
Subsequently, the combination of these approaches revealed significant gains, as illustrated by scores like 0.94 for Quadruped and 0.89 for Bicycle, highlighting the advantage of fusing multi-level descriptors.
Finally, the full integration of our system, including the organization phase, enabled us to achieve optimal performance. The final scores for several concepts, such as Airplane (0.95), Desk (0.97), and Musician Instrumental (0.98), reflect the stabilization and refinement of overall performance thanks to this phase.
The analysis of the table reveals that the combination of low-level and high-level descriptors, as well as the use of deep learning techniques such as EfficientNet and ViT, significantly improves precision and performance across various categories of semantic concepts (Figure 20).
For example, the results for concepts like Airplane show a substantial improvement, reaching a score of 0.95 with the combination of EfficientNet + ViT + Knowledge Graph, well above the values obtained with the descriptors taken individually (0.45 for ViT alone and 0.55 for EfficientNet alone). A similar observation can be made for Cycling, where the combination achieves 0.89, greatly surpassing the performance of the isolated approaches.
This trend holds for most of the concepts studied, such as Musician Instrumental (0.98) or Studio with Host (0.97), demonstrating the relevance of fusing multi-level features for precise classification. In particular, the combination enriched with Knowledge Graphs further optimizes performance, likely by stabilizing and refining the results.
The positive impact of deep learning architectures is evident: the performance achieved with EfficientNet and ViT far exceeds traditional methods in several categories such as Boat/Ship (0.89) and Desk (0.97). This analysis thus highlights the importance of an integrated approach combining multi-level descriptors and advanced deep learning techniques, while leveraging efficient data organization to maximize the precision and effectiveness of semantic classification systems.

5.4. Other Metrics

Table 5 summarizes the performance of the multimedia concept detection model on the processed dataset, using the adopted evaluation metrics: mAP at a threshold of 0.5 and mAP over the range [0.5:0.95]. We achieved an average precision of 0.95, an average sensitivity of 0.91, and an average F1 score of 0.93. The model reached an average mAP of 0.96 at a threshold of 0.5 and 0.74 over the [0.5:0.95] range. Among the top-performing concepts, the detection of “Anchorperson” achieved a precision of 0.99, a sensitivity of 0.97, and an mAP of 0.99 at a threshold of 0.5. Similarly, the detection of “Government-Leader” and “Quadruped” showed high performance, with precision values of 0.98 for both and mAPs of 0.99. In contrast, some concepts showed more modest performance: the detection of “Car Racing” and “Throwing” had a precision of 0.75, a sensitivity of 0.72, and an mAP of 0.78 at a threshold of 0.5, while “Flags” and “Bus” had lower scores, with mAPs of 0.81 and 0.80, respectively. Nevertheless, these results are comparable to those in the literature and demonstrate the robustness of the model in handling the variability of multimedia concepts. Figure 10 illustrates the precision-sensitivity curve.
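The reported averages are internally consistent: the average F1 score follows from the average precision and sensitivity in the usual way,

F_1 = \frac{2 \, P \, R}{P + R} = \frac{2 \times 0.95 \times 0.91}{0.95 + 0.91} \approx 0.93.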
The performance of the detection model was evaluated using various performance indicators. The analysis revealed an average F1 score of 0.93, as shown in Figure 21. Additionally, the precision of the model was evaluated using the mean average precision (mAP) criterion, which yielded a value of 96% at a threshold of 0.5 and 74% on the interval [0.5:0.95]. Overall, the results of this work suggest that the proposed model provides strong performance, with high precision on several concepts such as “Anchorperson” (0.99), “Government-Leader” (0.98), and “Quadruped” (0.98). Moreover, improvements in F1 and mAP metrics for more difficult-to-detect concepts indicate increased robustness of the model. These results suggest that the proposed approach holds significant potential for the detection task, offering high precision and sensitivity across a wide range of multimedia concepts.
The results obtained demonstrate the effectiveness of our proposed method, with a significant improvement in performance in terms of F1-score and mAP 0.5, reflecting a better balance between precision and recall. Most concepts achieve high scores, indicating a reduction in false positives and false negatives, which enhances the model’s reliability. Unlike traditional approaches, our method ensures consistent performance even on difficult-to-classify classes, thus demonstrating better generalization. With advanced parameter optimization and more efficient feature extraction, our approach provides a more robust and accurate classification. These results confirm its potential for applications requiring high precision, such as real-time object recognition or medical image analysis, thereby reinforcing its relevance in the state of the art. To assess the robustness of the reported improvements, we conducted statistical significance testing. Specifically, paired t-tests were applied to compare the performance of the hybrid model (EfficientNet-B1 + ViT + Knowledge Graph) with the individual baselines (EfficientNet-B1 alone and ViT alone) across the three benchmark datasets. The results indicate that the improvements in mAP and F1-score achieved by the hybrid approach are statistically significant, with p < 0.01 in all cases. These findings confirm that the superiority of the proposed framework is not attributable to random chance but reflects a genuine performance improvement.
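As an illustration of the significance test, the sketch below applies SciPy's paired t-test to per-concept scores. The example arrays are taken from the first eight rows of Table 4 (hybrid versus EfficientNet alone) purely for illustration; they are not the exact inputs used in the study.

```python
from scipy.stats import ttest_rel

# Illustrative per-concept scores (first eight rows of Table 4):
# hybrid = EfficientNet + ViT + Knowledge Graph, baseline = EfficientNet alone.
hybrid_scores   = [0.95, 0.99, 0.93, 0.89, 0.89, 0.94, 0.77, 0.75]
baseline_scores = [0.45, 0.88, 0.16, 0.56, 0.61, 0.35, 0.17, 0.03]

t_stat, p_value = ttest_rel(hybrid_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# The improvement is considered significant when p < 0.01, as reported in the text.
```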
In addition to accuracy evaluations, we also analyzed the computational efficiency of the proposed framework. EfficientNet-B1 provides a lightweight backbone, while the integration of ViT and knowledge graph reasoning introduces moderate overhead. Empirical measurements indicate that hybrid training requires approximately 25% more time compared to EfficientNet alone, with a memory footprint of ~6 GB during training on a single NVIDIA RTX 3090 GPU. For retrieval tasks, query processing averaged 0.45 s per request on the MSCOCO dataset, including both feature comparison and knowledge graph navigation. The memory consumption of the knowledge graph module remained stable, scaling primarily with the number of indexed concepts and relationships (2–3.5 GB for the datasets used). These results confirm that the proposed system maintains an acceptable trade-off between computational cost and retrieval accuracy, ensuring its applicability to large-scale multimedia indexing and retrieval scenarios.

6. Conclusions

This paper introduced a hybrid approach combining deep learning and knowledge graphs to address the challenges of multimedia indexing and retrieval. By leveraging EfficientNet and ViT for feature extraction and structuring concepts hierarchically using knowledge graphs, our system successfully enhances semantic representation and retrieval accuracy. The integration of query expansion techniques further improves user search experiences by aligning textual queries with relevant visual content. Experimental evaluations demonstrate the superiority of our approach over traditional methods, achieving high precision and recall across multiple datasets.

Beyond achieving high retrieval accuracy, our approach paves the way for a more intelligent and adaptable multimedia search system. The integration of knowledge graphs not only structures multimedia content but also enables contextual reasoning and semantic inferences, making retrieval more intuitive and efficient. Additionally, the use of deep learning models ensures scalability and robustness when processing large-scale and diverse datasets.

Despite these advancements, several challenges remain. The computational complexity of deep learning models and knowledge graph processing requires optimized hardware and efficient algorithmic solutions. Furthermore, handling evolving multimedia content and dynamically updating the knowledge graph pose additional research challenges.

Future work will focus on enhancing multimodal learning by incorporating transformer-based architectures and self-supervised learning techniques to further improve semantic understanding. Additionally, expanding the knowledge graph to include user feedback and reinforcement learning mechanisms can enhance adaptability and personalization in retrieval tasks.

From a broader perspective, future research should explore the integration of multimodal data beyond images and videos, including textual, audio, and sensor data, to create a more holistic indexing framework. Moreover, developing real-time retrieval systems capable of adapting to changing user preferences and dynamic data streams will be crucial. Investigating the role of federated learning and privacy-preserving techniques in multimedia retrieval can further enhance security and user trust. Finally, interdisciplinary collaborations with cognitive science and human–computer interaction experts could lead to more user-friendly and interpretable AI-driven retrieval systems.

In conclusion, this research contributes significantly to the advancement of AI-driven multimedia retrieval systems by bridging the gap between low-level visual features and high-level semantic interpretation. The proposed hybrid approach not only improves retrieval accuracy but also offers a scalable and semantically rich framework for future intelligent indexing and retrieval applications.

Author Contributions

M.H.: Conducted the research, wrote the manuscript, created figures, designed algorithms, implemented and tested code, acquired datasets, and carried out experiments. D.S.: Guided the manuscript’s structure, proofread the paper, and provided advice for journal submission. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Hamroun, M.; Lajmi, S.; Nicolas, H.; Amous, I. VISEN: A video interactive retrieval engine based on semantic network in large video collections. In Proceedings of the 23rd International Database Applications & Engineering Symposium (IDEAS), New York, NY, USA, 10–12 June 2019; pp. 1–10. [Google Scholar] [CrossRef]
  2. Chen, J.; Mao, J.; Liu, Y.; Zhang, F.; Min, Z.; Ma, S. Towards a better understanding of query reformulation behavior in web search. In Proceedings of the WWW ’21: The Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021. [Google Scholar] [CrossRef]
  3. Ntirogiannis, K.; Gatos, B.; Pratikakis, I. Binarization of textual content in video frames. In Proceedings of the 2011 International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, 18–21 September 2011; pp. 673–677. [Google Scholar] [CrossRef]
  4. Christel, M.G.; Hauptmann, A.G. The use and utility of high-level semantic features in video retrieval. In Image and Video Retrieval; Leow, W.K., Lew, M.S., Chua, T.S., Ma, W.Y., Chaisorn, L., Bakker, E.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 134–144. [Google Scholar]
  5. Snoek, C.; Worring, M.; Koelma, D.; Smeulders, A. A learned lexicon-driven paradigm for interactive video retrieval. IEEE Trans. Multimed. 2007, 9, 280–292. [Google Scholar] [CrossRef]
  6. Worring, M.; Snoek, C.; de Rooij, O.; Nguyen, G.; van Balen, R.; Koelma, D. Mediamill: Advanced browsing in news video archives. Lect. Notes Comput. Sci. 2006, 4071, 533–536. [Google Scholar] [CrossRef] [PubMed]
  7. Vrochidis, S.; Moumtzidou, A.; King, P.; Dimou, A.; Mezaris, V.; Kompatsiaris, I. VERGE: A video interactive retrieval engine. In Proceedings of the 2010 International Workshop on Content Based Multimedia Indexing (CBMI), Grenoble, France, 23–25 June 2010; pp. 1–6. [Google Scholar] [CrossRef]
  8. Furnas, G.W.; Landauer, T.K.; Gomez, L.M.; Dumais, S.T. The vocabulary problem in human-system communication. Commun. ACM 1987, 30, 964–971. [Google Scholar] [CrossRef]
  9. Maron, M.E.; Kuhns, J.L. On relevance, probabilistic indexing and information retrieval. J. ACM 1960, 7, 216–244. [Google Scholar] [CrossRef]
  10. Rocchio, J.J. Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing; Salton, G., Ed.; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971; pp. 313–323. [Google Scholar]
  11. Jones, K.S. Automatic Keyword Classification for Information Retrieval. Available online: https://api.semanticscholar.org/CorpusID:62724133 (accessed on 4 September 2025).
  12. Rijsbergen, C.V. A theoretical basis for the use of co-occurrence data in information retrieval. J. Doc. 1977, 33, 106–119. [Google Scholar] [CrossRef]
  13. Van Rijsbergen, C.J. A non-classical logic for information retrieval. Comput. J. 1986, 29, 481–485. [Google Scholar] [CrossRef]
  14. Porter, M. Implementing a probabilistic information retrieval system. Inf. Technol. Res. Dev. 1982, 1, 131–156. [Google Scholar]
  15. Yu, C.T.; Buckley, C.; Lam, K.; Salton, G. A Generalized Term Dependence Model in Information Retrieval; Technical Report; Cornell University: Ithaca, NY, USA, 1983. [Google Scholar]
  16. Harman, D. Relevance Feedback Revisited; Association for Computing Machinery: New York, NY, USA, 1992. [Google Scholar]
  17. Statista: Average Number of Search Terms for Online Search Queries in the United States as of January 2020. Available online: https://www.statista.com/statistics/269740/number-of-search-terms-in-internet-research-in-the-us/ (accessed on 4 September 2025).
  18. Keyword Discovery: Keyword: Query Size by Country. Available online: https://www.keyworddiscovery.com/keyword-stats.html (accessed on 4 September 2025).
  19. Azad, H.; Deepak, A.; Chakraborty, C.; Abhishek, A.K. Improving query expansion using pseudo-relevant web knowledge for information retrieval. Pattern Recognit. Lett. 2022, 158, 148–156. [Google Scholar] [CrossRef]
  20. Azad, H.K.; Deepak, A. Query expansion techniques for information retrieval: A survey. Inf. Process. Manag. 2019, 56, 1698–1735. [Google Scholar] [CrossRef]
  21. Hu, W.M.; Xie, N.H.; Li, L.; Zeng, X.L.; Maybank, S. A survey on visual content-based video indexing and retrieval. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2011, 41, 797–819. [Google Scholar] [CrossRef]
  22. Etter, D. KB Video Retrieval at TRECVID 2011. 2009. Available online: https://www-nlpir.nist.gov/projects/tvpubs/tv11.papers/kbvr.pdf (accessed on 4 September 2025).
  23. Ellouze, N.; Lammari, N.; Métais, E.; Ahmed, M.B. CITOM: Approche de construction incrémentale d’une Topic Map multilingue. Data Knowl. Eng. 2010. [Google Scholar]
  24. Rossetto, L.; Giangreco, I.; Ta, C.; Schuldt, H. Multimodal video retrieval with the 2017 IMO-TION system. In Proceedings of the ICMR ’17: International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, 6 June 2017; pp. 457–460. [Google Scholar] [CrossRef]
  25. Spolaôr, N.; Lee, H.D.; Takaki, W.S.R.; Ensina, L.A.; Coy, C.S.R.; Wu, F.C. A systematic review on content-based video retrieval. Eng. Appl. Artif. Intell. 2020, 90, 103557. [Google Scholar] [CrossRef]
  26. Wu, S.; Li, Y.; Zhu, K.; Zhang, G.; Liang, Y.; Ma, K.; Xiao, C.; Zhang, H.; Yang, B.; Chen, W.; et al. SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval. arXiv 2024. [Google Scholar] [CrossRef]
  27. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
  28. Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S.; Kakao Brain Large-Scale AI Studio. Coyo-700m: Image-Text Pair Dataset. GitHub. 2022. Available online: https://github.com/kakaobrain/coyo-dataset (accessed on 4 September 2025).
  29. Chen, W.; Hu, H.; Chen, X.; Verga, P.; Cohen, W. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5558–5570. [Google Scholar]
  30. Cheng, X.; Cao, B.; Ye, Q.; Zhu, Z.; Li, H.; Zou, Y. ML-LMCL: Mutual learning and large-margin contrastive learning for improving ASR robustness in spoken language understanding. Find. Assoc. Comput. Linguist. ACL 2023, 2023, 6492–6505. [Google Scholar]
  31. Gao, P.; Han, J.; Zhang, R.; Lin, Z.; Geng, S.; Zhou, A.; Zhang, W.; Lu, P.; He, C.; Yue, X.; et al. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv 2023, arXiv:2304.15010. [Google Scholar]
  32. Goldsack, T.; Zhang, Z.; Lin, C.; Scarton, C. Domain-Driven and Discourse-Guided Scientific Summarisation. In European Conference on Information Retrieval; Springer: Cham, Switzerland, 2023; pp. 361–376. [Google Scholar]
  33. Feki, I.; Ba, A.; Alimi, A. New process to identify audio concepts based on binary classifiers encapsulation. Int. J. Comput. Electr. Eng. 2012, 4, 515–518. [Google Scholar] [CrossRef]
  34. Elleuch, N.; Zarka, M.; Feki, I.; Ba, A.; Alimi, A. Regimvid at Trecvid2010: Semantic Indexing. In Proceedings of the TRECVID 2010 Workshop, Gaithersburg, MD, USA, 15–17 November 2010. [Google Scholar] [CrossRef]
  35. Elleuch, N.; Ba, A.; Alimi, A. A generic framework for semantic video indexing based on visual concepts/contexts detection. Multimed. Tools Appl. 2014, 74, 1397–1421. [Google Scholar] [CrossRef]
  36. Smeulders, A.; Worring, M.; Santini, S.; Gupta, A.; Jain, R. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1349–1380. [Google Scholar] [CrossRef]
  37. Toriah, S.T.M.; Ghalwash, A.Z.; Youssif, A.A.A. Semantic-based video retrieval survey. J. Comput. Commun. 2023, 6, 28–44. [Google Scholar] [CrossRef]
  38. Sjoberg, M.; Viitaniemi, V.; Koskela, M.; Laaksonen, J. PicSOM Experiments in TRECVID 2009. Available online: https://research.cs.aalto.fi/cbir/papers/trecvid2009.pdf (accessed on 4 September 2025).
  39. Slimi, J.; Mansouri, S.; Ammar, A.B.; Alimi, A.M. Video exploration tool based on semantic network. In Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, OAIR ’13, Lisbon, Portugal, 15–17 May 2013; pp. 213–214. [Google Scholar]
  40. Slimi, J.; Ammar, A.B.; Alimi, A.M. Interactive Video Data Visualization System Based on Semantic Organization. In Proceedings of the 2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI), Veszprem, Hungary, 17–19 June 2013; pp. 161–166. [Google Scholar] [CrossRef]
  41. Amato, G.; Bolettieri, P.; Carrara, F.; Falchi, F.; Gennaro, C.; Messina, N.; Vadicamo, L.; Vairo, C. VISIONE at video browser showdown 2023. In MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, Jan. 9–12, 2023, Proceedings, Part I; Springer International Publishing: Cham, Switzerland, 2023; pp. 615–621. [Google Scholar]
  42. Fang, H.; Xiong, P.; Xu, L.; Chen, Y. Clip2video: Mastering video-text retrieval via image clip. arXiv 2021, arXiv:2106.11097. [Google Scholar]
  43. Messina, N.; Stefanini, M.; Cornia, M.; Baraldi, L.; Falchi, F.; Amato, G.; Cucchiara, R. ALADIN: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. arXiv 2022, arXiv:2207.14757. [Google Scholar]
  44. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
  45. Amato, G.; Bolettieri, P.; Carrara, F.; Debole, F.; Falchi, F.; Gennaro, C.; Vadicamo, L.; Vairo, C. The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval. J. Imaging 2021, 7, 76. [Google Scholar] [CrossRef] [PubMed]
  46. Amato, G.; Carrara, F.; Falchi, F.; Gennaro, C.; Vadicamo, L. Large-scale instance-level image retrieval. Inf. Process. Manag. 2019, 56, 102100. [Google Scholar] [CrossRef]
  47. Carrara, F.; Vadicamo, L.; Gennaro, C.; Amato, G. Approximate nearest neighbor search on standard search engines. In Similarity Search and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 214–221. [Google Scholar]
  48. Gurrin, C.; Zhou, L.; Healy, G.; Jónsson, B.Þ.; Dang-Nguyen, D.-T.; Lokoć, J.; Tran, M.-T.; Hürst, W.; Rossetto, L.; Schöffmann, K. Introduction to the fifth annual lifelog search challenge, LSC’22. In Proceedings of the ICMR ′22: International Conference on Multimedia Retrieval (ICMR’22), Newark, NJ, USA, 27–30 June 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  49. Heller, S.; Gsteiger, V.; Bailer, W.; Gurrin, C.; Jónsson, B.Þ.; Lokoč, J.; Leibetseder, A.; Mejzlík, F.; Peška, L.; Rossetto, L.; et al. Interactive video retrieval evaluation at a distance: Comparing sixteen interactive video search systems in a remote setting at the 10th Video Browser Showdown. Int. J. Multimed. Inf. Retr. 2022, 11, 1–18. [Google Scholar] [CrossRef]
  50. Lokoč, J.; Bailer, W.; Schoeffmann, K.; Muenzer, B.; Awad, G. On influential trends in interactive video retrieval: Video Browser Showdown 2015–2017. IEEE Trans. Multimed. 2018, 20, 3361–3376. [Google Scholar] [CrossRef]
  51. Lokoč, J.; Vopálková, Z.; Dokoupil, P.; Peška, L. Video search with CLIP and interactive text query reformulation. In MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, 9–12 January 2023, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2023; pp. 628–633. [Google Scholar]
  52. Halima, B.H.; Hamroun, M.; Moussa, S.B.; Alimi, A.M. An interactive engine for multilingual video browsing using semantic content. arXiv 2013. [Google Scholar] [CrossRef]
  53. Zhang, Z.; Li, W.; Gurrin, C.; Smeaton, A.F. Faceted navigation for browsing large video collection. In MultiMedia Modeling; Tian, Q., Sebe, N., Qi, G.J., Huet, B., Hong, R., Liu, X., Eds.; Springer: Cham, Switzerland, 2016; pp. 412–417. [Google Scholar] [CrossRef]
  54. Galanopoulos, D.; Markatopoulou, F.; Mezaris, V.; Patras, I. Concept language models and event-based concept number selection for zero-example event detection. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval ICMR ’17, New York, NY, USA, 6–9 June 2017; pp. 397–401. [Google Scholar] [CrossRef]
  55. Janwe, N.; Bhoyar, K. Semantic concept based video retrieval using convolutional neural network. SN Appl. Sci. 2020, 2, 80. [Google Scholar] [CrossRef]
  56. Amato, F.; Greco, L.; Persia, F.; Poccia, S.R.; De Santo, A. Content-Based Multimedia Retrieval. In Data Management in Pervasive Systems; Colace, F., De Santo, M., Moscato, V., Picariello, A., Schreiber, F.A., Tanca, L., Eds.; Springer: Cham, Switzerland, 2015; pp. 291–310. [Google Scholar] [CrossRef]
  57. Faudemay, P.; Seyrat, C. Intelligent delivery of personalised video programmes from a video database. In Proceedings of the Database and Expert Systems Applications, 8th International Conference (DEXA ’97), Toulouse, France, 1–2 September 1997; pp. 172–177. [Google Scholar] [CrossRef]
  58. Meng, L.; Tan, A.H.; Xu, D. Semi-Supervised Heterogeneous Fusion for Multimedia Data Co-Clustering. IEEE Trans. Knowl. Data Eng. 2013, 26, 2293–2306. [Google Scholar] [CrossRef]
  59. Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 439–448. [Google Scholar] [CrossRef]
  60. Xu, J.; Huang, F.; Zhang, X.; Wang, S.; Li, C.; Li, Z.; He, Y. Visual-textual sentiment classification with bi-directional multi-level attention networks. Knowl.-Based Syst. 2019, 178, 61–73. [Google Scholar] [CrossRef]
  61. Xu, J.; Huang, F.; Zhang, X.; Wang, S.; Li, C.; Li, Z.; He, Y. Sentiment analysis of social images via hierarchical deep fusion of content and links. Appl. Soft Comput. 2019, 80, 387–399. [Google Scholar] [CrossRef]
  62. Huang, F.; Zhang, X.; Zhao, Z.; Xu, J.; Li, Z. Image-text sentiment analysis via deep multimodal attentive fusion. Knowl.-Based Syst. 2019, 167, 26–37. [Google Scholar] [CrossRef]
  63. Yadav, A.; Vishwakarma, D.K. Sentiment analysis using deep learning architectures: A review. Artif. Intell. Rev. 2019, 53, 4335–4385. [Google Scholar] [CrossRef]
  64. Xu, N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; pp. 152–154. [Google Scholar] [CrossRef]
  65. Chen, F.; Ji, R.; Su, J.; Cao, D.; Gao, Y. Predicting microblog sentiments via weakly supervised multimodal deep learning. IEEE Trans. Multimed. 2017, 20, 997–1007. [Google Scholar] [CrossRef]
  66. Zhao, Z.; Zhu, H.; Xue, Z.; Liu, Z.; Tian, J.; Chua, M.; Liu, M. An image-text consistency driven multimodal sentiment analysis approach for social media. Inf. Process. Manag. 2019, 56, 102097. [Google Scholar] [CrossRef]
  67. Yu, J.; Jiang, J.; Xia, R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 429–439. [Google Scholar] [CrossRef]
  68. Liu, A.A.; Shao, Z.; Wong, Y.; Li, J.; Yu-Ting, S.; Kankanhalli, M. LSTM-based multi-label video event detection. Multimed. Tools Appl. 2019, 78, 677–695. [Google Scholar] [CrossRef]
  69. Shao, Z.; Han, J.; Debattista, K.; Pang, Y. Textual context-aware dense captioning with diverse words. IEEE Trans. Multimed. 2023, 25, 8753–8766. [Google Scholar] [CrossRef]
  70. Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling up vision-language pretraining for image captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17959–17968. [Google Scholar]
  71. Shao, Z.; Han, J.; Marnerides, D.; Debattista, K. Region-object relation-aware dense captioning via transformer. IEEE Trans. Neural Netw. Learn. Syst. 2022, 36, 4184–4195. [Google Scholar] [CrossRef]
  72. Mahrishi, M.; Morwal, S.; Muzaffar, A.W.; Bhatia, S.; Dadheech, P.; Rahmani, M.K.I. Rahmani Video Index Point Detection and Extraction Framework Using Custom YoloV4 Darknet Object Detection Model. IEEE Access 2021, 9, 143378–143391. [Google Scholar] [CrossRef]
  73. Riedl, M.; Biemann, C. TopicTiling: A text segmentation algorithm based on LDA. In Proceedings of the ACL 2012 Student Research Workshop, Jeju Island, Republic of Korea, 9–11 July 2012; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 37–42. [Google Scholar]
  74. Uke, N. Segmentation and organization of lecture video based on visual contents. Int. J. e-Educ. e-Bus e-Manag. e-Learn. 2012, 2, 132. [Google Scholar] [CrossRef]
  75. Podlesnaya, A.; Podlesnyy, S. Deep learning based semantic video indexing and retrieval. In Proceedings of the SAI Intelligent Systems Conference (IntelliSys) 2016, London, UK, 21–22 September 2016; Springer: Cham, Switzerland; pp. 359–372. [Google Scholar]
  76. Lu, W.; Sun, H.; Chu, J.; Huang, X.; Yu, J. A novel approach for video text detection and recognition based on a corner response feature map and transferred deep convolutional neural network. IEEE Access 2018, 6, 40198–40211. [Google Scholar] [CrossRef]
  77. Li, Z.; Liu, X.; Zhang, S. Shot boundary detection based on multilevel difference of colour histograms. In Proceedings of the 2016 First International Conference on Multimedia and Image Processing (ICMIP), Bandar Seri Begawan, Brunei, 1–3 June 2016; pp. 15–22. [Google Scholar]
  78. Xu, J.; Song, L.; Xie, R. Shot boundary detection using convolutional neural networks. In Proceedings of the 2016 Visual Communications and Image Processing (VCIP), Chengdu, China, 27–30 November 2016; pp. 1–4. [Google Scholar]
  79. Gao, J.; Xu, C. Learning Video Moment Retrieval Without a Single Annotated Video. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1646–1657. [Google Scholar] [CrossRef]
  80. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.; Liu, W.; et al. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  81. Tang, J.; Wang, K.; Shao, L. Supervised Matrix Factorization Hashing for Cross-Modal Retrieval. IEEE Trans. Image Process. 2016, 25, 3157–3166. [Google Scholar] [CrossRef] [PubMed]
  82. Tang, J.; Li, Z.; Zhu, X. Supervised deep hashing for scalable face image retrieval. Pattern Recognit. 2018, 75, 25–32. [Google Scholar] [CrossRef]
  83. Liong, V.E.; Lu, J.; Wang, G.; Moulin, P.; Zhou, J. Deep hashing for compact binary codes learning. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2475–2483. [Google Scholar]
  84. Li, W.-J.; Wang, S.; Kang, W.-C. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 1711–1717. [Google Scholar]
  85. Do, T.-T.; Doan, A.-D.; Cheung, N.-M. Learning to Hash With Binary Deep Neural Network. In Lecture Notes in Computer Science; Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 219–234. [Google Scholar]
  86. Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv 2018, arXiv:1707.05612. [Google Scholar]
  87. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  88. Jin, L.; Li, Z.; Tang, J. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 1838–1851. [Google Scholar] [CrossRef]
  89. Ding, G.; Guo, Y.; Zhou, J. Collective Matrix Factorization Hashing for Multimodal Data. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2075–2082. [Google Scholar]
  90. Zhou, J.; Ding, G.; Guo, Y. Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the SIGIR ’14: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, Gold Coast, QLD, Australia, 6–11 July 2014; pp. 415–424. [Google Scholar]
  91. Lin, Z.; Ding, G.; Hu, M.; Wang, J. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3864–3872. [Google Scholar]
  92. Masci, J.; Bronstein, M.M.; Bronstein, A.M.; Schmidhuber, J. Multimodal similarity-preserving hashing. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 824–830. [Google Scholar] [CrossRef]
  93. Jiang, Q.-Y.; Li, W.-J. Deep cross-modal hashing. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3270–3278. [Google Scholar]
  94. Kunal, B.; Kaur, K.; Choudhary, C. A Machine learning model for content-based image retrieval. In Proceedings of the 2023 2nd International Conference for Innovation in Technology (INOCON), Bangalore, India, 3–5 March 2023; pp. 1–6. [Google Scholar] [CrossRef]
  95. Manjunathi, B.S.; Ma, W.Y. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 18, 837–842. [Google Scholar] [CrossRef]
  96. Deng, Y.; Manjunath, B.S. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 800–810. [Google Scholar] [CrossRef]
  97. Park, B.; Park, H.; Lee, S.M.; Seo, J.B.; Kim, N. Lung segmentation on HRCT and volumetric CT for diffuse interstitial lung disease using deep convolutional neural networks. J. Digit. Imaging 2019, 32, 1019–1026. [Google Scholar] [CrossRef]
  98. Travis, W.D.; Costabel, U.; Hansell, D.M.; King, T.E., Jr.; Lynch, D.A.; Nicholson, A.G.; Ryerson, C.J.; Ryu, J.H.; Selman, M.; Wells, A.U.; et al. An official American Thoracic Society/European Respiratory Society statement: Update of the international multidisciplinary classification of the idiopathic interstitial pneumonias. Am. J. Respir. Crit. Care Med. 2013, 188, 733–748. [Google Scholar] [CrossRef]
  99. Kunal, P.; Singh, P.; Hirani, N. A Cohesive relation between cybersecurity and information security. In Proceedings of the 2022 IEEE 3rd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 7–9 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
  100. Hwang, H.J.; Seo, J.B.; Lee, S.M.; Kim, E.Y.; Park, B.; Bae, H.J.; Kim, N. Content-based image retrieval of chest CT with convolutional neural network for diffuse interstitial lung disease: Performance assessment in three major idiopathic interstitial pneumonias. Korean J. Radiol. 2021, 22, 281–290. [Google Scholar] [CrossRef] [PubMed]
  101. Duan, G.; Yang, J.; Yang, Y. Content-based image retrieval research. Phys. Procedia 2011, 22, 471–477. [Google Scholar] [CrossRef]
  102. Latif, A.; Rasheed, A.; Sajid, U. Content-based image retrieval and feature extraction: A comprehensive review. Math. Probl. Eng. 2019, 2019, 9658350. [Google Scholar] [CrossRef]
  103. Depeursinge, A.; Vargas, A.; Gaillard, F.; Platon, A.; Geissbuhler, A.; Poletti, P.-A.; Müller, H. Case-based lung image categorization and retrieval for interstitial lung diseases: Clinical workflows. Int. J. CARS 2012, 7, 97–110. [Google Scholar] [CrossRef]
  104. Raghu, G.; Collard, H.R.; Egan, J.J.; Martinez, F.J.; Behr, J.; Brown, K.K.; Colby, T.V.; Cordier, J.-F.; Flaherty, K.R.; Lasky, J.A.; et al. An official ATS/ERS/JRS/ALAT statement: Idiopathic pulmonary fibrosis: Evidence-based guidelines for diagnosis and management. Am. J. Respir. Crit. Care Med. 2011, 183, 788–824. [Google Scholar] [CrossRef]
  105. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  106. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Cambridge, MA, USA, 2019; Volume 97, pp. 6105–6114. [Google Scholar]
  107. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 10096–10106. [Google Scholar]
  108. Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  109. Fadaei, S.; Amirfattahi, R.; Ahmadzadeh, M.R. A new content-based image retrieval system based on optimized inte-gration of DCD, wavelet and curvelet features. IET Image Process. 2017, 11, 89–98. [Google Scholar] [CrossRef]
  110. Dubey, S.R.; Singh, S.K.; Singh, R.K. Rotation and scale invariant hybrid image descriptor and retrieval. Comput. Electr. Eng. 2015, 46, 288–302. [Google Scholar] [CrossRef]
  111. Talib, A.; Mahmuddin, M.; Husni, H.; George, L.E. A weighted dominant color descriptor for content-based image retrieval. J. Vis. Commun. Image Represent. 2013, 24, 345–360. [Google Scholar] [CrossRef]
  112. Jhanwar, N.; Chaudhuri, S.; Seetharaman, G.; Zavidovique, B. Content-based image retrieval using motif co-occurrence matrix. Image Vis. Comput. 2004, 22, 1211–1220. [Google Scholar] [CrossRef]
  113. Lin, C.-H.; Chen, R.-T.; Chan, Y.-K. A smart content-based image retrieval system based on color and texture feature. Image Vis. Comput. 2009, 27, 658–666. [Google Scholar] [CrossRef]
  114. ElAlami, M.E. A novel image retrieval model based on the most relevant features. Knowl.-Based Syst. 2011, 24, 23–32. [Google Scholar] [CrossRef]
  115. Murala, S.; Maheshwari, R.P.; Balasubramanian, R. Local tetra patterns: A new feature descriptor for content-based image retrieval. IEEE Trans. Image Process. 2012, 21, 2874–2886. [Google Scholar] [CrossRef]
  116. Kundu, M.K.; Chowdhury, M.; Bulo, S.R. A graph-based relevance feedback mechanism in content-based image retrieval. Knowl.-Based Syst. 2015, 73, 254–264. [Google Scholar] [CrossRef]
  117. Yildizer, E.; Balci, A.M.; Jarada, T.N.; Alhajj, R. Integrating wavelets with clustering and indexing for effective content-based image retrieval. Knowl.-Based Syst. 2012, 31, 55–66. [Google Scholar] [CrossRef]
  118. Hamroun, M.; Lajmi, S.; Nicolas, H.; Amous, I. ISE: Interactive image search using visual content. In 20th International Conference, ICEIS 2018, Funchal, Madeira, Portugal, 21–24 March 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 253–261. [Google Scholar]
Figure 1. Conceptual architecture of our approach.
Figure 2. Indexing phase.
Figure 3. DWConv stands for Depthwise Convolution. The values 1 × 1/3 × 3 indicate the size of the kernel used. BN refers to Batch Normalization. H, W, and F represent the height, width, and depth of the tensor, respectively. The multiplier indicates the number of repetitions of the layers, ranging from 1 to 4. (Figure adapted from the original EfficientNet study).
Figure 4. Proposed Architecture: The input size is 224 × 224 × 3, with pre-trained EfficientNet B1 weights on ImageNet. This is followed by a zero-padding layer, a convolutional layer with a 3 × 3 kernel, then a GAP layer, a Dropout of 0.2, a Dense layer with 1024 units, and finally a classification layer (a Dense layer with 30 classes).
Figure 5. Block diagram of the set-based network. The CNN used is EfficientNetB1, while the ViT used is BEiT.
Figure 6. Schematic representation of the model training process, including making predictions using the KNN machine learning algorithm to determine the optimal K value based on the highest F1-macro score. The identified parameters are saved and later used for predictions on the test data.
Figure 7. An overview of the model application for test image classification. The input images are fed into the model, where they are encoded to generate feature vectors. The network then measures the distances between these feature vectors and all training images. Using the KNN algorithm, the predicted class for the input image is determined based on its proximity to the training samples.
Figure 8. Concept weighting.
Figure 9. XML file representing the weighting of concepts in the videos.
Figure 10. Similarity between concepts. (a) XML file representing the similarity between concepts. (b) Representation of inter-concept links.
Figure 11. Ontology construction. (a) XML file representing the different contexts. (b) Semantic concept network.
Figure 12. Excerpt from our ontology, result of the indexing phase.
Figure 13. Query Expansion Approach.
Figure 14. Presentation of the CBIR Architecture.
Figure 15. Retrieval by Textual Query.
Figure 16. Result of the “news” Query.
Figure 17. Accuracy and Loss Curves for EfficientNet, ViT, and EfficientNet + ViT models.
Figure 18. Comparison of the Performance and Average Precision Between Existing Methods and the Proposed Method.
Figure 19. Experimental Model.
Figure 20. Ablation Study.
Figure 21. F1-score and mAP for different concepts.
Table 1. EfficientNet Base Network: Architecture B0.
Stage | Operation | Output Size | #Channels | #Layers
1 | Conv 3 × 3 | 224 × 224 | 32 | 1
2 | MBConv1, k3 × 3 | 112 × 112 | 16 | 1
3 | MBConv6, k3 × 3 | 112 × 112 | 24 | 2
4 | MBConv6, k5 × 5 | 56 × 56 | 40 | 2
5 | MBConv6, k3 × 3 | 28 × 28 | 80 | 3
6 | MBConv6, k5 × 5 | 14 × 14 | 112 | 3
7 | MBConv6, k5 × 5 | 14 × 14 | 192 | 4
8 | MBConv6, k3 × 3 | 7 × 7 | 320 | 1
9 | Conv 1 × 1/Pool/FC | 7 × 7 | 1280 | 1
Table 2. Descriptor vector.
Concept | Description (Descriptor Vector)
Actor | One or more television or movie actors or actresses
Adult | Shots showing a person over the age of 18
Airplane | Shots of an airplane
Airplane Flying | An airplane flying in the sky
Animal | Shots depicting an animal (no humans)
Asian people | People of Asian ethnicity
(Excerpt from the ontology concepts.)
Table 3. Comparison of the Performance and Average Precision of Existing Methods and the Proposed Method.
Corel-1K Ref. | Africa | Beach | Building | Bus | Dinosaur | Elephant | Flower | Horse | Mountain | Food | Average
[109] | 72.4 | 51.15 | 59.55 | 92.35 | 99.9 | 72.7 | 92.25 | 96.6 | 55.75 | 72.35 | 76.5
[110] | 45.25 | 39.75 | 37.35 | 74.1 | 91.45 | 30.4 | 85.15 | 56.8 | 29.25 | 36.95 | 52.64
[111] | 68.3 | 54.0 | 56.15 | 88.8 | 99.25 | 65.8 | 89.1 | 80.25 | 52.15 | 73.25 | 72.7
[112] | 70.3 | 56.1 | 57.1 | 87.6 | 98.7 | 67.5 | 91.4 | 83.4 | 53.6 | 74.1 | 73.98
[113] | 54.95 | 39.4 | 39.6 | 84.3 | 94.7 | 36.0 | 85.85 | 57.5 | 29.45 | 56.7 | 57.85
[114] | 49.95 | 71.25 | 30.1 | 79.75 | 92.05 | 59.45 | 99.5 | 82.25 | 54.6 | 20.2 | 63.91
[115] | 73.05 | 59.35 | 61.1 | 69.15 | 99.15 | 80.1 | 80.15 | 89.1 | 58.0 | 74.5 | 74.36
[116] | 68.95 | 41.1 | 74.3 | 64.4 | 99.55 | 56.65 | 86.55 | 93.2 | 55.15 | 77.95 | 71.78
[117] | 59.9 | 50.85 | 50.15 | 94.0 | 97.6 | 46.65 | 87.5 | 76.5 | 35.25 | 56.25 | 65.47
[118] | 88.5 | 79.5 | 67.65 | 100 | 100 | 93.1 | 100 | 100 | 77.75 | 89.3 | 89.5
Our system | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Table 4. Ablation Study Table.
ID | Semantic Concept | Low-Level Features | EfficientNet | ViT | EfficientNet + ViT | EfficientNet + ViT + Knowledge Graph
1 | Airplane | 0.23 | 0.45 | 0.55 | 0.90 | 0.95
2 | Anchorperson | 0.43 | 0.88 | 0.77 | 0.97 | 0.99
3 | Basketball | 0.16 | 0.16 | 0.34 | 0.88 | 0.93
4 | Bicycling | 0.33 | 0.56 | 0.55 | 0.78 | 0.89
5 | Boat_Ship | 0.40 | 0.61 | 0.71 | 0.78 | 0.89
6 | Bridges | 0.15 | 0.35 | 0.31 | 0.88 | 0.94
7 | Bus | 0.09 | 0.17 | 0.22 | 0.66 | 0.77
8 | Car_Racing | 0.03 | 0.03 | 0.03 | 0.55 | 0.75
9 | Cheering | 0.02 | 0.11 | 0.11 | 0.76 | 0.79
10 | Computers | 0.21 | 0.57 | 0.45 | 0.79 | 0.85
11 | Dancing | 0.08 | 0.08 | 0.12 | 0.72 | 0.84
12 | Demonstration_Or_Protest | 0.13 | 0.35 | 0.39 | 0.85 | 0.90
13 | Explosion_Fire | 0.08 | 0.18 | 0.14 | 0.79 | 0.89
14 | Government-Leader | 0.35 | 0.57 | 0.66 | 0.93 | 0.98
15 | Instrumental_Musician | 0.40 | 0.80 | 0.86 | 0.95 | 0.98
16 | Kitchen | 0.32 | 0.55 | 0.59 | 0.95 | 0.98
17 | Motorcycle | 0.11 | 0.19 | 0.33 | 0.70 | 0.79
18 | Office | 0.32 | 0.40 | 0.53 | 0.90 | 0.97
19 | Old_People | 0.10 | 0.21 | 0.34 | 0.65 | 0.88
20 | Press_Conference | 0.09 | 0.23 | 0.29 | 0.55 | 0.78
21 | Running | 0.12 | 0.24 | 0.25 | 0.78 | 0.89
22 | Telephones | 0.11 | 0.29 | 0.25 | 0.78 | 0.89
23 | Throwing | 0.01 | 0.05 | 0.11 | 0.55 | 0.75
24 | Flags | 0.23 | 0.33 | 0.29 | 0.77 | 0.79
25 | Hill | 0.17 | 0.25 | 0.35 | 0.89 | 0.95
26 | Lakes | 0.21 | 0.21 | 0.34 | 0.85 | 0.93
27 | Quadruped | 0.33 | 0.55 | 0.65 | 0.94 | 0.98
28 | Soldiers | 0.24 | 0.31 | 0.43 | 0.80 | 0.89
29 | Studio_With_Anchorperson | 0.40 | 0.77 | 0.88 | 0.92 | 0.97
30 | Traffic | 0.14 | 0.27 | 0.44 | 0.82 | 0.95
Table 5. Results of the hyperparameters, including precision, sensitivity, F1 score, and mAP.
Concept | Precision | Sensitivity | F1-Score | mAP@0.5 | mAP@[0.5:0.95]
Airplane | 0.95 | 0.92 | 0.94 | 0.96 | 0.75
Anchorperson | 0.99 | 0.97 | 0.98 | 0.99 | 0.80
Basketball | 0.93 | 0.90 | 0.92 | 0.94 | 0.72
Bicycling | 0.89 | 0.87 | 0.88 | 0.91 | 0.70
Boat_Ship | 0.89 | 0.85 | 0.87 | 0.90 | 0.68
Bridges | 0.94 | 0.90 | 0.92 | 0.95 | 0.74
Bus | 0.77 | 0.75 | 0.76 | 0.80 | 0.60
Car_Racing | 0.75 | 0.72 | 0.73 | 0.78 | 0.58
Cheering | 0.79 | 0.76 | 0.77 | 0.81 | 0.63
Computers | 0.85 | 0.83 | 0.84 | 0.87 | 0.69
Dancing | 0.84 | 0.81 | 0.83 | 0.86 | 0.67
Demonstration_Or_Protest | 0.90 | 0.87 | 0.89 | 0.92 | 0.73
Explosion_Fire | 0.89 | 0.86 | 0.88 | 0.91 | 0.70
Government-Leader | 0.98 | 0.96 | 0.97 | 0.99 | 0.79
Instrumental_Musician | 0.98 | 0.95 | 0.96 | 0.98 | 0.78
Kitchen | 0.98 | 0.94 | 0.96 | 0.98 | 0.77
Motorcycle | 0.79 | 0.76 | 0.77 | 0.82 | 0.64
Office | 0.97 | 0.94 | 0.95 | 0.98 | 0.76
Old_People | 0.88 | 0.85 | 0.86 | 0.90 | 0.71
Press_Conference | 0.78 | 0.75 | 0.76 | 0.80 | 0.62
Running | 0.89 | 0.86 | 0.88 | 0.91 | 0.70
Telephones | 0.89 | 0.85 | 0.87 | 0.90 | 0.69
Throwing | 0.75 | 0.72 | 0.73 | 0.78 | 0.57
Flags | 0.79 | 0.75 | 0.77 | 0.81 | 0.60
Hill | 0.95 | 0.91 | 0.93 | 0.96 | 0.75
Lakes | 0.93 | 0.89 | 0.91 | 0.94 | 0.73
Quadruped | 0.98 | 0.96 | 0.97 | 0.99 | 0.79
Soldiers | 0.89 | 0.86 | 0.88 | 0.91 | 0.70
Studio_With_Anchorperson | 0.97 | 0.94 | 0.95 | 0.98 | 0.76
Traffic | 0.95 | 0.91 | 0.93 | 0.96 | 0.75
Avg | 0.95 | 0.91 | 0.93 | 0.96 | 0.74
