1. Introduction
Accurate disease diagnosis remains the crucial first step before initiating any treatment, and in this context medical imaging plays a pivotal role in clinical analysis. In recent years, the utilization of imaging examinations such as MRI and CT scans has continued to rise [1], contributing to a growing need for faster and more accurate image analysis tools.
Medical imaging is primarily regarded as a technique for visualizing functional and structural features of the human body through modalities such as x-rays and magnetic resonance imaging. Moreover, the integration of imaging technologies within healthcare networks has significantly enhanced the quality and efficiency of medical services, driven by advances in computer science and imaging engineering.
The interpretation of medical images remains a time-consuming and expertise-demanding task, prone to variability and potential errors, especially in resource-constrained environments [
2]. To address these challenges, the automatic generation of captions for radiology images has emerged as a promising solution. In this work, caption generation refers to the automatic assignment of multiple medical concept labels (UMLS CUIs) to radiology images, rather than the generation of full natural language sentences. The task is addressed as a multilabel classification problem. Automatically generating descriptive labels can assist radiologists by speeding up analysis and reducing diagnostic errors.
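To make the multilabel formulation concrete, the sketch below shows how per-image sets of UMLS CUIs can be turned into multi-hot target vectors suitable for a sigmoid-output classifier. The file names and CUI codes are hypothetical and only for illustration; the actual preprocessing pipeline may differ.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical example: each radiology image is annotated with a set of UMLS CUIs.
image_cuis = {
    "img_001.jpg": ["C0040405", "C0024109"],              # e.g., CT and lung
    "img_002.jpg": ["C0024485"],                           # e.g., MRI
    "img_003.jpg": ["C0040405", "C0024109", "C0032227"],   # e.g., CT, lung, effusion
}

# Build a fixed concept vocabulary and turn each label set into a multi-hot vector.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(image_cuis.values())

print(mlb.classes_)  # concept vocabulary (one column per CUI)
print(y)             # one multi-hot row per image, the target of the multilabel classifier
```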
A deep learning-based system capable of automatically generating labels for radiology images is proposed, leveraging transfer learning strategies. Pre-trained models on large datasets such as ImageNet are fine-tuned on the specialized ImageCLEF medical caption detection dataset [
3], enabling adaptation to the medical imaging domain.
Advanced deep learning methodologies have been extensively applied to classification tasks within the medical domain [
4]. Convolutional Neural Networks (CNNs), pre-trained on large-scale datasets such as ImageNet, are frequently employed as image encoders for medical imaging applications. However, ImageNet primarily comprises images from general visual scenes, which differ significantly from medical images in both content and texture. Consequently, fine-tuning CNN models is often required to optimize performance when adapting to the medical imaging domain. In [
5], a refined version of the Inception V3 CNN architecture was developed to classify skin lesions as benign or malignant, achieving diagnostic accuracy comparable to that of experienced dermatologists. These results demonstrate that CNNs originally trained on non-medical datasets can be effectively repurposed for medical imaging tasks, provided that domain-specific refinements are appropriately implemented.
The medical concept labels describe radiology images in detail, and since each image may be associated with multiple labels, the task is naturally formulated as a multilabel classification problem.
1.1. Medical Imaging Modalities
Radiologists specialize in the diagnosis and treatment of diseases by utilizing a range of imaging modalities, including x-rays, angiography, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), nuclear medicine, and sonography. Each modality differs fundamentally in the technique used to acquire medical images.
Utilization rates for MRI, CT, and other imaging technologies have continued to climb in both Ontario, Canada, and the United States, based on a study of over 135 million imaging examinations conducted by Kaiser Permanente, UC San Francisco, and UC Davis. This trend has raised concerns about potential overuse of medical imaging.
A study published on 3 September 2019 in the Journal of the American Medical Association (JAMA) provided the first comprehensive analysis of imaging trends across diverse populations. Although the early 2000s saw a modest decline in imaging utilization, a subsequent rise in MRI and CT usage has been observed, particularly among adult outpatient groups, with pediatric CT utilization showing a decline in recent years.
Figure 1 illustrates the trends in the use of medical imaging technologies within the U.S. healthcare system.
1.2. Problem Statement
The diagnostic process for medical images can benefit significantly from natural language processing (NLP) and computer vision methodologies, particularly with the recent advancements in deep learning. Concept annotation has become a common approach to predict multiple medical concepts associated with radiology images, as shown in recent works [
6,
7,
8]. This constitutes a fundamental step toward the broader goal of full caption generation. Automatic labeling methods can reduce diagnostic errors and operational costs in medical departments. A primary objective of automated caption generation systems is to assist radiologists by reducing the time required for image analysis and minimizing potential diagnostic inaccuracies.
1.3. Research Objective
The objective of this research is to improve the efficiency of radiological image analysis.
The developed system labels radiology images with descriptive medical concepts, supporting radiologists in the preparation of diagnostic reports. Automating information extraction from medical images facilitates faster comprehension, reduces human error, and contributes to improved clinical decision-making processes.
Healthcare institutions could adopt such systems to enhance disease diagnosis and classification tasks.
Future work may involve leveraging the detected medical concepts to generate full, coherent textual captions using natural language processing (NLP) models, thereby completing the full image captioning pipeline.
1.4. On the Automatic Generation of Medical Imaging Information
Medical imaging, including radiography and pathology imaging, constitutes an essential component of diagnostic and therapeutic workflows in clinical settings.
A multi-task learning framework was introduced, capable of simultaneously predicting image tags and generating descriptive reports. This framework incorporates a co-attention mechanism designed to jointly capture visual and semantic information, thereby enhancing the precision in identifying and describing anomalous regions. Additionally, a hierarchical Long Short-Term Memory (LSTM) architecture was developed to effectively model long-range semantic dependencies and to generate high-quality, coherent textual reports.
The performance of the proposed methods was validated on two prominent medical datasets: IU X-Ray [
9] and PEIR Gross [
10] datasets. The evaluation employed the BLEU (Bilingual Evaluation Understudy) metric, where “n” denotes the n-gram window size used for comparison between generated and reference texts.
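As a concrete illustration of how BLEU-n is typically computed (shown here with NLTK, which is not necessarily the evaluation script used in the cited work), consider the following toy example with one reference report and one generated sentence:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the heart size and pulmonary vasculature are within normal limits".split()]
candidate = "heart size and pulmonary vasculature appear normal".split()

smooth = SmoothingFunction().method1
# BLEU-n uses uniform weights over 1..n grams; e.g., BLEU-2 weights = (0.5, 0.5).
bleu_1 = sentence_bleu(reference, candidate, weights=(1.0,), smoothing_function=smooth)
bleu_2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU-1: {bleu_1:.3f}, BLEU-2: {bleu_2:.3f}")
```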
The IU X-Ray dataset comprises radiology images accompanied by structured reports organized into four sections: Comparison, Indication, Findings, and Impression. Conversely, the PEIR Gross dataset contains 7443 gross pathology images, each associated with a single descriptive caption sourced from 21 sub-categories of the PEIR (Pathology Education Instructional Resource) library, designed for medical education.
The experimental results are summarized in
Table 1, indicating the BLEU-n scores achieved on both datasets.
1.5. Automatic Radiology Report Generation Based on Multi-View Image Fusion and Medical Concept Enrichment
A novel medical imaging report generation model focusing on radiology is proposed. Specifically, the inputs of the proposed framework are chest x-ray images from different views (frontal and lateral), based on which radiology reports are generated accordingly. Radiology reports contain the radiologists' summaries and are important for further diagnosis and follow-up recommendations [11].
The extensively utilized chest x-ray radiology report dataset (IU-RR) from Indiana University serves as an effective resource for this task. Radiology reports typically consist of coherent narratives composed of multiple sentences or short passages summarizing diagnostic findings. Addressing the specific challenges of report generation, the encoder and decoder components of the architecture are enhanced in the following ways: firstly, a multi-task framework is proposed that jointly performs chest x-ray image classification and report generation. This framework has proven effective as the encoder learns to capture features pertinent to radiology-specific report generation. Given the limited size of the IU-RR dataset, encoder pre-training becomes essential to achieve robust performance.
Unlike previous studies relying on ImageNet, which is composed of general-purpose object recognition images, the encoder is pre-trained on large-scale chest x-ray images from the same domain [12] to capture domain-specific image features for better decoding.
Furthermore, initial approaches generally treated frontal and lateral chest x-ray views as independent modalities, overlooking their complementary diagnostic information. Studies demonstrate that lateral views provide essential information that frontal views alone may miss. Instead of simple feature concatenation or aggregation (e.g., mean, sum), a more structured and context-aware fusion of multi-view features is proposed.
Additionally, images from different views may otherwise produce inconsistent results for the same patient. They propose fusing multi-view information by applying a sentence-level attention model and forcing the encoder to extract consistent features through a cross-view consistency (CVC) loss. On the decoder side, a hierarchical LSTM (sentence- and word-level LSTM) is used to generate the radiology reports.
This integrated approach, by maintaining cross-view consistency and applying hierarchical sequence modeling, significantly enhances the quality and clinical relevance of the generated radiology reports.
1.6. The KERP Framework
The generation of comprehensive and semantically coherent descriptions for medical images poses significant challenges. This process requires bridging the gap between visual perception and linguistic representation, incorporating domain-specific medical knowledge, and ensuring the production of accurate and clinically meaningful descriptions. Addressing these complexities, a novel knowledge-driven approach named Encode, Retrieve, and Paraphrase (KERP) has been proposed. This method integrates traditional data-driven retrieval mechanisms with modern knowledge-based modeling to facilitate the generation of precise and context-aware medical reports [
13].
The KERP framework decomposes the task of medical report generation into two sequential phases: initially, the creation of a structured graph representing medical abnormalities detected in the visual input, followed by the generation of corresponding natural language descriptions. Specifically, the Encode module transforms extracted visual features into an organized abnormality graph by leveraging prior medical knowledge. Subsequently, the Retrieve module accesses relevant textual templates associated with the abnormalities identified in the graph. Finally, the Paraphrase module refines the retrieved templates into customized, case-specific medical narratives.
The core of KERP is a proposed generic implementation unit, the Graph Transformer (GTR), which dynamically transforms high-level semantics between graph-structured data from multiple domains such as knowledge graphs, images, and sequences. The Graph Transformer provides a versatile mechanism for dynamically translating between different semantic spaces, enabling effective interaction between visual, graphical, and textual representations.
The effectiveness of the KERP approach was validated on two benchmark datasets: IU X-Ray and CX-CHR. The IU X-Ray dataset contains chest radiographs paired with detailed radiology reports, while CX-CHR comprises a private collection of chest x-rays with corresponding clinical descriptions. This research [
14] uses BLEU-n to evaluate the performance of report generation, together with abnormality and disease classification.
Table 2 shows experimental results [
10].
The empirical findings demonstrate that KERP achieves state-of-the-art results on both datasets, with substantial improvements observed in terms of abnormality prediction accuracy, dynamic knowledge graph construction, and the generation of contextually accurate textual descriptions. The fundamental difference between that work and our research is that it focuses exclusively on chest x-rays, whereas the dataset used in our research is far more diverse.
Therefore, while KERP represents a notable advancement in the automatic generation of radiological reports from chest x-rays, the broader and more diverse scope addressed in the current study introduces additional challenges that necessitate further methodological innovations.
1.7. Deep Learning for Ultrasound Image Caption Generation Based on Object Detection
Significant advancements in image captioning have been achieved through deep learning methods applied to natural imagery. Nonetheless, there remains a notable lack of effective techniques specifically designed for the detailed analysis and automatic interpretation of disease-related content in ultrasound imaging. Ultrasound images, characterized by grayscale representations with low resolution and significant noise, present unique challenges, including indistinct boundaries between anatomical structures and interference between various pathological conditions. Furthermore, in clinical practice, although certain regions of the sonographic image may contain the primary diagnostic information, the entire image is essential for anatomical orientation and artifact assessment during interpretation. Similar challenges are also encountered in nuclear medicine imaging, where low spatial resolution and high noise levels complicate the identification and interpretation of pathological features.
These complexities render the direct analysis of ultrasound images particularly demanding. Historically, efforts within medical imaging understanding have predominantly centered on classification, detection, segmentation, and concept recognition tasks [
6]. To address these specific challenges, a novel approach for ultrasound image caption generation based on part detection was proposed. This method simultaneously identifies and encodes key focus areas within ultrasound images and subsequently utilizes a Long Short-Term Memory (LSTM) network to decode the encoded features into coherent textual descriptions, accurately conveying the pathological content identified within the focus regions.
The experimental results show that the technique can precisely locate the focus area and improves BLEU-1 and BLEU-2 scores with fewer parameters and shorter running time. The major distinction between the present research and the aforementioned study lies in the imaging modalities considered. Whereas the referenced work exclusively targets caption generation for ultrasound images, the current study extends its methodology to support multiple imaging modalities, thereby encompassing a broader range of diagnostic scenarios.
Experimental results reported by Zeng et al. demonstrated that the Faster RCNN model achieved a BLEU-1 score of 0.63, a BLEU-2 score of 0.55, and a BLEU-3 score of 0.47 [
15]. These outcomes highlight the effectiveness of integrating part detection with sequential decoding mechanisms, significantly enhancing the precision and clinical relevance of captions generated from ultrasound imagery.
1.8. ImageCLEF Caption Detection Task
The Cross-Language Evaluation Forum (CLEF) serves as an international platform dedicated to advancing research in cross-linguistic information retrieval [
16]. Its objective is to assess the progress achieved in multilingual data access and to promote solutions addressing the unique challenges associated with multilingual and cross-modal information retrieval. Within this framework, ImageCLEF was initiated in 2003 as part of the CLEF initiative, focusing specifically on supporting the evaluation of systems designed for the automatic annotation of images with concepts, multimodal information retrieval based on both textual and visual content, and multilingual retrieval of annotated images.
ImageCLEF has consistently attracted participation from both academic institutions and industry researchers across diverse fields including visual information retrieval, computer vision, natural language processing, and medical informatics. The 2017 ImageCLEF campaign introduced several new evaluation tasks, among which the automatic generation of medical image captions was featured for the first time [
17,
18]. Each year, the dataset provided for these tasks has been updated to reflect evolving research needs and technological advancements.
1.8.1. ImageCLEF Caption Detection Task 2017
The ImageCLEF 2017 Caption Detection task was divided into two subtasks: concept detection and caption prediction. Participants first identified relevant UMLS concepts in biomedical images and then generated intelligible captions based on these concepts and image features. Capturing concept interactions was crucial to reconstruct the original captions beyond simply detecting individual visual elements.
The training set contained 164,614 biomedical images extracted from PubMed Central, with corresponding UMLS concepts for concept detection and image–caption pairs for caption prediction. Validation and test sets included 10,000 images each. Evaluation metrics were the F1-score for concept detection and the BLEU score for caption prediction. Performance summaries are shown in
Table 3,
Table 4,
Table 5 and
Table 6:
Sadid A. Hasan et al. [
19] achieved the best results in the caption prediction task using an encoder–decoder architecture combining CNNs for feature extraction and RNNs with attention mechanisms for caption generation. They fine-tuned VGG19 on the ImageCLEF dataset and trained the model using stochastic gradient descent with the Adam optimizer and dropout regularization.
For concept detection, Asma Ben Abacha et al. [
20] employed multilabel classification with CNNs and Binary Relevance-Decision Trees (BR-DTs), using GoogleNet for feature extraction. They also explored various visual descriptors such as CEDD and FCTH [
17], finding that CEDD yielded the best performance in their experiments.
Leonidas Valavanis and Spyridon Stathopoulos [
21] proposed a probabilistic k-nearest neighbor (PKNN) approach, combining bag-of-visual-words (BoVW) and Quantized Bag of Colors (QBoC) models, achieving second-best performance without external resources. Dense SIFT features and late fusion techniques were used to enhance retrieval and prediction performance.
Overall, the 2017 task demonstrated the effectiveness of deep learning, information retrieval, and ensemble methods for biomedical image captioning.
1.8.2. ImageCLEF Caption Detection Task 2018
The 2018 ImageCLEF Caption Detection task, a continuation of the 2017 edition [
22], focused on clinical descriptions to limit dataset diversity. External data usage was restricted, and the challenge emphasized identifying general topics from biomedical images based on large-scale training data. The task maintained two subtasks: concept detection, requiring automatic extraction of UMLS CUIs, and caption prediction, generating descriptive captions (
Table 7 and
Table 8):
Several methods were developed for concept detection. One approach applied a traditional bag-of-visual-words model using ORB descriptors, with keypoints clustered via k-means (k = 4096) using FAISS [
23]. Generative Adversarial Networks (GANs) [
24] and their hybrid autoencoding variants [
25] were also explored to learn latent visual features. Open-source tools like LIRE were used to index images, combined with Latent Dirichlet Allocation (LDA) for concept grouping.
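A minimal sketch of such a bag-of-visual-words pipeline is shown below, assuming OpenCV for ORB descriptor extraction and FAISS for k-means clustering; the file handling, feature counts, and codebook size are illustrative choices rather than the submitted systems' exact configuration.

```python
import cv2
import faiss
import numpy as np

def orb_descriptors(image_paths, n_features=500):
    """Collect ORB descriptors (cast to float32) from a list of grayscale images."""
    orb = cv2.ORB_create(nfeatures=n_features)
    all_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = orb.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc.astype(np.float32))
    return np.vstack(all_desc)

def build_codebook(descriptors, k=4096, n_iter=20):
    """Cluster descriptors into k visual words with FAISS k-means."""
    kmeans = faiss.Kmeans(d=descriptors.shape[1], k=k, niter=n_iter, verbose=False)
    kmeans.train(descriptors)
    return kmeans

def bovw_histogram(descriptors, kmeans, k=4096):
    """Quantize one image's descriptors and return a normalized visual-word histogram."""
    _, assignments = kmeans.index.search(descriptors, 1)
    hist = np.bincount(assignments.ravel(), minlength=k).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```

The resulting per-image histograms can then be fed to any multilabel classifier or used for retrieval, which is the general role such bag-of-visual-words features played in these submissions.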
In total, fifteen runs were submitted [
26]. ImageSem achieved the best performance, reaching an F1-score of 0.0928 in concept detection and a BLEU-Score of 0.2501 in caption prediction. Their method, based on image retrieval and transfer learning, highlighted challenges in fully capturing biomedical image semantics.
1.8.3. ImageCLEF Caption Detection Task 2019
The ImageCLEF 2019 Caption Detection task [
27], third in the series after 2017 and 2018, focused solely on radiology images mined from PubMed Central to reduce label noise. A single subtask required detecting UMLS concepts from images. The training set included 56,629 images, with 14,157 for validation. Teams mainly applied deep learning approaches such as CNNs, RNNs (especially LSTM), adversarial autoencoders, and transfer learning models. The evaluation used F1-scores averaged across 10,000 test images (
Table 9).
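To make this evaluation protocol concrete, a simple sketch of the sample-wise F1 computation (averaging per-image F1 over the test set, as the official evaluation does) is given below, with toy predictions in place of real system output:

```python
def image_f1(predicted, ground_truth):
    """F1 between two sets of concept identifiers for a single image."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted and not ground_truth:
        return 1.0
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Toy example: predicted vs. ground-truth concept sets for three test images.
predictions  = [{"C0040405", "C0024109"}, {"C0024485"}, set()]
ground_truth = [{"C0040405"},             {"C0024485"}, {"C2073519"}]

mean_f1 = sum(image_f1(p, g) for p, g in zip(predictions, ground_truth)) / len(ground_truth)
print(f"Mean per-image F1: {mean_f1:.3f}")
```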
AUEB NLP’s top-ranked systems combined CNN image encoders with retrieval and multilabel classification approaches. Their best model adapted CheXNet [
3] using DenseNet-121, extending it to predict 5528 concepts by replacing the original output layer, applying sigmoidal activation, and optimizing thresholding at 0.16 [
28]. Another system combined CheXNet probabilities with k-NN retrieval scores, and a VGG-19 based model ranked fifth.
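A hedged sketch of this kind of adaptation, using torchvision's DenseNet-121, is shown below; it illustrates the general approach (replace the 1000-way head, apply a sigmoid per concept, threshold the probabilities) rather than the cited team's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 5528
THRESHOLD = 0.16  # decision threshold reported for the adapted model

# Start from an ImageNet-pretrained DenseNet-121 and replace the classification
# head with a multilabel output layer, one logit per UMLS concept.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, NUM_CONCEPTS)

def predict_concepts(images: torch.Tensor) -> torch.Tensor:
    """Return a binary concept matrix (batch x NUM_CONCEPTS) via sigmoid + threshold."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(images))
    return (probs >= THRESHOLD).int()
```

Training such a head would typically minimize a binary cross-entropy loss over the multi-hot concept vectors, with the threshold tuned on the validation set.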
Jing Xu et al. [
29] proposed a hybrid CNN-LSTM framework with attention mechanisms. Features were extracted using a fine-tuned ResNet-101 encoder, followed by an LSTM decoder using attention to dynamically focus on image regions during concept prediction. Their method ranked second overall.
ImageCLEF 2019 confirmed the effectiveness of combining transfer learning, retrieval, and attention-based models for biomedical concept detection.
3. Experiments and Results
3.1. Dataset Exploration
Figure 5 shows the distribution of concepts in the ImageCLEF dataset, highlighting the 10 most frequent concepts.
The most frequent categories are “Tomography, Emission-Computed” (29%) corresponding to nuclear medicine imaging techniques such as PET and SPECT, and “X-Ray Computed Tomography” (29%), followed closely by “Diagnostic Radiologic Examination” (27%). These three categories dominate the dataset, indicating a strong prevalence of cross-sectional imaging techniques (CT and general diagnostic radiology). Other significant concepts include “Magnetic Resonance Imaging” (17%) and “Ultrasonography” (12%), confirming the relevance of MRI and ultrasound imaging within the dataset. Less frequent categories, such as “Angiogram” (7%), “Heart Atrium” (2%), “Intestines” (2%), “Mediastinum” (2%), and “Both Kidneys” (2%), represent more specific anatomical or procedural concepts.
This distribution highlights the imbalance present in the dataset, with a few dominant concepts and many rare ones, which needs to be addressed carefully in the model training and evaluation phases.
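For reference, the frequency analysis behind Figure 5 can be reproduced with a few lines of pandas, assuming a training CSV with hypothetical columns image_id and cuis (semicolon-separated concept identifiers); the actual file name and layout of the ImageCLEF annotations may differ.

```python
import pandas as pd

# Hypothetical layout: one row per image, concepts separated by ';'.
df = pd.read_csv("train_concepts.csv")          # columns: image_id, cuis
concepts = df["cuis"].str.split(";").explode()

counts = concepts.value_counts()
print(counts.head(10))                           # the 10 most frequent concepts
print((counts == 1).sum(), "concepts occur only once")  # long-tail / imbalance check
```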
3.2. Experimental Outcomes of Deep Learning Algorithms
Table 11 presents the experimental results obtained from different deep learning models under various settings. The ImageCLEF dataset, consisting of 3047 concepts, was used throughout the experiments. The initial experiments involved training a DenseNet121 model with transfer learning, utilizing ImageNet pre-trained weights and setting the number of epochs to 10. Training on a CPU-based system required approximately 15 days and achieved an F1-score of 0.33. Subsequently, DenseNet121 was trained without transfer learning under the same epoch setting on a GPU-based system, which took approximately one month to complete and yielded an F1-score of 0.21. Experiments were also conducted with a ResNet101 model using transfer learning (ImageNet pre-trained weights) and three epochs. Training on a CPU-based system required about seven days, resulting in an F1-score of 0.21. When ResNet101 was trained without transfer learning on a GPU-based system (three epochs), training lasted about 20 days, and the F1-score obtained was 0.17.
Finally, the VGG19 model was trained on the dataset without transfer learning, achieving an F1-score of 0.265. Based on these results, DenseNet121 with transfer learning was selected as the optimal approach, as models with fewer layers or without transfer learning consistently yielded lower performance.
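For illustration, a minimal PyTorch-style sketch of such a transfer learning setup is given below; the framework choice, optimizer, and hyperparameters are assumptions for exposition, not the exact configuration used in these experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_model(num_concepts=3047, transfer_learning=True):
    """DenseNet-121 backbone; ImageNet weights when transfer learning is enabled."""
    weights = models.DenseNet121_Weights.IMAGENET1K_V1 if transfer_learning else None
    model = models.densenet121(weights=weights)
    model.classifier = nn.Linear(model.classifier.in_features, num_concepts)
    return model

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of multilabel training with binary cross-entropy over concept vectors."""
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    for images, targets in loader:          # targets: multi-hot float tensors
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

model = build_model(transfer_learning=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for epoch in range(10): train_one_epoch(model, train_loader, optimizer)
```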
Table 12 shows the experimental outcomes of different deep learning models under alternative settings. We initially used the ImageCLEF dataset of 3047 concepts. However, we did not obtain any meaningful results, so we reduced the number of concepts to 20. We started by training the base DenseNet121 model with transfer learning, i.e., ImageNet pre-trained weights for DenseNet121, with the number of epochs set to 3. Training the model on a CPU-based system took almost three days and yielded an F1-score of 0.89.
To better illustrate the model’s behavior in predicting medical concepts from radiological images, we provide an example in
Figure 6. The ground truth labels for the image are also shown for comparison. As can be seen, the model correctly identifies the major anatomical structures (“Chest X-ray”, “Lung”, “Effusion”) with high confidence, and additionally predicts the concept “Opacity”, which was not annotated in the ground truth but is a plausible finding in chest radiographs.
This significant improvement in classification performance when focusing on the 20 most frequent concepts is mainly attributed to the higher frequency of these classes, which alleviates the few-shot learning problem. Although some label noise may be present in the dataset, it was not identified as the primary limitation in this study. The main challenge encountered was the scarcity of examples for many rare classes. The ResNet101 model was subsequently trained using transfer learning, employing ImageNet pre-trained weights and setting the number of epochs to 3. Training on a CPU-based system required approximately three days and resulted in an F1-score of 0.88.
The modality-wise F1-score of the DenseNet121 model was calculated, starting with the DRAN images corresponding to the angiogram modality. However, no meaningful results were obtained. Due to time constraints, this process was discontinued, as generating new CSV files proved to be complex and time-consuming. In this analysis, the modality-wise F1-score refers to the evaluation across seven modalities present in the training, validation, and testing datasets. The goal was to assess the model’s learning performance for each individual modality.
Figure 7 shows the confusion matrix of the testing phase. We generated multiple testing subsets and selected one from DRXR, which contains mostly x-ray images; this confusion matrix is the result of that testing subset. As can be seen in the confusion matrix, label CUI2 has a high value assigned to it: CUI2 stands for X-Ray Computed Tomography, and since all images in this subset belong to this specific modality, it receives a high value. This result was obtained by training the DenseNet121 model.
Although the global F1-score achieved by the DenseNet121 model remains high,
Figure 8 clearly highlights a strong class imbalance. Most of the correct predictions are concentrated in a few dominant concepts (such as CUI1 and CUI2), while several other classes are underrepresented or not predicted at all. To provide a more balanced evaluation of the model’s performance, we also calculated the Macro-F1-score, which was 78% for DenseNet121 and 77% for ResNet101. These values better reflect the impact of class imbalance on model performance.
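For completeness, the micro- and macro-averaged F1 scores discussed here can be computed with scikit-learn as follows (toy multi-hot arrays are shown in place of the real predictions):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot matrices: rows = images, columns = concepts.
y_true = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 0],
                   [0, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 0]])

micro = f1_score(y_true, y_pred, average="micro")   # dominated by frequent concepts
macro = f1_score(y_true, y_pred, average="macro")   # weights every concept equally
print(f"Micro-F1: {micro:.3f}, Macro-F1: {macro:.3f}")
```

The gap between the two averages is precisely what reveals the class imbalance described above.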
Figure 8 presents the training and validation accuracy and loss curves obtained during the training of DenseNet121 on the ImageCLEF dataset. The graph indicates that reasonable accuracy could not be achieved without applying transfer learning. Increasing the number of epochs from 10 to 20 did not produce a significant improvement in the results. This analysis highlights the importance of transfer learning in the context of this study.
Figure 9 presents the training and validation accuracy and loss curves obtained during the training of ResNet101 on the ImageCLEF dataset. The graph demonstrates that good accuracy can be achieved by applying transfer learning. This analysis further highlights the importance of transfer learning for this study. In this experiment, the number of epochs was set to 2 to control the increase in time complexity associated with a higher number of epochs.
Figure 10 shows the training and validation accuracy and loss graph when VGG19 was trained on the ImageCLEF dataset. This graph shows that using transfer learning can achieve good accuracy, but also that this model is overfitting. Here the number of epochs was set to 1, because the time complexity increases with the number of epochs.
Although the models achieved reasonable accuracy, it is important to note that the training processes were limited to a very small number of epochs (2 for ResNet101 and 1 for VGG19) mainly due to time complexity constraints.
As a result, the training and validation curves shown in
Figure 9 and
Figure 10 do not fully represent complete convergence.
In future developments, we plan to extend the number of training epochs and adjust the early stopping criteria to allow a more complete model convergence and performance stabilization.
3.3. Comparison with Baseline Models
To provide clearer context for the obtained results, the performance of the deep learning models was compared with the official baseline performance reported in the ImageCLEFmed 2020 leaderboard. The comparison includes precision, recall, F1-score, and standard deviation across the models. The results are summarized in
Table 13.
As shown in
Table 13, the proposed DenseNet121 model using transfer learning achieved a significant improvement compared to the official ImageCLEF baseline, particularly in terms of F1-score (89.0% vs. 37.0%). Moreover, precision and recall values were consistently higher across all evaluated models. In contrast, the VGG19 model trained without transfer learning on the full set of 3047 concepts achieved lower performance, with an F1-score of 27.0%. The relatively low standard deviations confirm the stability and robustness of the training process.
4. Conclusions
DenseNet121 outperformed the other deep learning and machine learning models evaluated. Caption detection on radiology images was conducted using deep learning models, achieving a highest F1-score of 0.89. The proposed model demonstrated performance superior to approaches published on comparable datasets in recent years. Although direct comparison is limited due to differences in experimental setups, the results are generally comparable to those reported in the literature. Efforts were also made to optimize outcomes by minimizing time complexity and reducing resource consumption, aiming to achieve good results on a CPU-based system.
4.1. Future Directions
The concepts, also called CUIs, were successfully detected to accurately label the radiology images. The future direction to this work is categorized into two main types as given below.
Future research could focus on training separate models for each imaging modality represented in the dataset. Given the high diversity among modalities, such as x-ray, MRI, and CT scans, specialized models may capture modality-specific features more effectively than a unified model. After training, the outputs from each modality-specific model could be combined through ensemble learning strategies to improve overall classification performance. This approach could also help identify which models perform best for each modality type, providing deeper insights into how modality characteristics affect learning outcomes and guiding the development of more targeted diagnostic tools.
In this study, we reduced the number of concepts to the 20 most frequent ones to address extreme class imbalance and ensure a stable experimental framework. However, future work will focus on expanding the classification task to include a broader and more diverse set of UMLS concepts. This will allow for a more realistic evaluation of model performance in clinical applications. Considering the severe imbalance and the few-shot learning challenges associated with rare medical concepts, future directions also involve exploring advanced learning strategies such as few-shot learning, meta-learning techniques, and specialized data augmentation methods tailored for multilabel classification problems.
4.2. Limitations
Although the proposed system achieved strong performance in the automatic generation of radiology image captions, several limitations must be acknowledged regarding its potential clinical application. First, the model was developed and evaluated exclusively on the ImageCLEF2020 dataset, which, despite its diversity, may not fully capture the variability of real-world clinical radiology images from different institutions, imaging devices, and patient populations. Furthermore, to address the challenges of class imbalance and computational complexity, the analysis was restricted to the 20 most frequent medical concepts. While this strategy significantly improved the model's F1-score, it limited the ability to detect rare but clinically critical findings, potentially reducing its applicability in comprehensive diagnostic workflows. Another important limitation is the interpretability of the system. Deep learning models such as DenseNet121 operate as black-box systems, making it difficult to provide transparent explanations for each prediction. This may reduce clinical trust and acceptance, especially in contexts requiring high accountability. Moreover, despite achieving good performance metrics, even small errors in automatic labeling could propagate into clinical reports and introduce diagnostic risks if not properly supervised by expert radiologists. Finally, although efforts were made to optimize computational efficiency, real-time deployment in clinical environments may require substantial GPU resources, which may not always be available. Future research will focus on expanding the range of detectable concepts, validating the system on external clinical datasets, improving interpretability mechanisms, and evaluating its integration into clinical practice through user-centered studies.