In this study, a dataset consisting of 13,879 melanoma images (11,879 training and 2,000 test data points) was used. In the first stage, the DenseNet121, InceptionV3, ViT, and Xception models were trained separately, and the classification performance of each was evaluated in detail on the test dataset. Then, different weights were assigned based on the classification accuracy rates of these models, and the weighted averaging Ensemble learning method was applied. Thus, an Ensemble model expected to combine the strengths of the different architectures was obtained, and the results were compared. In the next stage, CBIR was performed independently with each model, and in the last stage, the aim was to further improve CBIR performance by combining the models with the feature-level fusion approach. In this context, both the classification and CBIR performances of the four models were compared, and it was shown that higher classification and CBIR results could be achieved by combining their strengths through Ensemble methods.
4.1. Performance Metrics
Accuracy measurements from the model performance evaluation metrics were used to compare results and make better choices. Accuracy, precision, recall, F1-score, and the ROC curve are important metrics used in the evaluation of classification models, particularly in ML and DL. These metrics help assess the performance of a model in predicting outcomes. They are calculated based on four fundamental values: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [42]. The confusion matrix presented in Table 1 is used to assess the classification performance through these metrics. In addition, the CBIR performance of the models was measured separately using feature vectors, and the CBIR performance of the Ensemble model was evaluated with the developed multi-model approach. Extensive experiments were conducted to examine the effects of the chosen parameters and the developed models on classification and CBIR performance, and the obtained results are presented comparatively.
Accuracy measures the overall proportion of correct predictions across all classes. This is mathematically calculated in Equation (3):
Precision is the proportion of true positive predictions among all of the model’s positive predictions. It indicates the model’s ability to avoid false positives. This is mathematically calculated in Equation (4):
Recall measures the percentage of correct positive predictions among all real positive instances in the dataset. It indicates the model’s ability to capture all positive instances. This is mathematically calculated in Equation (5):
The harmonic mean of recall and precision is known as the F1-score. It offers a measure that strikes a balance between recall and precision by taking both false negatives and false positives into account. This is mathematically calculated in Equation (6):
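The equation bodies are not reproduced in this excerpt; Equations (3)–(6) have the standard forms, stated here for reference:

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3} \\
\text{Precision} &= \frac{TP}{TP + FP} \tag{4} \\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{5} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}
\end{align}
```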
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are graphical tools for assessing how well classification models work. The AUC, the area under the ROC curve, approaches 1 as model performance improves. Understanding these metrics is essential for understanding a classification model’s performance. While accuracy gives a general idea of the model’s correctness, precision, recall, and F1-score provide information about how well the model can produce accurate positive predictions, capture all positive instances, and balance the two.
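To make the relationship between these values concrete, the sketch below computes the four metrics from thresholded scores and the AUC via the Mann–Whitney pair-counting formulation; the labels and malignancy scores are illustrative, not from the study’s dataset:

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute accuracy, precision, recall, F1, and AUC for binary labels."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # AUC via the Mann-Whitney statistic: the fraction of (positive, negative)
    # score pairs in which the positive instance is ranked higher.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": wins / (len(pos) * len(neg)),
    }

# Illustrative example: three malignant (1) and three benign (0) images.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.7, 0.3, 0.2])
```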
The metrics used for the performance evaluation of the CBIR method are explained below:
PR Curve: This is a graphical method widely used to measure the performance of CBIR systems. It shows the relationship between precision and recall values. This curve expresses how accurately the system retrieves the relevant images in the dataset (precision) and how many of all relevant images it can retrieve (recall).
AP: This summarizes the relationship between precision and recall values in the ranking of the retrieval results obtained for a specific query image as a single numerical value. A high AP value indicates that the system can successfully rank the relevant images.
mAP: This is the average of AP values calculated for multiple query images. This value is a single metric that summarizes the performance of the overall CBIR system. A high mAP value indicates that the CBIR system consistently provides successful results.
Cosine Similarity: This is a mathematical metric used to measure the similarity of the feature vectors generated for two images. For the non-negative deep features used here, cosine similarity takes values between 0 and 1; as the value approaches 1, the features of the two images are more similar, and therefore the system successfully retrieves similar images. CBIR performance evaluation metrics play an important role in presenting the effectiveness of the models and the success rate of the developed feature-level fusion method in a detailed and comparative manner.
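As a minimal sketch, cosine similarity between two feature vectors can be computed as follows (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical feature vectors score 1.0; orthogonal ones score 0.0.
same = cosine_similarity([0.2, 0.8, 0.5], [0.2, 0.8, 0.5])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```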
4.2. Performance Evaluations
In the performance evaluation of the developed melanoma classification method, effectiveness was determined by tests performed on the models used. In this respect, the performance evaluation was based on the accuracy rates of the models. In addition, the precision, recall, and F1-score metrics of the models were calculated from the confusion matrix for a more detailed performance evaluation.
The training process of all models was carried out using fixed hyperparameters in order to obtain comparable results. In this context, 32 was selected as the batch_size, 40 epochs for the training period, and 0.001 for the learning rate. The training process using fixed hyperparameters provided the opportunity to compare the performance of different architectures objectively, while allowing the obtained results to reveal the differences specific to the model architecture more clearly.
First of all, the necessary optimizations were made with the TL method, and the classification processes were performed with the DenseNet121, InceptionV3, ViT, and Xception architectures. The confusion matrices of the performance obtained by these four models are shown in
Figure 9. The ROC curves of these models are shown in
Figure 10. The results in terms of different performance metrics obtained through confusion matrices are given in
Table 2.
As given in
Table 2, all performance metrics obtained from the confusion matrices of the four models are evaluated holistically. In addition to providing high-performance metrics such as 94.50% accuracy, 93.90% precision, 95.04% recall, and 94.47% F1-score, the DenseNet121 model also demonstrated its classification power with an AUC value of 0.99 obtained in the ROC curve analysis. Low misclassification rates in the confusion matrix indicate almost complete discrimination of both benign and malignant samples. The InceptionV3 model has 91.20% accuracy, 93.20% precision, 89.62% recall, and 91.37% F1-score values, and the ROC curve analysis reveals that the model exhibits sufficient discrimination with an AUC of 0.97; however, as seen in the confusion matrix, more false negatives (
FN) were observed, especially for the malignant class. The Xception model offers a balanced performance with 93.80% accuracy, 93% precision, 94.51% recall, and 93.75% F1-score, and shows high discrimination in the ROC curve with an AUC value of 0.98. The distribution in the confusion matrix shows that Xception has a low error rate in distinguishing benign and malignant samples. The ViT model, on the other hand, exhibits lower performance compared to other models with 88.25% accuracy, 90.50% precision, 86.60% recall, and 88.50% F1-score; while the ROC curve analysis gives an AUC value of 0.95, the higher false negative rate observed in the confusion matrix suggests that the model may be inadequate in the detection of malignant cases, which are of critical importance especially in clinical applications. In light of these data, performance metrics supported by both ROC curve and confusion matrix analyses show that the DenseNet121 and Xception models provide superior and clinically reliable results for melanoma detection. Especially in applications where the false negative rate is critical, the high sensitivity and specificity values of these models provide advantages for their potential clinical integration.
After the classification operations were performed with the four TL models, the aim was to create a single combined model by assigning different weights to each model based on the classification performances of the individual models with the averaging Ensemble learning method. This approach aims to reduce the impact of the weaknesses of individual models while highlighting their strengths by blending the performance metrics obtained separately by the DenseNet121, InceptionV3, ViT, and Xception architectures. Thus, the Ensemble model has the potential to offer higher overall accuracy and consistency by minimizing the classification errors observed in most of the models used alone. The confusion matrix and ROC curve of the Ensemble learning model are shown in
Figure 11, and the performance metrics are given in
Table 3.
The Ensemble learning approach shows a significant improvement when compared to the performance metrics of individual models. Among the individual models, the highest accuracy is 94.50% with DenseNet121, and the highest recall is around 95% in DenseNet121 and Xception models, while the Ensemble model stands out with 95.25% accuracy and 96.22% recall values. In addition, the 94.20% precision and 95.20% F1-score values of the Ensemble model generally exceed the metrics of all individual models, thus minimizing the classification errors and providing a more balanced and reliable distinction between benign and malignant samples. These findings reveal the potential of the Ensemble learning strategy to reduce false negative rates and increase the overall classification success, which are critical in clinical applications, by compensating for individual model errors.
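The exact weighting scheme is not reproduced in this excerpt; the sketch below assumes weights proportional to each model’s test accuracy (Table 2) and averages the per-model malignancy probabilities, which are illustrative values:

```python
import numpy as np

# Assumed weighting: each model's weight is proportional to its test accuracy
# (Table 2); the study's exact weight assignment is not shown in this excerpt.
accuracies = {"DenseNet121": 0.9450, "InceptionV3": 0.9120,
              "ViT": 0.8825, "Xception": 0.9380}
weights = np.array(list(accuracies.values()))
weights = weights / weights.sum()  # normalize so the weights sum to 1

# Per-model malignancy probabilities for three images (illustrative values);
# one column per model, in the order of the dict above.
probs = np.array([
    [0.92, 0.85, 0.70, 0.90],
    [0.10, 0.40, 0.30, 0.15],
    [0.55, 0.60, 0.52, 0.48],
])
ensemble_prob = probs @ weights            # weighted average per image
ensemble_pred = (ensemble_prob >= 0.5).astype(int)
```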
In this study, in order to examine the benefits of the CBIR technique in medical image analysis, the comparative CBIR performances of four different TL-based architectures, namely DenseNet121, InceptionV3, ViT, and Xception, were evaluated for melanoma detection. In the CBIR operations performed on the feature vectors obtained by the models, 20 queries were randomly selected for each model and evaluated with different metrics such as PR curves, mAP, AP, five similar images, and cosine similarity scores of the images. Thus, the ability of each architecture to detect similar images and the potential to minimize error rates in clinical applications were revealed. This approach evaluates the similarity-based search performances of different DL architectures from a holistic perspective and provides important findings for determining the most appropriate model in medical image analysis.
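The AP and mAP metrics used above can be sketched as follows, given binary relevance judgments for each ranked retrieval list (the function names are hypothetical helpers, not the study’s code):

```python
def average_precision(relevance):
    """AP for one query: mean of precision@k taken at each relevant rank.

    `relevance` is the ranked retrieval list as 0/1 relevance judgments.
    """
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """mAP: the mean of the per-query AP values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)
```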
Figure 12 shows the PR curves and AP values given for one of twenty random queries for the CBIR performance of DenseNet121, InceptionV3, ViT, and Xception models.
Figure 13 shows five similar images and the resulting query scores. The CBIR performance results of the models for the mAP values of 20 queries are given in
Table 4.
As seen in Figure 13, the DenseNet121 and Xception models generally retrieved visually similar lesions correctly, while InceptionV3 and ViT produced erroneous matches in some cases. For example, InceptionV3, in the fourth image (a lesion with an irregular, low-contrast border and subtle color variegation), prioritized global color histograms and overlooked fine border details, matching it to a benign example. ViT, with its patch-based attention mechanism, overfocused on specular highlights and glare artifacts in the same image; these high-frequency patterns overshadowed the true tissue texture. Both models also confused highly heterogeneous pigmented lesions, characterized by a "speckled" appearance, with benign samples, generating incorrect similarities.
These findings suggest two avenues for improvement:
Preprocessing to remove hair and glare artifacts, for example, using inpainting or DullRazor.
Employing boundary-aware descriptors (e.g., explicit contour or texture filters) that favor irregular edges over bulk color statistics.
This analysis aims to expose the models’ limitations and guide future work toward more robust CBIR approaches.
Table 4.
CBIR performance results of the models for the mAP values of 20 queries.
| Models | Mean Average Precision (mAP) |
|---|---|
| DenseNet121 | 0.9496 |
| InceptionV3 | 0.7922 |
| ViT | 0.7539 |
| Xception | 0.9171 |
DenseNet121, with its high AP value of 0.9698 obtained in the sample query and its mAP value of 0.9496 calculated over 20 queries, shows that it produces detailed and deep feature maps that represent the significant malignant features in the query image with high accuracy. The high initial precision value in the PR curve and the consistent performance in the wide recall range indicate that DenseNet121 clearly distinguishes between classes by minimizing the FP rate.
The Xception model also offers a similarly strong performance; the mAP value of 0.9171 shows that the model can efficiently separate features thanks to the depth-separated convolutional layers. The PR curve shows a slight decrease as recall increases while maintaining a high precision value; this shows that the model ranks the correct malignant examples in the top ranks during retrieval.
InceptionV3, with its mAP value of 0.7922, exhibits a significant performance decrease compared to the other models. The PR curve shows a sharp drop in precision in high-recall regions, suggesting that the model may be insufficient in distinguishing benign from malignant features; accordingly, the model returned a benign image for the malignant query during retrieval. Technically, this indicates that the boundary between the two classes in the model’s feature space is more blurred.
ViT offers the lowest performance with a mAP of 0.7539. Transformer architectures generally require larger datasets to generalize well, which may explain this gap. In the PR curves, the significant decrease in precision as recall increases shows that ViT cannot deliver the expected discriminatory power in its feature representations.
In summary, DenseNet121 and Xception exhibit reliable performance in clinically critical retrieval tasks by producing detailed and discriminatory features for sample malignant query images with high mAP values and stable PR curves. On the other hand, InceptionV3 and ViT show insufficient feature discrimination, especially with decreases in precision in high-recall regions and the presence of benign samples in the retrieval results despite the sample malignant query. These technical data highlight the importance of choosing models that provide high discrimination and low false positive rates in CBIR applications.
After evaluating the CBIR performance of individual models, the proposed feature-level fusion approach was applied to combine the strengths of DenseNet121, InceptionV3, ViT, and Xception models. Within the scope of this method, an enriched Ensemble feature vector was created by combining the feature vectors obtained from each model horizontally. The CBIR performance of the multi-model fusion approach was evaluated on 20 randomly selected query images, and mAP values were measured using cosine similarity in the query–gallery similarity calculation.
Figure 14 shows the PR curves and AP values given for one and then all twenty random queries for the CBIR performance of the multi-model fusion approach.
Figure 15 shows five similar image retrievals and their scores for the CBIR performance of the multi-model fusion approach. In addition, the mAP values of the 20 queries for the CBIR performance of the multi-model fusion approach are given in
Table 5.
With the developed multi-model (feature-level) fusion, a rich and detailed representation was provided in the CBIR task by using the strengths of the four architectures, namely DenseNet121, InceptionV3, ViT, and Xception. The specially defined get_feature_model function was used to obtain the outputs of the individual models at the intermediate layers before the last classification layer. In this way, the deep feature vectors produced by each model could be used as discriminative representations of objects rather than for classification. With the feature-level fusion method, the feature vectors obtained from each model were directly combined (concatenated) horizontally. The technical advantages of this approach are as follows:
Each model can capture different features of the image thanks to different architectural structures and learning capacities. While DenseNet121 and Xception provide high-discrimination features, ViT and InceptionV3 can provide different information at certain scales or textures. This complementary structure enables a broad perspective that a single model cannot capture.
Combining features obtained from different architectures allowed the model to learn both general and specific discriminative features. This allowed the system to produce consistent and high-performance results against different types of queries.
In the high-dimensional feature space formed after the concatenation process, similarity measurements (cosine similarity) naturally highlighted the effect of more discriminative components. Thus, the features produced by models with higher performance indirectly became dominant in the collective representation.
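A minimal sketch of this fusion-and-retrieval pipeline is shown below. The random vectors stand in for the get_feature_model outputs, and the per-model L2 normalization is an added assumption (the study concatenates the vectors directly) to keep backbones with different output scales comparable:

```python
import numpy as np

def l2_normalize(v):
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)

def fuse_features(per_model_vectors):
    """Horizontally concatenate per-model feature vectors into one vector."""
    return np.concatenate([l2_normalize(v) for v in per_model_vectors])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Random stand-ins for the four backbones' deep-feature outputs (dim 8 each).
query_feats = [rng.random(8) for _ in range(4)]
# Gallery of five images; image 0 deliberately shares the query's features.
gallery_feats = [query_feats] + [[rng.random(8) for _ in range(4)]
                                 for _ in range(4)]

fused_query = fuse_features(query_feats)
fused_gallery = [fuse_features(f) for f in gallery_feats]
# Rank the gallery by cosine similarity to the fused query vector.
ranking = sorted(range(len(fused_gallery)),
                 key=lambda i: cosine(fused_query, fused_gallery[i]),
                 reverse=True)
```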
The obtained results clearly show how strong the CBIR performance of the proposed model is. Of the 20 randomly selected queries, only Query 12 has an AP value below 0.90, and the mAP calculated over all queries is approximately 0.9538, higher than that of any individual model; this shows that the Ensemble approach offers high accuracy and effective retrieval success in most cases. The PR curves show that most queries maintain high precision values over a wide recall range and exhibit a very stable performance. This indicates that the feature-level fusion approach, which successfully integrates the complementary features of different deep learning architectures, and the cosine similarity used as the similarity measure work effectively.
The fact that the cosine similarity scores obtained in the result returned by the model for the sample query image reach almost 1.0 reveals that the system succeeds in positioning similar images in the same vector space and thus can rank the most similar images to the query image from the dataset with high accuracy.
In this study, performance metrics such as classification accuracy and mean Average Precision (mAP) were reported based on a single, fixed train/test split. Due to this experimental setup, traditional statistical significance testing (e.g., McNemar’s test or paired
t-tests) was not performed, as such methods require multiple independent trials or resampling to yield meaningful variance estimates. Because the Kaggle Melanoma Cancer Image dataset has a fixed structure with a specific training/test split, we maintained the same layout for comparison with the previous literature. Therefore, implementing k-fold cross-validation was technically not feasible. However, to compensate for this shortcoming, our model’s performance is reported in detail using other metrics such as accuracy, precision, recall, F1-score (
Table 2 and
Table 3), ROC curves, confusion matrix, and mAP (
Table 4 and
Table 5). All of these metrics provide comprehensive insights into the model’s performance in both classification and content-based image retrieval. Additionally, to statistically demonstrate the reliability of the performance metrics in the test data, confidence intervals were calculated using bootstrap sampling on the obtained test results. Therefore, the 95% confidence interval (CI) was calculated based on the bootstrap sampling value (
n = 100 iterations) using Equation (7), based on the accuracy and mAP metrics. In Equation (7), the constant 1.96 represents the z-score for the 95% confidence level, x is the mean value, and s indicates the standard deviation of the bootstrap estimates.
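Assuming Equation (7) has the common normal-approximation form CI = x ± 1.96·s over the bootstrap resamples (the equation body is not reproduced in this excerpt), the procedure can be sketched as:

```python
import random

def bootstrap_ci(values, n_iter=100, z=1.96, seed=42):
    """95% CI for the mean of per-sample scores via bootstrap resampling."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_iter):
        # Resample with replacement and record the resample's mean metric.
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    m = sum(means) / len(means)
    s = (sum((x - m) ** 2 for x in means) / (len(means) - 1)) ** 0.5
    return m - z * s, m + z * s

# Per-image correctness (1 = correct) on a toy 100-image test set (~90% accuracy).
correct = [1] * 90 + [0] * 10
lo, hi = bootstrap_ci(correct)
```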
Table 6 shows the variation in the accuracy of the different models, with the corresponding standard deviation values obtained using bootstrap sampling.
According to
Table 6, the classification accuracy of the Ensemble model was 95.25% (95% CI: 95.18–95.32%), indicating high consistency across bootstrap samples. In contrast, the ViT model showed more variability with an accuracy of 88.25% (95% CI: 88.10–88.40%). Similarly, using Equation (7), 95% confidence intervals were calculated for the mAP values obtained by the different models, and the results are shown in
Table 7.
According to
Table 7, the mAP of the proposed Ensemble system was 0.9538 with a narrow 95% confidence interval (CI: 0.9532–0.9544), indicating high consistency in retrieval performance. In comparison, the ViT model yielded a lower and more variable mAP of 0.7539 (95% CI: 0.7521–0.7557), reflecting greater instability in precision across bootstrap samples.
Recent advancements in hybrid and prompt-based AI models have introduced novel opportunities for improving performance in data-scarce medical image analysis tasks. For instance, the MHKD framework demonstrates how multi-step hybrid knowledge distillation can be leveraged to maintain diagnostic accuracy even in low-resolution whole-slide images, which is highly relevant to practical clinical settings with suboptimal image quality [
43]. Similarly, the work by Zhang et al. on vision–language models shows that combining image features with semantic prompts can significantly enhance nuclei segmentation and classification, indicating the potential of cross-modal learning in histopathological contexts [
44]. Furthermore, the low-shot learning strategy explored in [
45] opens new avenues for reducing dependency on large annotated datasets. Integrating such prompt-based pre-training into melanoma classification systems could facilitate better generalization and performance in low-resource scenarios, particularly when dealing with rare lesion subtypes. These methodologies suggest promising directions for future extensions of our approach, especially in enhancing the robustness and scalability of ensemble-based classification and CBIR systems in dermatological applications.
In this study, we utilized only the publicly available “Melanoma Cancer Image” dataset from Kaggle to develop and evaluate our model. While this dataset enabled us to achieve high classification accuracy, it is a single-source, retrospective dataset with standardized imaging conditions, and it may not adequately capture real-world diversity in terms of skin tone variation, lighting conditions, imaging artifacts, and device-specific differences. As a future direction, we plan to perform external validation using dermoscopic images collected from the ISIC (International Skin Imaging Collaboration) archive and multiple clinical centers, capturing variability in dermatoscopic devices, patient skin types, and real-world imaging artifacts. This will allow us to further assess the generalizability, robustness, and clinical relevance of the proposed model in more heterogeneous and realistic environments [46].
In practice, our dual-task ensemble could be deployed as a plugin for existing dermoscopy workstations, streaming live dermoscopic video to provide both an on-screen malignancy probability and, alongside each capture, a ranked display of visually similar past cases [
47]. Such real-time decision support could accelerate diagnosis and boost confidence, especially for less experienced clinicians. Looking ahead, prospective, multicenter validation studies directly comparing our system’s predictions and retrievals against board-certified dermatologists, as well as integration trials within electronic health record environments to fully assess workflow impact and patient outcomes, could be conducted.
It is important to note that the proposed Ensemble-based model is designed as a diagnostic support system for melanoma detection rather than an all-encompassing solution ready for real-world deployment. Consequently, this study does not evaluate system-level performance aspects such as inference time and memory consumption. While the ensemble fusion of multiple models improves classification and retrieval accuracy, it also introduces architectural complexity that may not be optimal for real-time applications. In future work, we plan to address these limitations by applying model compression techniques such as pruning and quantization, along with deployment optimizations through TensorRT or ONNX. These steps will help reduce computational load and enhance the feasibility of the proposed system in practical clinical environments.