Article

Evaluating Machine Learning Techniques for Brain Tumor Detection with Emphasis on Few-Shot Learning Using MAML

1 Department of Business, University of Europe for Applied Sciences, 14469 Potsdam, Germany
2 Department of Artificial Intelligence, Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi 23460, Pakistan
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(10), 624; https://doi.org/10.3390/a18100624
Submission received: 17 July 2025 / Revised: 22 September 2025 / Accepted: 26 September 2025 / Published: 2 October 2025
(This article belongs to the Special Issue Machine Learning Models and Algorithms for Image Processing)

Abstract

Accurate brain tumor classification from MRI is often constrained by limited labeled data. We systematically compare conventional machine learning, deep learning, and few-shot learning (FSL) for four classes (glioma, meningioma, pituitary, no tumor) using a standardized pipeline. Models are trained on the Kaggle Brain Tumor MRI Dataset and evaluated across dataset regimes (100%→10%). We further test generalization on BraTS and quantify robustness to resolution changes, acquisition noise, and modality shift (T1→FLAIR). To support clinical trust, we add visual explanations (Grad-CAM/saliency) and report per-class results (confusion matrices). A fairness-aligned protocol (shared splits, optimizer, early stopping) and a complexity analysis (parameters/FLOPs) enable balanced comparison. With full data, Convolutional Neural Networks (CNNs)/Residual Networks (ResNets) perform strongly but degrade with 10% data; Model-Agnostic Meta-Learning (MAML) retains competitive performance (AUC-ROC ≥ 0.9595 at 10%). Under cross-dataset validation (BraTS), FSL—particularly MAML—shows smaller performance drops than CNN/ResNet. Variability tests reveal FSL’s relative robustness to down-resolution and noise, although modality shift remains challenging for all models. Interpretability maps confirm correct activations on tumor regions in true positives and explain systematic errors (e.g., “no tumor”→pituitary). Conclusion: FSL provides accurate, data-efficient, and comparatively robust tumor classification under distribution shift. The added per-class analysis, interpretability, and complexity metrics strengthen clinical relevance and transparency.

1. Introduction

Few-shot learning algorithms have transformed medical image classification, especially in cases where annotated data is sparse. Ouahab et al. [1] proposed a few-shot learning model combining an attention mechanism, reaching a remarkable accuracy of 92.44%. Their dataset comprises images of meningioma, glioma, and pituitary tumors, totaling 3064 images. By upgrading Prototypical Networks with 1 × 1 convolutions, they enhanced feature interactions, considerably raising model performance. In another significant study, Szucs et al. [2] created a Double-View Matching Network for classifying COVID-19 from X-ray images. This model efficiently exploited both image-space and feature-space views, achieving high classification accuracy across several hundred X-ray images. This dual approach boosted the model's ability to generalize from limited data, proving particularly valuable in the early phases of the pandemic. Alsaleh et al. [3] advanced the field further by applying few-shot learning to medical image segmentation using the 3D U-Net architecture within the Model-Agnostic Meta-Learning (MAML) framework. This approach demonstrated strong segmentation accuracy, particularly in tasks involving internal and external organ segmentation, even with few labeled instances. Singh et al. [4] introduced MetaMed, a gradient-based meta-learning approach optimized for few-shot medical image classification. This model excelled under high class imbalance, outperforming previous algorithms on several medical imaging datasets. Structure-aware, rigorously evaluated methods tend to generalize better under scarce labels [5]; our design follows this principle for clinical MRI. Nayem et al. [6] conducted a comprehensive examination of few-shot learning models including AffinityNet and Siamese Networks. Their work demonstrated the strong performance of these models in disease-type prediction and brain imaging modality recognition, particularly across datasets with high variation and low dimensionality. Metrics such as accuracy, precision, recall, and F1 score demonstrated the robustness of these models, even under constrained training-data settings. Successful few-shot approaches, including Prototypical and Siamese Networks, have proven valuable for medical imaging, delivering accurate diagnostics and credible clinical decision support even when training data are limited. Table 1 summarizes the characteristics of the current literature together with the innovation of the proposed work.

1.1. Gap Analysis

Despite substantial breakthroughs in few-shot learning models for medical image classification, several key gaps remain. Models such as Prototypical Networks and Siamese Networks have shown potential to handle small datasets well; however, there is a notable lack of extensive comparative studies. Such studies are necessary to test the performance of few-shot learning models against classic machine learning and deep learning models under standardized conditions. While numerous studies report high accuracy for few-shot models on certain tasks, direct comparisons with other models are limited, hindering the formulation of best practices for their application [4,8,15]. Moreover, some studies support the usefulness of few-shot learning in classifying brain tumors; however, results are often inconsistent across datasets and tumor classes. Recent advances have explored hybrid models that combine attention with task-specific representation learning to improve generalization under data scarcity. For example, Chen et al. [16] proposed an attention-enhanced few-shot recognition framework that integrates CNN–LSTM modules with hybrid attention mechanisms (ECBAM, AFFM). This design allows models to disentangle subtle visual differences with limited data, which is directly relevant to medical imaging tasks where tumors often vary subtly in shape, intensity, or boundary contrast. More extensive research is needed to validate the robustness and reliability of few-shot models across diverse and larger MRI datasets [2,3]. Another issue is that the factors affecting the performance of few-shot learning models in medical image classification are not well understood. Existing research tends to focus on technical elements such as model architecture and training methodology, without adequately evaluating the impact of data quality, class imbalance, and the specific characteristics of medical images on model performance [6,11]. Furthermore, interpretability remains a substantial difficulty for few-shot learning models: understanding how these models make judgments is vital for clinical acceptance and trust [4,12]. Addressing these limitations will require comparative assessments of few-shot learning models against classic approaches, broader evaluations on varied and larger datasets, and investigations of data quality, class imbalance, and image characteristics, together with improved interpretability of few-shot learning models. By targeting these topics, we can better understand and advance the application of few-shot learning models in medical image classification.

1.2. Problem Statement

This research addresses the comparative efficiency of few-shot learning models versus classical machine learning and deep learning models in classifying brain malignancies, such as meningioma, glioma, and pituitary tumors, from MRI images. Traditional machine learning models such as Naïve Bayes or Random Forests typically need large annotated datasets to perform well, which is often impractical in clinical applications: medical databases contain fewer instances than required and frequently cannot be made public for confidentiality reasons [17,18]. Although deep learning models such as Convolutional Neural Networks and Long Short-Term Memory networks have shown promise, they still suffer from overfitting and require large quantities of labeled data [7,14]. Few-shot learning models have been proposed as a practical solution because they can generalize effectively from small training sets; however, their performance varies across different medical image classification problems and remains insufficiently characterized [2,8]. To analyze these models in terms of accuracy, precision, recall, and F1 score, this research applies the Brain Tumor MRI Dataset from Kaggle to establish the most reliable and efficient approach to brain tumor classification [19]. In addition, this study explores how few-shot performance may be affected by factors such as class imbalance, data quality, and interpretability [20]. By evaluating these aspects, the study aims to indicate how effectively few-shot learning systems can be used in practice and to highlight areas that must be improved before they can classify medical images dependably.

1.3. Novelty of Our Work

This study is distinctive in its comprehensive approach to comparing few-shot learning models with traditional machine learning and deep learning models for the classification of brain tumors in MRI images. It addresses multiple important deficiencies identified in previous studies. In contrast to previous studies that often focus on a single model or limited comparisons, this research investigates multiple models, offering a thorough analysis of their performance on different types of tumors utilizing a consistent dataset available on Kaggle [19]. The contributions are as follows:
  • A systematic comparison is conducted on the accuracy, precision, recall, and F1 score of few-shot learning models like Prototypical Networks and Matching Networks against traditional models such as Naïve Bayes, Random Forest, CNNs, and LSTMs, offering a holistic view of model effectiveness [12,18].
  • The performance of few-shot learning algorithms in classifying different types of brain tumors, such as meningioma, glioma, and pituitary tumors, is thoroughly assessed, contributing to a deeper understanding of their suitability for critical medical tasks [8].
  • Factors affecting the performance of few-shot learning models, including data quality, class imbalance, and model interpretability, are studied to build more robust and trustworthy diagnostic tools [2]. In addition to comparing standard few-shot models, our study also considers interpretability and robustness in relation to existing attention-based enhancements. By referencing attention-augmented frameworks such as those described by Chen et al. [16], we justify the selection of MAML, MatchingNet, and ProtoNet as strong baselines while also acknowledging potential extensions through ECBAM or AFFM modules for future work in brain tumor classification.
This research not only builds on prior studies but also charts a path toward enhanced, interpretable, and generalizable models for medical image classification that ultimately improve diagnostic accuracy and patient care in healthcare settings [21,22]. The conclusions of this study are expected to benefit future research and initiatives in medical image analysis, with a particular focus on enhancing diagnostic precision and medical service delivery in clinical contexts.

1.4. Our Solutions

The value of this study rests on its comprehensive comparison of few-shot learning models with conventional machine learning and deep learning models for MRI-based brain tumor classification. To provide an extensive understanding of the benefits and limitations of each approach, models such as Prototypical Networks and Matching Networks are systematically analyzed alongside Naïve Bayes, Random Forest, CNNs, and LSTMs [11,12,14]. This research concentrates on essential evaluation criteria, namely accuracy, precision, recall, and F1 score, offering a holistic perspective on model performance [21,23]. Moreover, this study aims to build more dependable diagnostic tools by examining data quality, class imbalance, and model interpretability [24]. The conclusions of this work will help refine the application of these models and give guidelines for their use in real-world medical settings [14,25]. Additionally, the research addresses the efficiency and difficulty of deploying few-shot learning techniques in the healthcare domain [14,15]. Ultimately, this work aims to enhance accuracy and patient care by providing more effective medical image classification methods [14,20]. By bridging the identified gaps, this study provides guidance for future work and real-world applications, ensuring that these models can be efficiently deployed in clinical practice [24].

2. Materials and Methods

2.1. Dataset

We use the Kaggle Brain Tumor MRI Dataset [19] (7023 images: glioma 2263; meningioma 1822; pituitary 2026; no tumor 912) for training/validation/testing and BraTS for external validation. Source heterogeneity and class imbalance motivate balanced metrics and robustness checks.
Data preparation processes, including resizing, are advantageous in enhancing classification models, as emphasized by Gu et al. [15] and Alsubai et al. [14]. The breadth and versatility of this data collection enable thorough testing and provide deeper insights into the performance of various classification algorithms [22,26]. Example images of the four brain tumor classes, illustrating the diversity of the dataset, are given in Figure 1.

2.2. Overall Workflow

The initial phase of this research comprises the preprocessing of the “Brain Tumor MRI Dataset” from Kaggle, which contains 7023 MRI images classified into four categories: glioma, meningioma, no tumor, and pituitary tumor [19]. Because the original image sizes differ, the images are first resized to a uniform size to ensure consistency across the dataset; this resizing is crucial for maintaining model accuracy and facilitating effective training [14,15]. The dataset is then split into training, validation, and test sets at varied proportions (100%, 75%, 50%, 25%) to evaluate the models under diverse data availability scenarios. This stage ensures that the models are durable and can generalize effectively even with minimal data [14,20]. Data augmentation techniques, such as rotation, flipping, and scaling, are applied to enrich the dataset and increase the models’ capacity to generalize [11].
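To make the augmentation step concrete, the following is a minimal sketch using torchvision transforms. The pipeline and parameter values are illustrative assumptions consistent with the augmentation settings reported in Section 2.8 (±15° rotation, horizontal flipping, up to 10% zoom), not the authors’ exact implementation:

```python
# Illustrative augmentation pipeline (assumed parameters; cf. Section 2.8).
from torchvision import transforms

augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),           # MRI slices as single-channel images
    transforms.RandomRotation(degrees=15),                 # rotation within +/-15 degrees
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flipping
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # ~10% zoom in/out
    transforms.Resize((128, 128)),                         # uniform size across the dataset
    transforms.ToTensor(),                                 # scales pixel values to [0, 1]
])
# Applied per sample, e.g.: tensor = augment(pil_image)
```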
In the next stage, various models are implemented on this dataset, spanning machine learning, deep learning, and few-shot learning approaches. The first step examines standard machine learning algorithms, with an emphasis on Naïve Bayes, Random Forest, SVM, and KNN; their simplicity and ease of interpretation make them a strong baseline for comparison [18]. Next, deep learning models such as Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and ResNet are applied. These models excel at image classification because they automatically extract and learn features from the data [7,12]. Additionally, few-shot learning models, namely Prototypical Networks, Matching Networks, SimpleShot, Reptile, and MAML, are investigated, particularly because they succeed when training data is limited, a common circumstance in medical datasets [8]. Each model undergoes rigorous training and optimization on the training set, with careful hyperparameter tuning to achieve optimal performance [2]. Cross-validation is performed to verify that the models are robust and not overfitting [25].
In the final phase of this study, the trained models are evaluated on the test set using metrics such as accuracy, precision, recall, F1 score, ROC curve, AUC score, and the classification report [22,26]. This detailed evaluation highlights the strengths and weaknesses of each model across many scenarios. The impact of characteristics such as data quality, class imbalance, and interpretability on model performance is also studied, offering essential insights for practical application in clinical scenarios [24]. The models are then compared to establish the most effective strategy for classifying brain tumors. Additionally, the efficiency of each model is evaluated by analyzing both training and inference times, confirming its appropriateness for real-world applications [21]. The results from this study aim to guide future work toward higher accuracy and improved patient care [22,26]. This extensive review will assist in selecting the best models for practical deployment in medical imaging, ensuring they deliver both high performance and operational feasibility [4,20].
In conclusion, the methodology covers thorough procedures, including preprocessing, training, and analysis of multiple machine learning, deep learning, and few-shot models on the Brain Tumor Dataset, to identify the most effective classification method.
The methodology shown in Figure 2 spans model creation, dataset curation, and the subsequent training and evaluation stages. This methodical strategy is aimed at tackling fundamental research concerns, such as accuracy, model interpretability, and dataset variety, and at advancing the field of medical image classification.

2.3. Experimental Settings

The software packages used were selected for their stability and extensive use in machine learning and deep learning research; they provide a comprehensive range of tools for data processing, visualization, and model training. Python 3.8 was chosen for its versatility and its wealth of libraries for machine learning and deep learning projects. The precise network configurations for the models employed in this work are provided below, spanning classic machine learning models, deep learning models, few-shot learning models, and meta-learning models. Standard configurations for initial hyperparameters, including kernel types and optimization techniques, are applied to standard ML models such as Random Forest, Logistic Regression, and SVM (linear, polynomial, and RBF kernels). Models were evaluated with several kernels (linear, polynomial, and RBF) to determine the best settings for classifying brain tumors. To ensure reproducibility, the Random Forest model was configured with 100 estimators and a random state of 42. The Adam optimizer with the specified settings was used to optimize deep learning models such as the ANN, CNN, LSTM, and ResNet. CNNs were built from convolutional layers followed by max pooling, flatten, dense, and activation layers to efficiently learn spatial features from MRI images in the Brain Tumor Dataset [19].
Few-shot learning models, such as Prototypical Networks and Matching Networks, were constructed to meet the challenge of minimal annotated data. These models were implemented with the Adam optimizer and loss functions including negative log-likelihood and cross-entropy to improve performance in low-data scenarios. Meta-learning models such as MAML and Reptile improved learning efficiency and adaptability across numerous tasks by employing tactics like gradient updates and quick adaptation. The Model-Agnostic Meta-Learning (MAML) approach used in this work is based on a Convolutional Neural Network (CNN). The CNN architecture starts with an input layer sized for 64 × 64 grayscale images. Convolution and max pooling layers then extract hierarchical features: a Conv2D layer with 32 filters and a (3 × 3) kernel followed by a MaxPooling2D layer, repeated with another Conv2D layer of 64 filters and another MaxPooling2D layer. The extracted features are then flattened and passed through a dense layer of 128 ReLU-activated neurons before reaching the output layer, whose softmax activation classifies the input into one of four categories: pituitary, no tumor, meningioma, and glioma.
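A minimal PyTorch sketch of this backbone is given below. The layer sequence follows the description above (the description uses Keras-style layer names); padding and pooling strides are assumptions, since they are not specified:

```python
# Sketch of the 4-class CNN backbone used by MAML (assumed padding/strides).
import torch
import torch.nn as nn

class MAMLBackbone(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 64x64 grayscale input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),  # softmax is folded into CrossEntropyLoss during training
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# logits = MAMLBackbone()(torch.randn(8, 1, 64, 64))  # -> shape (8, 4)
```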
A ResNet model was also utilized, combining a series of residual blocks designed to address the vanishing gradient issue and enable the training of deeper networks. Each residual block has two convolutional layers with batch normalization and ReLU activations. An identity mapping adds the input of each block directly to its output, ensuring uninterrupted gradient flow through the network; a downsampling layer is employed when the input and output dimensions do not match. The design stacks several residual blocks, each doubling the number of filters while halving the spatial dimensions, followed by an adaptive average pooling layer and a fully connected layer that classifies the images into the designated categories. Both models are trained with the Adam optimizer and evaluated across multiple metrics, including accuracy, F1 score, MCC, balanced accuracy, and AUC-ROC, ensuring a complete performance analysis. The MAML algorithm’s ability to adapt to novel tasks with scarce data is further strengthened by k-fold cross-validation, yielding trustworthy validation accuracy estimates.
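The residual block described here can be sketched as follows; this is a generic formulation under the stated design (two 3 × 3 convolutions with batch normalization, an identity skip, and a projection path when dimensions differ), not the authors’ exact code:

```python
# Sketch of a residual block with identity mapping and optional downsampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection used only when input and output dimensions differ
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)  # skip connection keeps the gradient flowing
```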
These setups provide a comprehensive framework, enabling a robust comparison and extensive assessment of diverse techniques for brain tumor categorization. Each model’s design was fine-tuned to optimize performance, guaranteeing that the outcomes of this research are both reliable and useful to real-world medical diagnosis.
A table showing hyperparameter values for classical machine learning models is presented in Table 2. Table 3 presents hyperparameter settings for deep learning models, whereas Table 4 offers hyperparameter values for few-shot learning models. Table 5 contains extra hyperparameters for deep learning and few-shot learning models.
The MAML algorithm’s network architecture utilizing a CNN as the foundation model is presented in Figure 3.

2.4. Evaluation Metrics

2.4.1. Test Accuracy

Test accuracy measures the proportion of correctly classified brain tumor instances out of the total instances in the test set. The test accuracy is shown in Equation (1).
$$\text{Test Accuracy} = \frac{\text{Correctly Classified Brain Tumors}}{\text{Total Brain Tumors in Test Set}} \tag{1}$$

2.4.2. F1 Score

The F1 score is the harmonic mean of precision (PBT) and recall (RBT), providing a balanced measure that considers both false positives and false negatives for brain tumor classification. The F1 score is shown in Equation (2).
$$\text{F1 Score} = \frac{2 \times P_{BT} \times R_{BT}}{P_{BT} + R_{BT}} \tag{2}$$

2.4.3. Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient (MCC) assesses the quality of binary classifications in the context of brain tumor classification. In this formula, true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) reflect the counts of correctly and incorrectly classified cases of brain tumors. The MCC is shown in Equation (3).
$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{3}$$

2.4.4. Balanced Accuracy

Balanced accuracy is the average of sensitivity (Sen) and specificity (Spec) obtained on each class of brain tumors. The balanced accuracy is shown in Equation (4).
$$\text{Balanced Accuracy} = \frac{\text{Sen} + \text{Spec}}{2} \tag{4}$$

2.4.5. AUC Score and ROC Curve

The Area Under Curve (AUC) score represents the area under the Receiver Operating Characteristic (ROC) curve, which plots the True-Positive Rate (TPR) against the False-Positive Rate (FPR) for brain tumor classification. The AUC score is represented as the integral of the ROC curve, as shown in Equation (5).
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, \mathrm{d}(\mathrm{FPR}) \tag{5}$$

2.4.6. Precision (Micro) and Recall (Micro)

These metrics are aggregated across all brain tumor classes to compute the average. Here, total true positives (TPtotal), total false positives (FPtotal), and total false negatives (FNtotal) are used. The equations are shown in Equations (6) and (7).
$$\text{Precision (Micro)} = \frac{TP_{\mathrm{total}}}{TP_{\mathrm{total}} + FP_{\mathrm{total}}} \tag{6}$$
$$\text{Recall (Micro)} = \frac{TP_{\mathrm{total}}}{TP_{\mathrm{total}} + FN_{\mathrm{total}}} \tag{7}$$

2.4.7. Confusion Matrix

The confusion matrix provides a summary of prediction results on brain tumor classification. It shows the count of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) for brain tumor classification, as shown in Table 6.

2.4.8. Training Accuracy and Validation Accuracy

These metrics measure the proportion of correctly classified brain tumors in the training (training accuracy) and validation sets (validation accuracy), respectively, as shown in Equations (8) and (9).
$$\text{Train Acc} = \frac{\text{Correct Tumors in Train Set}}{\text{Total Tumors in Train Set}} \tag{8}$$
$$\text{Valid Acc} = \frac{\text{Correct Tumors in Valid Set}}{\text{Total Tumors in Valid Set}} \tag{9}$$
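For reference, all of the metrics in Equations (1)–(9) are available in scikit-learn; the sketch below uses placeholder labels and dummy probabilities, not the study’s actual predictions:

```python
# Minimal sketch of computing the evaluation metrics with scikit-learn.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

# Placeholder labels for the four classes: 0=glioma, 1=meningioma, 2=no tumor, 3=pituitary.
y_true = np.array([0, 1, 2, 3, 2, 1, 0, 3])
y_pred = np.array([0, 1, 2, 2, 2, 1, 3, 3])
y_prob = np.random.default_rng(0).dirichlet(np.ones(4), size=8)  # dummy class probabilities

print("accuracy          :", accuracy_score(y_true, y_pred))                   # Eqs. (1), (8), (9) per split
print("F1 (micro)        :", f1_score(y_true, y_pred, average="micro"))        # Eq. (2)
print("MCC               :", matthews_corrcoef(y_true, y_pred))                # Eq. (3)
print("balanced accuracy :", balanced_accuracy_score(y_true, y_pred))          # Eq. (4)
print("AUC-ROC (ovr)     :", roc_auc_score(y_true, y_prob, multi_class="ovr")) # Eq. (5)
print("precision (micro) :", precision_score(y_true, y_pred, average="micro")) # Eq. (6)
print("recall (micro)    :", recall_score(y_true, y_pred, average="micro"))    # Eq. (7)
```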

2.4.9. Interpretability Analysis

To strengthen the clinical relevance and transparency of our study, we conducted an interpretability analysis of both deep learning and few-shot learning models. We employed two widely adopted techniques: Gradient-weighted Class Activation Mapping (Grad-CAM) and saliency maps. Grad-CAM was used for the CNN, ResNet, and MAML-lite models, while saliency maps were applied to the LSTM, MatchingNet, and ProtoNet models. These methods highlight the regions of MRI images that contribute most strongly to a model’s decision, thereby allowing identification of potential causes of misclassification. For each of the four classes (glioma, meningioma, pituitary tumor, and no tumor), representative samples were selected to visualize the model’s attention. Importantly, we also analyzed a critical failure case: the frequent misclassification of “no tumor” cases as “pituitary tumor” under the 10% dataset condition. The interpretability maps provide insight into whether models rely on clinically relevant structures or on confounding regions. This step is crucial for evaluating the trustworthiness of machine learning predictions in medical applications.
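A compact sketch of the Grad-CAM computation used for the CNN-based models is shown below; it is a generic hook-based PyTorch implementation, assuming a standard feed-forward classifier, rather than the study’s exact code:

```python
# Generic Grad-CAM sketch: gradient-weighted average of a conv layer's activations.
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """x: (1, C, H, W) image tensor; target_layer: typically the model's last conv layer."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    logits = model(x)
    idx = int(logits.argmax(dim=1)) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, idx].backward()                            # gradients of the target class score
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # per-channel importance
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()  # heatmap in [0, 1]
```

The resulting heatmap is overlaid on the MRI slice for visual inspection, as in Figures 8 and 9.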

2.5. Cross-Dataset Validation

To evaluate the generalization ability of our models, we conducted additional experiments using the BraTS dataset as an external validation set. While the Kaggle Brain Tumor MRI Dataset was used for model training, performance was assessed on BraTS to simulate real-world deployment where training and testing data originate from different sources. The BraTS dataset includes multi-institutional MRI scans with heterogeneous acquisition protocols and higher variability compared to Kaggle. We applied the same preprocessing pipeline (resizing, normalization, and augmentation) to ensure compatibility. All models (CNN, LSTM, ResNet, MAML-lite, MatchingNet, ProtoNet, and Reptile) were trained exclusively on Kaggle data and evaluated on BraTS test samples. This design enabled us to directly assess cross-dataset generalization performance.

2.6. Data Variability Analysis

To further assess robustness, we examined how models respond to common sources of data variability that occur in clinical imaging workflows. These included the following:
  • Resolution variability: Images were rescaled to simulate differences in scanner resolution (e.g., 128 × 128 vs. 64 × 64).
  • Acquisition noise: Gaussian noise with varying signal-to-noise ratios (SNR 20 dB, 15 dB, 10 dB) was added to mimic scanner noise (a sketch of this perturbation follows below).
  • Modality shift: We evaluated model sensitivity by testing the CNN, ResNet, and MAML on FLAIR sequences from BraTS, while training was performed on T1-weighted Kaggle MRI scans.
The preprocessing pipeline ensured consistent normalization, and performance was compared across baseline (unaltered images) and perturbed datasets.
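The acquisition-noise perturbation can be sketched as follows: additive Gaussian noise scaled so that the result has a target SNR in dB. This is one standard formulation and is an assumption about implementation details not stated above:

```python
# Sketch: add Gaussian noise at a target SNR to a [0, 1]-normalized image.
import numpy as np

def add_noise_at_snr(img: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    signal_power = np.mean(img ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))   # SNR = 10 * log10(Ps / Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), img.shape)
    return np.clip(img + noise, 0.0, 1.0)

# perturbed = [add_noise_at_snr(img, s) for s in (20, 15, 10)]  # SNR sweep used above
```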

2.7. Training Protocol and Fairness Across Models

To ensure fairness in comparison, we applied a consistent training protocol across all models, with adaptations only when necessary due to model architecture, as follows:
  • Epochs: Each model was initially trained for up to 100 epochs. However, to prevent overfitting and to ensure a fair computational budget, early stopping with patience of 10 epochs (based on validation loss) was applied across all models.
  • Learning rates: All deep learning and few-shot models were trained with the Adam optimizer (learning rate = 0.001) unless otherwise specified in the original model design (e.g., MAML and Reptile additionally tested with SGD at 0.01 for meta-updates). Learning rate schedules (step decay) were used consistently.
  • Batch sizes: A mini-batch size of 32 was used for the CNN, LSTM, and ResNet, while episodic training was employed for few-shot learning models (5-way, 5-shot, batch size = 32 tasks per episode); a sketch of the episodic protocol is given below this list.
  • Data splits: Models were trained on identical training/validation/test splits across dataset sizes (100%, 75%, 50%, 25%, 10%) to ensure comparability.
  • Regularization: Dropout (p = 0.5) and L2 regularization (λ = 1 × 10⁻⁴) were applied to all deep learning models.
This uniform training setup ensured that performance differences arise primarily from the models themselves rather than from disparities in training strategy.
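To illustrate the episodic protocol and the meta-update mentioned above, the sketch below samples an N-way, K-shot episode and performs a first-order MAML-style inner adaptation. The `by_class` data structure and the use of `torch.func.functional_call` (PyTorch ≥ 2.0) are assumptions for illustration:

```python
# Sketch: episodic sampling and a first-order MAML inner-loop adaptation.
import random
import torch
import torch.nn.functional as F

def sample_episode(by_class, n_way=4, k_shot=5, q_query=5):
    """by_class: dict class_id -> list of image tensors; returns support/query lists."""
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        imgs = random.sample(by_class[c], k_shot + q_query)
        support += [(x, episode_label) for x in imgs[:k_shot]]
        query += [(x, episode_label) for x in imgs[k_shot:]]
    return support, query

def inner_adapt(model, support, inner_lr=0.01, steps=1):
    """Clone the weights, then take SGD steps on the support set (first-order MAML)."""
    fast = {n: p.clone() for n, p in model.named_parameters()}
    xs = torch.stack([x for x, _ in support])
    ys = torch.tensor([y for _, y in support])
    for _ in range(steps):
        logits = torch.func.functional_call(model, fast, (xs,))
        grads = torch.autograd.grad(F.cross_entropy(logits, ys), list(fast.values()))
        fast = {n: p - inner_lr * g for (n, p), g in zip(fast.items(), grads)}
    return fast  # adapted weights; the outer loop evaluates them on the query set
```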

2.8. Dataset Preprocessing and Tumor Boundary Preservation

All images in the Kaggle and BraTS datasets originally varied in size and intensity range, which could introduce bias in feature extraction. The preprocessing pipeline was applied identically across all models to maintain fairness. To standardize inputs, we applied the following preprocessing steps:
  • Resizing: All MRI slices were resized to 128 × 128 pixels for computational efficiency and consistency across models. To minimize the risk of boundary distortion for small tumors, we preserved aspect ratio using padding where necessary before resizing, ensuring that anatomical proportions were not skewed (see the sketch after this list).
  • Normalization: Intensity values were normalized to the range [0, 1] after per-image min–max scaling. This step preserved local contrast while removing inter-scan brightness variations.
  • Noise removal and augmentation: Median filtering was applied selectively to reduce scanner noise. Augmentation included rotation (±15°), horizontal flipping, and zooming (up to 10%) to improve generalization without altering structural integrity of tumors.
  • Boundary preservation check: To ensure that preprocessing did not erase or blur tumor edges, we performed qualitative visual inspection with Grad-CAM overlays on original vs. resized inputs. Tumor boundaries remained identifiable, even for small meningiomas and gliomas.
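A minimal sketch of the resizing and normalization steps follows, assuming 2D slices supplied as NumPy arrays; the pad-to-square strategy is one way to realize the aspect-ratio preservation described above:

```python
# Sketch: aspect-preserving pad-to-square resize plus per-image min-max scaling.
import numpy as np
from PIL import Image

def preprocess(slice_2d: np.ndarray, size: int = 128) -> np.ndarray:
    img = Image.fromarray(slice_2d.astype(np.float32))
    # Pad to a square canvas first so anatomical proportions are preserved
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("F", (side, side), 0.0)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    arr = np.asarray(canvas.resize((size, size), Image.BILINEAR))
    # Per-image min-max scaling to [0, 1]
    lo, hi = arr.min(), arr.max()
    return (arr - lo) / (hi - lo) if hi > lo else np.zeros_like(arr)
```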

2.9. Computational Complexity and Model Efficiency

To evaluate real-world feasibility, we conducted a complexity analysis of the main models. Key indicators included the number of trainable parameters and the approximate number of floating-point operations (FLOPs) required per inference. As with edge detectors tuned to target domains [27], our cross-dataset results underscore the value of domain adaptation for clinical robustness. Results are summarized in Table 7.
The results show that while ResNet has the highest capacity, few-shot models achieve competitive accuracy with significantly fewer parameters and FLOPs, making them better suited for resource-constrained environments such as clinical imaging systems.
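The two indicators can be obtained as sketched below; `fvcore` is one of several FLOP counters and is an assumption here, and `resnet18` merely stands in for any of the compared models:

```python
# Sketch: counting trainable parameters and approximate FLOPs per inference.
import torch
from torchvision.models import resnet18
from fvcore.nn import FlopCountAnalysis

model = resnet18(num_classes=4)  # stand-in backbone for illustration
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
flops = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total()
print(f"params: {n_params / 1e6:.2f} M, FLOPs/inference: {flops / 1e9:.2f} G")
```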

2.10. Dataset Class Distribution

The Kaggle Brain Tumor MRI Dataset contains 7023 images distributed across four classes (Table 8).
This distribution indicates moderate class imbalance, with the no-tumor class underrepresented relative to tumor classes. To mitigate imbalance effects, data augmentation (rotation, flipping, zooming) was applied uniformly across classes. In addition, balanced accuracy and MCC scores were reported alongside accuracy to ensure fair evaluation.

3. Results

3.1. Performance Comparison of Few-Shot Learning Models and Traditional Models

Few-shot learning models, notably MAML, exhibit remarkable performance even with minimal data. For example, MAML reaches an accuracy of 89% and an AUC-ROC score of 0.9968, highlighting its capacity to sustain excellent performance with a small amount of data. In contrast, standard machine learning models like XGBoost and Random Forest also perform strongly with the whole dataset, with accuracies of 95.50% and 93.75%, respectively. However, as the dataset size decreases to 10%, their accuracy declines to 81% for XGBoost and 82% for Random Forest, illustrating both their general stability and their vulnerability to data reduction. Deep learning models such as the CNN perform strongly with the entire dataset, obtaining 93.82% accuracy and an AUC-ROC of 0.9894; however, similar to traditional models, their performance drops sharply with smaller datasets. For instance, CNN accuracy falls to 76.90% when using just 10% of the data. ResNet, another deep learning model, shows similar trends, obtaining 96% test accuracy with the complete dataset but declining to 75% with 10% of the data. These results indicate that while deep learning models excel with ample data, few-shot learning models like MAML can deliver competitive performance even in data-constrained contexts, making them particularly beneficial when data is sparse.
AUC-ROC scores: AUC-ROC scores further underscore the discriminative ability of these models across diverse dataset sizes. For instance, ResNet and XGBoost retain good discriminative ability with AUC-ROC scores of 0.9894 and 0.9968, respectively, even with less data. MAML also displays superior performance, particularly in few-shot learning settings with low data availability, consistently maintaining an AUC-ROC score above 0.959 across varied dataset sizes. These AUC-ROC scores for various models and dataset sizes are visually shown in Figure 4A.
Balanced accuracy: The balanced accuracy of deep learning models like the CNN and ResNet is outstanding with full datasets, with scores such as 0.9375 and 0.96, respectively. However, as dataset size diminishes, their performance declines; for example, the CNN’s balanced accuracy drops to 0.75 with only 10% of the data. In contrast, established machine learning models like Random Forest and XGBoost demonstrate higher stability across various dataset sizes, sustaining balanced accuracies of 0.86 and 0.79, respectively, even with low data. This higher generalization capacity in low-data scenarios makes traditional models more robust when data is insufficient. The balanced accuracies of different models across various dataset sizes are illustrated in Figure 4B.
MCC scores: The Matthews correlation coefficient (MCC) provides a complete perspective of the models’ precision and recall across different dataset sizes. ResNet and XGBoost regularly achieve high MCC scores, such as 0.88 and 0.87, across all dataset sizes, suggesting their durability. MAML also performs well, particularly with smaller datasets, attaining an MCC score of 0.72 with just 10% of the data. This robustness is critical for medical imaging applications where data may be restricted. The MCC scores of several models across different dataset sizes are displayed in Figure 4C.
Test accuracy: Traditional machine learning models like Random Forest and XGBoost maintain decent accuracy across all dataset sizes, with scores of 81% and 82% for XGBoost and Random Forest, respectively, even with only 10% of the data. In contrast, deep learning models like CNN and ResNet see a noticeable loss in test accuracy as the dataset size decreases; for instance, the CNN’s test accuracy reduces from 93.82% with the whole dataset to 76.90% with only 10% of the data. MAML, however, confirms its promise in few-shot learning scenarios by retaining competitive performance with low data, attaining test accuracies of 74% even with 10% of the dataset. The test accuracies of several models across different dataset sizes are displayed in Figure 4D.

3.2. Classifications for Different Dataset Sizes for Deep Learning and Few-Shot Learning Models

Deep learning models: The performance of deep learning models varies substantially with dataset size. At 10% dataset size, CNN models struggle with certain classifications, particularly with “no tumor” cases, resulting in a decline in accuracy to 76.90%. However, the model performs better for glioma and meningioma diagnoses, as illustrated in Figure 5A. As the dataset size increases to 25%, the accuracy of the CNN improves to 82%, although certain misclassifications continue, especially in discriminating pituitary tumors from non-tumors. With a 50% dataset size, the CNN’s performance continues to improve, obtaining an accuracy of 89%, successfully diagnosing glioma, meningioma, and pituitary tumors with fewer errors, as seen in Figure 5B. At 75% dataset size, the CNN achieves outstanding accuracy of 92.5%, correctly detecting most tumor kinds, including “no tumor” cases, as illustrated in Figure 5C. Finally, at a full 100% dataset size, the CNN displays exceptional performance, with a test accuracy of 93.82%, successfully classifying all tumor kinds, as shown in Figure 5D.
Few-shot learning models: Few-shot learning models like MAML display competitive performance even with reduced dataset sizes. At a 10% dataset size, MAML obtains an accuracy of 74%, performing well in diagnosing glioma and meningioma but struggling with “no tumor” cases, as demonstrated in Figure 6A. As the dataset size increases to 25%, MAML’s performance improves to an accuracy of 88%, with fewer errors and greater distinction between tumor types. At a 50% dataset size, MAML maintains high performance with an accuracy of 89%, accurately recognizing pituitary tumors, meningioma, and glioma, indicating its robustness even with little data, as illustrated in Figure 6B. With a 75% dataset size, MAML achieves an accuracy of 90%, accurately diagnosing all tumor types, including “no tumor” cases, as illustrated in Figure 6C. At the full 100% dataset size, MAML demonstrates exceptional accuracy of 96%, accurately diagnosing all tumor classes, as illustrated in Figure 6D.

3.3. Performance of Traditional Machine Learning Models Across Dataset Sizes

Traditional models improved predictably with more data (Figure 7). With 10% of the training set, a linear SVM reached ≈70% accuracy and frequently confused “no tumor” with pituitary (Figure 7A), while Random Forest achieved ≈82%. At 25%, SVM and Random Forest rose to ≈80% and ≈85%, respectively. At 50%, XGBoost and Logistic Regression reached ≈90% and ≈88% (Figure 7B). The SVM (RBF) variant further improved to ≈91% across classes, including “no tumor”. At 75%, Decision Tree and Gaussian Naïve Bayes approached ≈92% (Figure 7C). With the full dataset, SVM and Random Forest attained ≈95% and ≈93% (Figure 7D).

3.4. Performance of Few-Shot Learning Models in Brain Cancer Classification

We evaluated few-shot learning (FSL) models for brain tumor MRI classification across training set sizes, using AUC-ROC as the primary metric. MAML consistently outperformed other FSL approaches, achieving an AUC-ROC of 0.996 with 100% of the data and remaining strong at 0.992, 0.987, 0.979, and 0.959 with 75%, 50%, 25%, and 10% of the data, respectively. Prototypical Network and Matching Network showed moderate performance (0.878 and 0.906 at 100%; 0.867 and 0.711 at 75%). Reptile performed well (0.960 at 100%; 0.920 at 10%). By contrast, SimpleShot underperformed (0.750 at 100%) with larger declines at smaller data sizes. These results underscore the importance of selecting an appropriate FSL architecture for medical imaging. AUC-ROC scores across dataset sizes are reported in Table 9.

3.5. Key Determinants of Few-Shot Learning Model Performance

Several critical factors influence the effectiveness of few-shot learning models in medical image classification, particularly across varying dataset sizes:
  • Model architecture: The architecture of MAML significantly impacts its performance, allowing rapid adaptation to new tasks. For example, MAML’s architecture enables it to achieve high accuracy across various dataset sizes, consistently outperforming other models.
  • Feature extraction: The quality of feature extraction, typically reliant on CNN backbone models, is a major determinant of performance. Variations in the feature extraction process can lead to differences in model performance, as observed in the varying AUC-ROC scores of different models.
  • Episodic training: Essential for models like Prototypical and Matching Networks, episodic training is crucial for learning from limited data. This training approach helps these models perform well even with small datasets.
  • Distance metrics: The choice of distance metric (e.g., Euclidean distance for Prototypical Networks) affects the model’s ability to discriminate among tumor types; the effectiveness of these metrics is reflected in the models’ AUC-ROC scores (see the sketch after this list).
  • Data augmentation: Techniques used to augment data, such as those in the AugmentedDataset class, enhance model generalization despite minimal data. This is particularly important for maintaining model performance across different dataset sizes.
  • Support set handling: How models handle the support set, including prototype computation in Prototypical Networks, significantly influences performance. Proper handling of the support set can lead to higher accuracy and better generalization.
  • Feature normalization: Feature normalization improves generalization across samples and classes, as demonstrated by SimpleShot. Normalized features lead to more stable performance metrics across different dataset sizes.
  • Model complexity: Balancing model complexity with generalization capability is critical for optimal performance, as seen in various architectures. A more complex model may perform better with more data but could struggle with smaller datasets.
  • Task complexity: The inherent difficulty of MRI tumor classification underscores the need for robust models capable of distinguishing between different cancer types. The task complexity requires models to have high accuracy and precision to be effective.
These factors collectively shape the performance of few-shot learning models in medical image classification, underscoring the importance of accounting for them during model construction and deployment. The key parameters affecting few-shot learning model performance are listed in Table 10.
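As a concrete example of the prototype computation and distance-based classification referenced in the list above, the following is a minimal sketch; `support_feats` and `query_feats` are assumed to come from any embedding backbone:

```python
# Sketch: class prototypes and nearest-prototype classification (Prototypical Networks).
import torch

def prototypes(support_feats: torch.Tensor, support_labels: torch.Tensor, n_way: int):
    """Mean embedding per class: support_feats (N, D), support_labels (N,) -> (n_way, D)."""
    return torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_way)])

def classify(query_feats: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Assign each query to its nearest prototype by squared Euclidean distance."""
    dists = torch.cdist(query_feats, protos) ** 2  # (Q, n_way)
    return dists.argmin(dim=1)                     # (Q,) predicted class ids
```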

3.6. Interpretability Analysis Results

The interpretability results demonstrate clear differences between models in terms of their decision-making processes.

3.6.1. Deep Learning Models

CNN Grad-CAM maps (Figure 8) show that for glioma and meningioma, the network attends to tumor regions, validating correct classifications. However, in the 10% dataset case, CNN activations for “no tumor” images often overlap with the pituitary region, leading to false positives for pituitary tumors. ResNet Grad-CAM maps similarly highlight tumor regions with stronger localization consistency than the CNN. LSTM saliency maps indicate temporal activation patterns but show greater dispersion, suggesting less precise localization.

3.6.2. Few-Shot Learning Models

MAML-lite Grad-CAM results (Figure 9) reveal strong attention to tumor regions in correctly classified glioma images, but in some “no tumor” cases, activations shifted toward the pituitary region, explaining observed confusion. MatchingNet saliency maps occasionally emphasized background structures, contributing to errors in meningioma cases (Figure 9). ProtoNet saliency visualizations highlighted correct tumor regions in glioma but sometimes misfocused on meningioma areas, leading to cross-class misclassifications (Figure 9).
Overall, these analyses provide direct visual evidence of the models’ strengths and limitations. In particular, they confirm that dataset reduction (10% scenario) amplifies confusion between “no tumor” and “pituitary tumor” due to overlapping activation regions.

3.7. Visualization of Model Decision Features

To address interpretability concerns, we incorporated visual explanations of how models make predictions. Specifically, we generated Grad-CAM maps for CNN, ResNet, and MAML-lite, and saliency maps for LSTM, MatchingNet, and ProtoNet. These heatmaps illustrate the image regions that contributed most to the model’s decision for each class (glioma, meningioma, pituitary tumor, and no tumor). For glioma and meningioma, CNN and ResNet heatmaps consistently highlighted tumor boundaries, validating that the models rely on clinically relevant regions. For no-tumor cases misclassified as pituitary tumor (under 10% dataset), both CNN and MAML-lite heatmaps showed strong activation in the pituitary region, explaining the confusion. LSTM saliency maps revealed broader distributed activations, while MatchingNet occasionally focused on background textures, leading to errors in meningioma classification. ProtoNet showed reasonable localization but sometimes confused glioma and meningioma boundaries when they overlapped anatomically. Representative examples are shown in Figure 8 and Figure 9, where heatmaps are overlaid on MRI slices to enhance clinical interpretability.

3.8. Cross-Dataset Validation Results

The external validation experiments demonstrated that few-shot learning models retain stable performance across heterogeneous datasets. MAML maintained robust performance with an AUC-ROC of 0.952 and balanced accuracy of 0.82 on BraTS, closely aligned with its Kaggle performance (AUC-ROC 0.9969 with 100% data and 0.9595 with 10% data). This consistency highlights its strong generalization capacity. For the CNN and ResNet, both models experienced more substantial performance drops. CNN accuracy decreased from 93.82% (Kaggle full data) to 79.5% (BraTS) and ResNet from 96% to 83.2%. Misclassifications were more common in “no tumor” cases, reflecting distributional differences across datasets. LSTM achieved moderate generalization with an accuracy of 78.4%, indicating sensitivity to dataset shift. For other few-shot models (ProtoNet, MatchingNet, Reptile), performance was lower than MAML but still competitive, with AUC-ROC scores ranging from 0.81 to 0.88.

3.9. Impact of Data Variability on Model Performance

The analysis revealed that variability in resolution, noise, and modality has different impacts on traditional deep learning and few-shot learning approaches. With variability in resolution, CNN and ResNet accuracies decreased by ≈7–10% when tested on 64 × 64 rescaled inputs, while MAML maintained a smaller drop (≈4%), suggesting stronger adaptability to resolution changes. With acquisition noise, all models showed performance degradation under low-SNR images. CNN accuracy dropped from 93.8% to 81.5% at SNR = 10 dB and ResNet from 96% to 84.1%, while MAML achieved 87.2%, showing relative robustness. With a modality shift (T1 → FLAIR), deep models trained on Kaggle T1 images misclassified FLAIR inputs more frequently, with accuracy drops >15%. In contrast, MAML retained higher generalization (AUC-ROC 0.91 vs. 0.996 on T1), outperforming baseline models.

3.10. Per-Class Performance and Error Analysis

To provide deeper insight into model behavior, we analyzed per-class performance using confusion matrices and ROC curves. Confusion matrices for CNN, LSTM, MAML-lite, traditional ML methods, and MatchingNet are provided in Figure 10 and Figure 11. Similarly, confusion matrices for some representative traditional machine learning methods are shown in Figure 12. These illustrate per-class classification performance and highlight systematic misclassifications. For example, the CNN tended to confuse “no tumor” with “pituitary tumor” (61 cases), whereas MAML-lite showed more balanced predictions across classes.
While the CNN achieved strong overall accuracy, confusion matrix analysis revealed frequent misclassification between no-tumor and pituitary tumor cases (e.g., 61 no-tumor cases misclassified as pituitary). The ROC curve indicated a micro-average AUC of 0.78, highlighting sensitivity to class imbalance. LSTM achieved higher micro-average AUC (0.85) and better separation across classes, but confusion matrices showed confusion between glioma and meningioma cases. Few-shot learning improved balance across classes, with more consistent predictions for glioma and pituitary tumor. However, misclassifications still occurred in meningioma vs. glioma (48 and 42 cases misclassified, respectively). MatchingNet achieved robust classification of glioma and no tumor, though pituitary and meningioma cases occasionally overlapped (e.g., 45 gliomas misclassified as meningioma, 36 meningiomas misclassified as glioma). Overall, per-class analysis shows that while averages suggest strong performance, errors are not uniformly distributed. Deep learning models (CNN, LSTM) are more prone to no tumor ↔ pituitary confusion, whereas few-shot models reduce this bias but face challenges separating glioma and meningioma due to boundary similarity.

4. Discussion

Across matched training protocols and dataset regimes, few-shot learning (FSL)—especially MAML—maintained high discrimination under label scarcity, while standard deep models (CNN/ResNet) were competitive with full data but degraded more strongly as data shrank. Traditional ML remained stable but underperformed compared to the best DL/FSL at scale. These results support FSL as a data-efficient alternative for brain tumor MRI when annotations are limited.
External validation (train: Kaggle; test: BraTS) and variability stress tests (down-resolution, acquisition noise, T1→FLAIR shift) highlighted distribution-shift sensitivity for all models. MAML showed smaller cross-dataset drops and comparatively better tolerance to reduced resolution and added noise, whereas modality shift remained challenging across methods. This pattern suggests meta-learning confers greater adaptability to heterogeneous clinical data sources, though multimodal training and domain adaptation will be needed to further mitigate shift. Results from noisy, imbalanced domains [28] reinforce our emphasis on balanced metrics and robustness checks.
Grad-CAM and saliency maps confirmed that correct predictions center on tumor regions, while systematic errors are explained by misplaced attention (e.g., “no tumor” scans activating near the pituitary, and boundary overlap driving glioma ↔ meningioma confusion). Per-class confusion matrices revealed these biases explicitly, clarifying where additional data, augmentation, or class-specific regularization would be most effective.
A uniform protocol (shared splits, optimizer/learning rate, early stopping, and regularization) aligned computational budgets across models, supporting a fair comparison. Performance differences therefore reflect algorithmic properties: meta-updates and episodic training in FSL improve adaptation with few labels; deep backbones excel at scale; and classical models provide robust baselines under scarcity but plateau earlier.
Complexity analysis shows FSL baselines achieve strong accuracy with fewer parameters/FLOPs than deeper backbones, favoring deployment on resource-constrained systems. Limitations include reliance on public datasets, residual modality-shift errors, and no prospective clinical workflow evaluation. Future work will integrate attention modules (e.g., ECBAM/AFFM) into FSL, leverage multimodal MRI, and evaluate domain adaptation to improve small-lesion sensitivity and cross-site robustness. As shown for Parkinson’s classification with GA-based feature selection [29], principled model design mitigates small-sample risks; our standardized protocol and meta-learning results extend this insight to brain tumor MRI.

Future Directions

This article presented a comparative study of multiple machine learning models for brain tumor classification from MRI images. The study began with a complete literature analysis to grasp state-of-the-art approaches in medical image classification, guiding the selection of suitable models for this experiment. The primary models investigated were basic machine learning models, deep learning architectures such as CNNs, and few-shot learning models, including Prototypical Networks and Model-Agnostic Meta-Learning (MAML). The data used was sourced from Kaggle and consisted of MRI images sorted into glioma, meningioma, pituitary tumor, and non-tumor categories. Substantial data preprocessing was conducted to ensure the quality and homogeneity of the images. The models were evaluated on criteria including accuracy, precision, recall, and F1 score, with particular focus on their performance with little training data. The results emphasized the usefulness of few-shot learning models in settings where annotated data is scarce, displaying considerable gains in classification accuracy over standard techniques. This research stresses the potential of few-shot learning models in boosting the accuracy and efficiency of brain tumor classification, delivering valuable insights for optimizing machine learning systems for medical diagnostics.

5. Conclusions

In conclusion, this study illustrates the tremendous potential of few-shot learning models, particularly MAML, in enhancing medical image categorization, especially in scenarios with minimal annotated data. Compared to standard and deep learning models, few-shot learning models demonstrate higher adaptability and robust performance across varied dataset sizes, making them particularly relevant for medical diagnostics. The findings emphasize the importance of model selection and optimization based on specific clinical requirements and data availability. While classic models like XGBoost and Random Forest perform well with large datasets, their accuracy declines with smaller datasets, highlighting the significant advantage of few-shot learning methods. Future research should focus on addressing the limitations of the current study, expanding datasets, improving model interpretability, and developing hybrid approaches to further enhance diagnostic accuracy and clinical applicability. Additionally, deploying these models in real-world clinical settings and evaluating their impact on diagnostic workflows and patient outcomes will be crucial. Overall, this research provides a foundational step towards integrating advanced machine learning techniques into healthcare, with the aim of improving diagnostic accuracy, patient outcomes, and healthcare efficiency. By advancing these approaches, substantial progress can be made in personalized treatment and the overall enhancement of healthcare delivery.

Author Contributions

Conceptualization, S.S.V. and R.H.A.; methodology, S.S.V.; software, T.A.K. and S.F.; validation, I.A.; formal analysis, I.A.; writing—original draft preparation, S.S.V., S.F. and R.H.A.; writing—review and editing, R.H.A. and S.S.V.; supervision, R.H.A. and T.A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data repositories used in the project are linked.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ouahab, A.; Ahmed, O.B. ProtoMed: Prototypical networks with auxiliary regularization for few-shot medical image classification. Image Vis. Comput. 2025, 154, 105337.
  2. Szucs, G.; Németh, M. Double-view matching network for few-shot learning to classify COVID-19 in X-ray images. Infocommun. J. 2021, 13, 26–34.
  3. Alsaleh, A.M.; Albalawi, E.; Algosaibi, A.; Albakheet, S.S.; Khan, S.B. Few-shot learning for medical image segmentation using 3D U-Net and Model-Agnostic Meta-Learning (MAML). Diagnostics 2024, 14, 1213.
  4. Singh, R.; Bharti, V.; Purohit, V.; Kumar, A.; Singh, A.K.; Singh, S.K. MetaMed: Few-shot medical image classification using gradient-based meta-learning. Pattern Recognit. 2021, 120, 108111.
  5. Ali, R.H.; Muhammad, S.A.; Arvestad, L. GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm. BMC Evol. Biol. 2016, 16, 120.
  6. Nayem, J.; Hasan, S.S.; Amina, N.; Das, B.; Ali, M.S.; Ahsan, M.M.; Raman, S. Few-shot learning for medical imaging: A comparative analysis of methodologies and formal mathematical framework. In Data Driven Approaches on Medical Imaging; Springer: Berlin/Heidelberg, Germany, 2023; pp. 69–90.
  7. Baranwal, S.K.; Jaiswal, K.; Vaibhav, K.; Kumar, A.; Srikantaswamy, R. Performance analysis of brain tumour image classification using CNN and SVM. In Proceedings of the 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 15–17 July 2020; pp. 537–542.
  8. Cai, A.; Hu, W.; Zheng, J. Few-shot learning for medical image classification. In Proceedings of the International Conference on Artificial Neural Networks, Bratislava, Slovakia, 15–18 September 2020; pp. 441–452.
  9. Smith, B.J.; Hillis, S.L. Multi-reader multi-case analysis of variance software for diagnostic performance comparison of imaging modalities. In Proceedings of Medical Imaging 2020: Image Perception, Observer Performance, and Technology Assessment, Houston, TX, USA, 6–9 July 2020; Volume 11316, pp. 94–101.
  10. Jun, Y.; Shin, H.; Eo, T.; Hwang, D. Joint deep model-based MR image and coil sensitivity reconstruction network (Joint-ICNet) for fast MRI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5270–5279.
  11. Fasihi, M.S.; Mikhael, W.B. Brain tumor grade classification using LSTM neural networks with domain pre-transforms. In Proceedings of the 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Lansing, MI, USA, 9–11 August 2021; pp. 529–532.
  12. Brindha, P.G.; Kavinraj, M.; Manivasakam, P.; Prasanth, P. Brain tumor detection from MRI images using deep learning techniques. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1055, 012115.
  13. Seymour, Z.A.; Chan, J.W.; McDermott, M.W.; Grills, I.; Ye, H.; Kano, H.; Lehocky, C.A.; Jacobs, R.C.; Lunsford, L.D.; Chytka, T.; et al. Adverse radiation effects in volume-staged radiosurgery for large arteriovenous malformations: A multiinstitutional study. J. Neurosurg. 2021, 136, 503–511.
  14. Alsubai, S.; Khan, H.U.; Alqahtani, A.; Sha, M.; Abbas, S.; Mohammad, U.G. Ensemble deep learning for brain tumor detection. Front. Comput. Neurosci. 2022, 16, 1005617.
  15. Gu, C. Enhancing medical image classification with convolutional neural networks through transfer learning: A comprehensive review. Appl. Comput. Eng. 2024, 35, 280–284.
  16. Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. A gaze-driven manufacturing assembly assistant system with integrated step recognition, repetition analysis, and real-time feedback. Eng. Appl. Artif. Intell. 2025, 144, 110076.
  16. Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. A gaze-driven manufacturing assembly assistant system with integrated step recognition, repetition analysis, and real-time feedback. Eng. Appl. Artif. Intell. 2025, 144, 110076. [Google Scholar] [CrossRef]
  17. Mondal, A.; Cambria, E.; Das, D.; Hussain, A.; Bandyopadhyay, S. Relation extraction of medical concepts using categorization and sentiment analysis. Cogn. Comput. 2018, 10, 670–685. [Google Scholar] [CrossRef]
  18. Upadhyay, A.; Palival, U.; Jaiswal, S. Early brain tumor detection using random forest classification. In Proceedings of the Innovations in Bio-Inspired Computing and Applications: Proceedings of the 10th International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA 2019), Gunupur, India, 16–18 December 2019; Springer: Cham, Switzerland, 2021; pp. 258–264. [Google Scholar] [CrossRef]
  19. Brain Tumor MRI Dataset. Available online: https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset (accessed on 18 July 2025).
  20. Yang, S.; Zhu, F.; Ling, X.; Liu, Q.; Zhao, P. Intelligent health care: Applications of deep learning in computational medicine. Front. Genet. 2021, 12, 607471. [Google Scholar] [CrossRef] [PubMed]
  21. Huang, S.C.; Pareek, A.; Jensen, M.; Lungren, M.P.; Yeung, S.; Chaudhari, A.S. Self-supervised learning for medical image classification: A systematic review and implementation guidelines. NPJ Digit. Med. 2023, 6, 74. [Google Scholar] [CrossRef] [PubMed]
  22. Ahsan, M.M.; Luna, S.A.; Siddique, Z. Machine-learning-based disease diagnosis: A comprehensive review. Healthcare 2022, 10, 541. [Google Scholar] [CrossRef] [PubMed]
  23. Mall, P.K.; Singh, P.K.; Srivastav, S.; Narayan, V.; Paprzycki, M.; Jaworska, T.; Ganzha, M. A comprehensive review of deep neural networks for medical image processing: Recent developments and future opportunities. Healthc. Anal. 2023, 4, 100216. [Google Scholar] [CrossRef]
  24. Rana, M.; Bhushan, M. Machine learning and deep learning approach for medical image analysis: Diagnosis to detection. Multimed. Tools Appl. 2023, 82, 26731–26769. [Google Scholar] [CrossRef]
  25. Li, M.; Jiang, Y.; Zhang, Y.; Zhu, H. Medical image analysis using deep learning algorithms. Front. Public Health 2023, 11, 1273253. [Google Scholar] [CrossRef]
  26. Furizal, F.; Ma’arif, A.; Rifaldi, D. Application of machine learning in healthcare and medicine: A review. J. Robot. Control (JRC) 2023, 4, 621–631. [Google Scholar] [CrossRef]
  27. Ahmed, T.; Maaz, A.; Mahmood, D.; ul Abideen, Z.; Arshad, U.; Ali, R.H. The YOLOv8 Edge: Harnessing Custom Datasets for Superior Real-Time Detection. In Proceedings of the 2023 18th International Conference on Emerging Technologies (ICET), Peshawar, Pakistan, 6–7 November 2023; pp. 38–43. [Google Scholar]
  28. Mashhood, A.; ul Abideen, Z.; Arshad, U.; Ali, R.H.; Khan, A.A.; Khan, B. Innovative Poverty Estimation through Machine Learning Approaches. In Proceedings of the 2023 18th International Conference on Emerging Technologies (ICET), Peshawar, Pakistan, 6–7 November 2023; pp. 154–158. [Google Scholar]
  29. Iftikhar, M.; Ali, N.; Ali, R.H.; Bais, A. Classification of Parkinson Disease with Feature Selection using Genetic Algorithm. In Proceedings of the 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Regina, SK, Canada, 24–27 September 2023; pp. 522–527. [Google Scholar]
Figure 1. Representative MRI slices from the four study classes—glioma, meningioma, pituitary tumor, and no tumor—illustrating appearance variability used for validation.
Figure 2. Stepwise study workflow: data acquisition (Kaggle), labeling, preprocessing (resize/normalize/augment), model training (traditional ML, deep learning, few-shot learning), evaluation (accuracy, balanced accuracy, F1, MCC, AUC-ROC), model selection, and test-time inference for clinical decision support.
Figure 3. Schematic of the MAML (Model-Agnostic Meta-Learning) pipeline with a CNN backbone: stacked convolution + pooling blocks, feature flattening, dense layer, and softmax output over four classes.
Figure 4. Aggregate performance across dataset sizes (100%, 75%, 50%, 25%, 10%): (A) AUC-ROC, (B) balanced accuracy, (C) Matthews correlation coefficient (MCC), and (D) test accuracy for traditional ML, deep learning, and few-shot learning models. Error modes and per-class shifts are detailed in later figures.
Figure 5. Representative classifications by deep learning models across data regimes: (A) 10%, (B) 50%, (C) 75%, and (D) 100% training data. Accuracy improves with data scale, with persistent difficulty for “no tumor” in low-data settings.
Figure 6. Representative classifications by few-shot learning (e.g., MAML) across data regimes: (A) 10%, (B) 50%, (C) 75%, and (D) 100%; performance remains competitive under label scarcity.
Figure 7. Representative classifications by traditional machine learning models across data regimes: (A) 10%, (B) 50%, (C) 75%, and (D) 100%; SVM and Random Forest improve with more data but show early confusion of “no tumor” with pituitary.
Figure 8. Interpretability for deep models: Grad-CAM (CNN, ResNet) and saliency (LSTM) highlight decision-relevant regions. Under the 10% dataset, CNN activations on “no tumor” cases often concentrate near the pituitary, explaining false positives. Warmer colors (red/yellow) denote stronger contributions to the predicted class, while cooler colors (blue) indicate weaker contributions.
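For readers reproducing maps like those in Figures 8 and 9: Grad-CAM weights the last convolutional feature maps by the spatial average of their gradients with respect to the predicted class score. A minimal PyTorch sketch, assuming a torchvision ResNet-18 with four output classes as a stand-in for the paper's trained backbones, and a random tensor in place of an MRI slice:

```python
# Minimal Grad-CAM sketch (illustrative; not the paper's exact code).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=4).eval()
store = {}

# Hooks capture the last conv block's activations and output gradients.
model.layer4.register_forward_hook(
    lambda m, i, o: store.__setitem__("act", o))
model.layer4.register_full_backward_hook(
    lambda m, gi, go: store.__setitem__("grad", go[0]))

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed MRI slice
score = model(x)[0].max()         # score of the predicted class
score.backward()

weights = store["grad"].mean(dim=(2, 3), keepdim=True)  # GAP over space
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                    align_corners=False)
print(cam.shape)  # (1, 1, 224, 224) heat map to overlay on the input
```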
Figure 9. Interpretability for few-shot models: Grad-CAM (MAML-lite) and saliency (MatchingNet, ProtoNet, RelationNet) reveal tumor-focused attention on correct cases and background/pituitary activation on specific errors, clarifying cross-class confusions. Warmer colors (red/yellow) indicate regions with higher contribution to the predicted class, while cooler colors (blue) indicate lower contribution.
Figure 10. Confusion matrices for deep learning models: (left) CNN and (right) LSTM on the four-class task. Numbers indicate counts of test samples, with color intensity proportional to frequency (darker = more). Dotted diagonal lines represent perfect classification (all true cases predicted correctly). Common errors include “no tumor” misclassified as pituitary and glioma ↔ meningioma overlap.
Figure 11. Confusion matrices for few-shot learning models: (left) MAML-lite and (right) MatchingNet. Numbers show test-sample counts, with color shading scaled to frequency. Dotted diagonal lines mark perfect classification. Few-shot methods improved balance across classes but still show glioma ↔ meningioma confusions in difficult cases.
Figure 12. Confusion matrices for representative traditional machine learning models (e.g., SVM/RBF, Random Forest, XGBoost, Logistic Regression). Numbers represent sample counts per cell, with color intensity reflecting frequency. Dotted diagonal lines denote perfect classification. These highlight strengths under ample data and limitations under scarcity, including frequent no tumor ↔ pituitary confusions in low-data settings.
Table 1. Summary of related work in medical image analysis and few-shot learning: datasets, methods, headline results, contributions, and noted limitations.

Year | Author | Dataset | Method | Results and Contribution | Limitations
2024 | Alsaleh et al. [3] | 3D Medical | 3D U-Net, MAML | High segmentation accuracy, improving few-shot learning in 3D medical imaging. | Limited availability of labeled data affects training.
2023 | Nayem et al. [6] | Varied Medical | Affinity, Siamese | Achieved excellent performance in prediction tasks using affinity-based approaches. | Initial model configuration limits flexibility.
2020 | Baranwal et al. [7] | REMBRANDT | SVM, CNN | Accuracy: 82.38% (M), 83.01% (G), 95.27% (P). Improved accuracy in brain tumor classification. | Requires a large dataset for effective training.
2020 | Cai et al. [8] | Medical Images | Few-shot, attention | Accuracy: 92.44%. Demonstrated better performance than traditional methods in medical image classification. | Struggles with heterogeneous data.
2020 | Smith et al. [9] | Medical Datasets | Self-supervised | Improved robustness in medical image analysis. Applied successfully in medical imaging tasks. | Limited validation across diverse datasets.
2021 | Jun et al. [10] | MRI Images | Random Forest | Accuracy: 96.3%. Achieved high accuracy in MRI image classification. | Dataset quality and size limitations affect generalizability.
2021 | Fasihi and Mikhael [11] | Medical Datasets | LSTM, DWT, DCT | Accuracy: 86.98%. Utilized domain-specific transforms for classification. | High computational complexity.
2021 | Brindha et al. [12] | MRI Images | ANN, CNN | Achieved high detection accuracy, improving detection rates in MRI-based diagnosis. | Requires significant computational resources.
2021 | Seymour et al. [13] | Varied Medical | Meta, few-shot | Effective classification using advanced meta-learning techniques. | Requires careful tuning of hyperparameters.
2022 | Alsubai et al. [14] | Medical Datasets | CNN, LSTM | Demonstrated robust detection by leveraging spatial and temporal features. | High training complexity.
2024 | Gu et al. [15] | Large Datasets | Transfer, MobileNet | Successfully avoided overfitting and enhanced generalization across large datasets. | Dependence on pre-trained models.
2024 | Proposed study | Brain MRI | Few-shot, DL, ML | Achieved 89.00% accuracy, AUC-ROC: 0.9968 for MAML, showcasing potential in brain tumor classification. | Focus is primarily on brain tumors.
Table 2. Hyperparameter configurations for traditional machine learning baselines (SVM—linear/polynomial/RBF, Random Forest, Logistic Regression, XGBoost).

Model | Kernel/Type | Additional Configurations
SVM linear kernel | Linear | C = default, gamma = “auto”, probability = True
SVM polynomial | Polynomial | degree = 3, C = 1.0, gamma = “scale”, probability = True
SVM RBF kernel | RBF | C = 1.0, gamma = “scale”, probability = True
Random Forest | - | n_estimators = 100, random_state = 42
Logistic Regression | - | max_iter = 10,000
XGBoost | - | n_estimators = 100, max_depth = 3, random_state = 42
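The Table 2 configurations map one-to-one onto standard library constructors. A sketch assuming scikit-learn and the xgboost package, with the listed values set explicitly and all other settings left at library defaults:

```python
# Table 2 baselines as library constructors (values from the table;
# everything unlisted stays at library defaults).
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

baselines = {
    "SVM (linear)":     SVC(kernel="linear", gamma="auto", probability=True),
    "SVM (polynomial)": SVC(kernel="poly", degree=3, C=1.0, gamma="scale",
                            probability=True),
    "SVM (RBF)":        SVC(kernel="rbf", C=1.0, gamma="scale",
                            probability=True),
    "Random Forest":    RandomForestClassifier(n_estimators=100,
                                               random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=10_000),
    "XGBoost":          XGBClassifier(n_estimators=100, max_depth=3,
                                      random_state=42),
}
# Every baseline exposes fit/predict, so one evaluation loop serves all.
```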
Table 3. Hyperparameter configurations for deep learning models (ANN, CNN, LSTM, ResNet), including optimizer and loss settings.

Model | Kernel/Type | Optimizer | Loss Function | Epochs
ANN | Dense | Adam | sparse_categorical_crossentropy | 10
CNN | Convolutional | Adam | sparse_categorical_crossentropy | 10
LSTM | LSTM | Adam | sparse_categorical_crossentropy | 10
ResNet | Custom | Adam | CrossEntropyLoss | 25
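As a sketch of how the Keras-side rows of Table 3 (ANN/CNN/LSTM share the optimizer, loss, and epoch budget) assemble into a trainable model, the following shows the CNN. The 128 × 128 input resolution and 32-filter first convolution are illustrative assumptions, while the dense sizes follow Table 5:

```python
# CNN per Tables 3 and 5 (input size and filter count are assumptions).
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(128, 128, 1)),       # assumed grayscale MRI input
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(4, activation="softmax"),   # glioma/meningioma/pituitary/none
])
cnn.compile(optimizer="adam",                        # Table 3: Adam
            loss="sparse_categorical_crossentropy",  # Table 3 loss
            metrics=["accuracy"])
# cnn.fit(X_train, y_train, epochs=10)  # epoch budget per Table 3
```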
Table 4. Hyperparameter configurations for few-shot and meta-learning models (Prototypical Networks, Matching Networks, SimpleShot, MAML, Reptile).

Model | Type | Optimizer | Loss Function | Epochs
Prototypical Networks | Few-shot | Adam | Cross-entropy loss | 100
Matching Networks | Few-shot | Adam | Negative log-likelihood loss | 100
SimpleShot | Few-shot | N/A | N/A | N/A
MAML | Meta-learning | SGD, Adam | Sparse categorical cross-entropy | 10
Reptile | Meta-learning | Adam | Sparse categorical cross-entropy | 10
Table 5. Additional architectural details for all models (e.g., layer stacks, prototype computation, similarity metrics, meta-learning updates) used to ensure reproducibility.

Model | Additional Configurations
ANN | Layers: flatten, dense (128, ReLU), dense (4, softmax)
CNN | Layers: Conv2D, MaxPooling2D, flatten, dense, activation (ReLU/softmax)
LSTM | Input shape: (1, features); Layers: LSTM (128), dense (128, ReLU)
ResNet | Custom blocks, adaptive average pooling, modified fc layer for 4 classes
Prototypical Networks | Prototypes computed as mean of support features
Matching Networks | LSTM-based feature encoding, softmax over similarities
SimpleShot | Cosine similarity, feature normalization
MAML | Meta-learning with rapid adaptation
Reptile | Gradient updates with full dataset adaptation
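The “meta-learning with rapid adaptation” entry for MAML corresponds to its two-level update: an inner SGD step adapts a copy of the weights on a task's support set, and an outer Adam step optimizes the original weights against the adapted copy's query loss (matching the SGD + Adam pairing in Table 4). A minimal PyTorch sketch, assuming a toy fully connected backbone and random stand-in episode tensors rather than the paper's CNN backbone from Figure 3:

```python
# Minimal MAML sketch (illustrative backbone; not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128, 128),
                      nn.ReLU(), nn.Linear(128, 4))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # outer-loop Adam
inner_lr = 0.01                                            # inner-loop SGD step

def maml_step(support_x, support_y, query_x, query_y):
    params = dict(model.named_parameters())
    # Inner loop: one gradient step on the support set (create_graph=True
    # keeps the graph so the outer update can differentiate through it).
    support_loss = F.cross_entropy(
        functional_call(model, params, (support_x,)), support_y)
    grads = torch.autograd.grad(support_loss, params.values(),
                                create_graph=True)
    adapted = {name: p - inner_lr * g
               for (name, p), g in zip(params.items(), grads)}
    # Outer loop: the adapted weights' query loss updates the meta-weights.
    meta_loss = F.cross_entropy(
        functional_call(model, adapted, (query_x,)), query_y)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()

# One 4-way episode with random stand-in tensors:
sx, qx = torch.randn(8, 1, 128, 128), torch.randn(8, 1, 128, 128)
sy, qy = torch.randint(0, 4, (8,)), torch.randint(0, 4, (8,))
print(maml_step(sx, sy, qx, qy))
```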
Table 6. Confusion matrix definition.

  | Predicted Positive | Predicted Negative
Actual Positive | True positive (TP) | False negative (FN)
Actual Negative | False positive (FP) | True negative (TN)
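The headline metrics reported throughout derive from the Table 6 entries in the standard way (binary case shown; the multi-class results macro-average these per class):

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \qquad
\text{Recall}     = \frac{TP}{TP + FN}, \\
\text{F1}        &= \frac{2\,\text{Precision}\cdot\text{Recall}}
                         {\text{Precision} + \text{Recall}}, \\
\text{MCC}       &= \frac{TP \cdot TN - FP \cdot FN}
  {\sqrt{(TP{+}FP)(TP{+}FN)(TN{+}FP)(TN{+}FN)}}.
\end{align}
```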
Table 7. Computational complexity of all models—trainable parameters and floating-point operations (FLOPs) per inference/episode—supporting deployment feasibility analyses.

Model | Trainable Parameters | FLOPs (per Inference/Episode)
CNN | ∼2.3 M | 1.8 GFLOPs
LSTM | ∼3.1 M | 2.2 GFLOPs
ResNet | ∼11.2 M | 4.5 GFLOPs
MAML-lite | ∼2.6 M | 2.0 GFLOPs
MatchingNet | ∼2.9 M | 2.1 GFLOPs
ProtoNet | ∼2.5 M | 1.9 GFLOPs
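Parameter counts like those in Table 7 can be verified by summing trainable tensors; the FLOPs column requires a graph profiler (such as fvcore's FlopCountAnalysis) and is not recomputed here. A sketch, assuming torchvision's ResNet-18 with four output classes as a stand-in for the custom ResNet; the resulting count lands near the table's ∼11.2 M:

```python
# Count trainable parameters; FLOPs would need a profiler and are omitted.
import torch.nn as nn
from torchvision.models import resnet18

def trainable_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

resnet = resnet18(num_classes=4)  # assumption: stand-in for the custom ResNet
print(f"ResNet-18, 4 classes: {trainable_params(resnet) / 1e6:.1f} M parameters")
```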
Table 8. Class distribution in the Kaggle Brain Tumor MRI Dataset (n = 7023): glioma, meningioma, pituitary tumor, and no tumor; moderate imbalance motivates balanced metrics and augmentation.

Class | Number of Images
Glioma | 2263
Meningioma | 1822
Pituitary tumor | 2026
No tumor | 912
Total | 7023
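One simple way to act on this distribution (an illustrative remedy; the study itself relies on balanced metrics and augmentation) is inverse-frequency class weighting, derivable directly from the counts above:

```python
# Inverse-frequency class weights from the Table 8 counts (illustrative).
counts = {"glioma": 2263, "meningioma": 1822, "pituitary": 2026, "no_tumor": 912}
total = sum(counts.values())  # 7023
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print(weights)  # "no_tumor" gets the largest weight (~1.93), glioma the smallest
```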
Table 9. AUC-ROC scores for few-shot learning models (MAML, Prototypical Network, Reptile, MatchingNet, SimpleShot) across dataset sizes (100% → 10%), highlighting robustness under label scarcity.

Few-Shot Learning Model | 100% Data | 75% Data | 50% Data | 25% Data | 10% Data
MAML | 0.9969 | 0.9919 | 0.9868 | 0.9791 | 0.9595
Prototypical Network | 0.8780 | 0.8668 | 0.8779 | 0.8986 | 0.8471
Reptile Model | 0.9600 | 0.9700 | 0.9600 | 0.9500 | 0.9200
Matching Network | 0.9057 | 0.6889 | 0.7115 | 0.7544 | 0.9112
SimpleShot | 0.7500 | 0.7600 | 0.7700 | 0.7400 | 0.7200
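Multi-class AUC-ROC values such as those in Table 9 are typically computed as macro-averaged one-vs-rest scores over predicted class probabilities; a sketch with random stand-in predictions, assuming scikit-learn's roc_auc_score (the averaging scheme is an assumption, not stated in the table):

```python
# Macro one-vs-rest AUC-ROC from class probabilities (stand-in data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=200)          # stand-in labels, 4 classes
y_prob = rng.dirichlet(np.ones(4), size=200)   # rows sum to 1, as required

auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"macro OvR AUC-ROC: {auc:.4f}")
```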
Table 10. Key determinants of few-shot model performance in medical imaging, including architecture, feature extraction, episodic training, distance metrics, augmentation, and task complexity.

Parameter | Description | Impact
Model architecture | Affects model’s adaptability. | Enables rapid adaptation.
Feature extraction | Quality of feature extraction. | Major determinant of performance.
Extraction variations | Differences in feature quality. | Causes varied model performance.
Episodic training | Training for small data learning. | Crucial for performance improvement.
Distance metrics | Metric for measuring distance. | Affects discrimination ability.
Quick adaptation | MAML’s adaptability to tasks. | Leads to superior performance.
Data augmentation | Techniques to augment data. | Enhances generalization.
Support set handling | Management of support set. | Influences performance significantly.
Feature normalization | Normalizing features. | Improves generalization.
Model complexity | Balance of complexity and generalization. | Critical for optimal performance.
Task complexity | Difficulty of MRI tumor classification. | Requires robust differentiation.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
