Brain Tumor Classification Using DINO Features and Lightweight Classifiers

Missaoui, Rim; Del Coco, Marco; Saadaoui, Wajdi; Hechkel, Wided; Helali, Abdelhamid; Carcagnì, Pierluigi; Leo, Marco

doi:10.3390/electronics15050952


by Rim Missaoui ¹,², Marco Del Coco ³, Wajdi Saadaoui ⁴, Wided Hechkel ², Abdelhamid Helali ², Pierluigi Carcagnì ³ and Marco Leo ³,*

¹ National High School of Engineering of Tunis (ENSIT), University of Tunis, 5 Rue Taha Hussein–Montfleury, Tunis 1008, Tunisia
² Laboratory of Micro-Optoelectronics and Nanostructures (LMON), University of Monastir, Avenue of the Environment, Monastir 5019, Tunisia
³ Institute of Applied Sciences and Intelligent Systems, National Research Council of Italy (CNR), 73100 Lecce, Italy
⁴ Centre for Studies and Research in Medical Informatics (CERIM), Faculty of Medicine of Lille, University of Lille, 59000 Lille, France
* Author to whom correspondence should be addressed.

Electronics 2026, 15(5), 952; https://doi.org/10.3390/electronics15050952

Submission received: 10 January 2026 / Revised: 11 February 2026 / Accepted: 24 February 2026 / Published: 26 February 2026

(This article belongs to the Special Issue Assistive Technology: Advances, Applications and Challenges)


Abstract

The accurate detection and classification of brain tumors from magnetic resonance imaging (MRI) are critical for diagnosis and treatment planning. While deep learning has shown remarkable success in this domain, many state-of-the-art models rely on complex, end-to-end convolutional neural networks (CNNs) that require extensive computational resources and large, annotated datasets for training. This work proposes a novel and efficient methodology that, for the first time, leverages self-supervised DINO vision transformer backbones (DINO v1, DINOv2, and DINOv3), pretrained on a large corpus of natural images, as powerful feature extractors for brain tumor analysis. We utilize the rich, general-purpose features from DINO-family backbones without fine-tuning the core model. These extracted features are then fed into a simpler, task-specific classifier (such as a support vector machine or a multi-layer perceptron) for the final detection and multi-class classification (e.g., glioma, meningioma, and pituitary tumor). Our methodology is evaluated on two benchmark medical imaging datasets with different classification granularities. The results demonstrate that the proposed method achieves competitive and, in some cases, superior classification accuracy compared to representative fine-tuned convolutional neural networks and attention-based architectures, while significantly reducing the number of trainable parameters and training time. In particular, the best configuration achieves up to 98.17% accuracy and an F1-score of 98.18% on the 15-class dataset and 99.08% accuracy and an F1-score of 99.02% on the 4-class dataset. This study confirms the exceptional transfer learning capabilities of self-supervised vision transformers like DINO in the medical imaging domain, establishing them as highly effective and efficient backbones for robust brain tumor detection and classification systems.

Keywords:

self-supervised learning; machine learning; brain tumor classification; magnetic resonance imaging (MRI); feature extraction

1. Introduction

Brain tumors are a major challenge in contemporary medicine, affecting the survival and quality of life of patients across a wide range of demographics [1]. These abnormal cell growths can be broadly classified as either benign or malignant; the latter is characterized by a high rate of recurrence, invasive characteristics, and therapy resistance. Because early intervention significantly improves clinical outcomes and survival rates, accurate and timely identification is crucial. Owing to its superior soft-tissue contrast and non-invasive nature, magnetic resonance imaging (MRI) is the gold standard for diagnosing brain tumors [2]. However, manually interpreting MRI scans requires radiologist expertise, is time-consuming, and is subject to inter-observer variability. This has led to the emergence of automated methods for brain tumor identification, localization, and classification [3,4].

The clinical diagnosis of a brain tumor is generally carried out using standard MRI sequences such as T1-weighted, T2-weighted, and Fluid-Attenuated Inversion Recovery (FLAIR) scans, which provide fine anatomical detail and are widely adopted in both routine diagnostics and research datasets [5]. These are supplemented by advanced MRI modalities such as diffusion-weighted imaging (DWI), perfusion-weighted imaging (PWI), and magnetic resonance spectroscopy (MRS) to obtain functional and physiological attributes of tumor tissues, such as vascularity, metabolic signatures, and cellularity [6]. While these modalities together improve diagnostic precision and therapy planning, their interpretation is intricate and depends heavily on radiologist experience, reinforcing the need for automated systems that interpret multimodal MRI data efficiently and consistently.

Traditional machine learning (ML) techniques, including support vector machines (SVMs), Random Forests, and k-Nearest Neighbors, were employed in the early stages of brain tumor classification [7,8]. Such methods use manually designed features based on intensity and textures extracted from the MRI images. The effectiveness of those methods and their capabilities were limited by poor feature design and a lack of generalizability. However, the rise of deep learning, especially the convolutional neural network, brought a revolution in tumor classification by offering the capability to automatically learn features [9,10,11].

In medical imaging applications, convolutional neural networks (CNNs) have some noticeable limitations despite proven success [12]. These algorithms are computationally intensive and require huge amounts of data; additionally, manual annotation is both expensive and requires high-level expertise. The class imbalance that often occurs in medical datasets, in which certain types of cancer may be rare, further complicates the generalizability of CNN-based models. Additionally, CNNs are often designed as end-to-end architectures, which can be computationally demanding and prone to overfitting in scenarios with limited data. The lack of interpretability represents an issue in clinical applications, where transparency and reliability are crucial for decision assistance. Data-efficient, flexible, and comprehensible methods that incorporate the advantages of feature-based and learning-based paradigms are necessary to address these shortcomings.

This research addresses these issues by using state-of-the-art self-supervised vision transformers (ViTs) from the DINO (self-DIstillation with NO labels) family as feature extractors to classify brain tumors. Rather than analyzing a single variant, we evaluate three generations of DINO representations, DINO (v1) [13], DINOv2 [14], and DINOv3 [15], under the same experimental conditions. DINO models are trained through self-supervised learning on large-scale natural-image datasets, which yields dense, high-quality features that are robust and transferable across many domains. We use the DINO backbones as static feature extractors, with no fine-tuning, and explore two classifiers trained on the extracted features. Specifically, we consider the support vector machine (SVM) as a strong baseline and compare it with a multi-layer perceptron (MLP) classifier. This hybrid system reduces computational expense and the risk of overfitting while combining the advantages of self-supervised feature learning and classical classification methods.

The main contributions of this work are as follows:

We propose DINO-based models as feature extractors, utilizing frozen self-supervised representations and multi-level feature fusion from the final and penultimate transformer layers to enhance classification performance.
We methodically assess the proposed framework using SVM and MLP classifiers on two brain tumor MRI datasets of varying complexity: a 15-class dataset [16] and a 4-class dataset [17].
We perform experiments with lightweight classifiers such as SVM and MLP, obtaining competitive or higher performance compared to current state-of-the-art techniques while decreasing computational complexity.

2. Related Work

2.1. Traditional and Deep Learning Approaches for Brain Tumor Classification

Traditional methods often rely on handcrafted features extracted from images, such as texture, intensity, and shape descriptors, which are then fed into classical classifiers. While these methods can achieve reasonable accuracy, they are limited by the quality of the handcrafted features and may not generalize well to diverse datasets. Padmavathy et al. (2024) [18] proposed a classical framework combining Gray-Level Co-occurrence Matrix (GLCM) texture features with an RBF-kernel SVM for benign and malignant tumor discrimination. After preprocessing and region-of-interest extraction, they computed GLCM descriptors and achieved 95.56% accuracy, outperforming wavelet-based feature baselines. MM Ghazvini et al. (2024) [19] integrated morphological segmentation with discrete wavelet transform (DWT) features, followed by PCA reduction and SVM classification. This model achieved an accuracy of 95% and a precision of 88% on a dataset of MRI images, highlighting the feature-driven SVM pipeline’s significance for clinical decision assistance. MJ Adamu et al. (2024) [20] proposed a MobileNetV2–SVM model for four-class tumor classification. They compacted high-level embeddings from a 7023-image MRI dataset using MobileNetV2 and replaced the dense head with an SVM to improve the non-linear separation, achieving very high AUC values, which highlight both accuracy and efficiency for resource-constrained settings. SM Alqhtani et al. (2024) [21] combined Wiener filtering, fuzzy C-means segmentation, and SVM classification to identify meningioma, glioma, and pituitary tumors. The method showed 98.2% accuracy and a 96.1% Dice score, indicating strong segmentation–classification synergy, when evaluated on the CE-MRI dataset. Aggarwal (2022) [22] proposed a handcrafted texture-based approach for binary brain tumor classification from T1-weighted MRI images. 
The method extracts second-order statistical descriptors from Gray-Level Co-occurrence Matrices (GLCMs) and feeds them into a Random Forest classifier while analyzing different GLCM parameter settings. Evaluated on 245 MR images, the optimized configuration achieved 83.3% accuracy on the test set, indicating that well-tuned GLCM features provide a computationally efficient baseline for tumor and non-tumor discrimination.

More recent hybrids replace handcrafted descriptors with transformer or CNN embeddings and then apply classical classifiers to improve robustness. H Allahem et al. (2025) [23] proposed a ViT-PCA-RF pipeline in which vision transformer features were compressed via PCA and classified with Random Forest. On the BTM dataset, the approach reached 99% accuracy with balanced sensitivity and specificity, demonstrating the effectiveness of pairing transformer representations with lightweight ML heads. AA Abdulla (2025) [24] similarly built an automated computer-aided diagnosis system using Wiener denoising, HOG features, PCA reduction, and Bayesian-optimized kNN/SVM. Tested on the public Figshare MRI dataset, the model achieved a high accuracy of 99.2% with the optimized kNN classifier, surpassing existing state-of-the-art methods. Tiwary et al. (2025) [25] proposed a hybrid framework that automatically extracts features using a custom CNN and feeds them into multiple machine learning classifiers, with Random Forest performing best against SVM, kNN, Decision Tree, and Naïve Bayes. Experiments on the Kaggle Brain Tumor MRI Dataset reported 99.61% training accuracy, 92.16% validation accuracy, and 71.2% accuracy on a held-out CSV testing split, demonstrating that CNN feature fusion with Random Forest can be effective, though generalization drops on the final test set.

Deep learning approaches, particularly CNNs, have revolutionized the field by automatically learning relevant features from raw MRI scans. Early supervised CNN pipelines showed that stacked convolution–pooling hierarchies can capture tumor-specific appearance cues and outperform many handcrafted baselines when trained end-to-end. Nurtay et al. (2025) [26] presented a comparative deep CNN study on the four-class Kaggle MRI benchmark, evaluating a custom CNN against common transfer learning backbones (ResNet50, VGG-16, and Xception). Their separable-convolution custom CNN achieved about 93–94% accuracy and the best ROC-AUC, exceeding the pretrained alternatives, demonstrating that task-specific CNN design can rival heavier ImageNet-initialized models on brain MRI classification tasks. Building on this direction, customized CNN architectures have been proposed to better match the structural characteristics of tumor MRI. Albalawi et al. (2024) [27] introduced four progressively refined CNN variants trained on multiple public MRI datasets and reported that their best task-tailored architecture reached 99.76% test accuracy, outperforming standard transfer learning baselines.

Because training large CNNs from scratch can be computationally expensive and label-hungry, Alemayehu (2025) [28] presented a compact CNN optimized with Keras-Tuner hyperparameter search and contour-based cropping to suppress background noise. Evaluated via five-fold cross-validation on a four-class 7023-image public MRI set, the model achieved 98.78% test accuracy while remaining parameter-efficient, supporting the practicality of low-complexity CNNs in resource-constrained clinical settings. Ilgün et al. (2025) [11] conducted experiments with various fine-tuned convolution neural network backbones using a combined public brain tumor MRI dataset (Figshare, SARTAJ, and Br35H). Their approach demonstrated the efficacy of CNN-based transfer learning for brain tumor classification, achieving up to 98.47% test accuracy with ResNet50. Prayogo et al. (2025) [29] proposed a hybrid CNN-based transfer learning framework that fine-tunes multiple lightweight pretrained backbones and fuses their deep embeddings. Their best hybrid (ResNet50V2 + MobileNetV2 + DenseNet121) reached 98.75% accuracy with equally strong precision/recall, outperforming single-backbone variants and demonstrating that feature fusion across pretrained CNNs provides richer tumor representations than any single model alone.

2.2. Self-Supervised Learning (SSL) and Vision Transformers (ViTs)

Self-supervised learning enables effective representation learning without the need for labeled data. Based on this approach, DINOv3 provides transferable and generalizable representations that can be efficiently applied in medical image applications with a limited number of annotations. Mughal et al. (2024) [30] examined the current self-supervised learning methods that can be applied in the medical imaging field and showed that pretraining through contrastive, clustering, and reconstruction objectives can significantly improve the performance of tumor classification when the labeled MRI data are limited. They also discussed the ability of SSL to mitigate domain shift and enhance model robustness across imaging protocols.

Beyond general SSL, several works combine SSL with transformer-based backbones to model MRI more effectively. Karagoz et al. (2024) [31] introduced ResViT, a hybrid residual CNN–ViT model pretrained through a generative SSL objective before fine-tuning on tumor classification, demonstrating notable gains over ImageNet initialization by achieving accuracies of 90.6% and 98.5%, respectively, on the BraTS 2023 and Figshare datasets. Rudro et al. (2025) [32] introduced an SSL method for brain tumor segmentation and classification using SimCLR and an EfficientNetB3 backbone. They used SSL-based model pretraining on extensive unlabeled datasets to acquire significant feature representations before executing supervised fine-tuning with a superior classifier head. The proposed model achieved 98.32% test accuracy on a four-class Kaggle dataset. In a similar direction, Nunes et al. (2025) [33] employed masked autoencoding pretraining for a 3D ViT on unlabeled brain MRI volumes and reported improved F1 scores of 91% under five-fold cross-validation after fine-tuning on BraTS tumor classes, particularly in low-label settings. Safwan et al. (2025) [34] further confirmed the benefit of contrastive SSL by introducing T3SSLNet, a triple-strategy SSL framework for MRI tumor classification, which evaluates SimCLR, MoCo, and BYOL with a ResNet-50 backbone, yielding accuracies around 96–97% after fine-tuning, confirming the value of contrastive SSL when labeled tumor MRI data is limited.

Parallel to SSL advances, vision transformers have also been improved directly for brain tumor classification. Khaniki et al. (2024) [35] proposed a ViT enhanced with selective cross-attention and feature calibration to fuse multi-scale tumor cues, achieving high accuracies of 98.9% and 99.2% with stochastic depth on public brain MRI benchmarks. Wang et al. (2024) [36] presented RanMerFormer, which introduces randomized token merging to reduce redundancy and computational cost while maintaining competitive classification performance with an accuracy of 98.86%.

2.3. Attention Mechanisms and Saliency Mapping

Attention mechanisms have been incorporated into deep learning models to improve feature extraction by focusing on relevant regions of the image. For instance, Zarenia et al. (2025) [2] proposed a deformable attention module for brain tumor classification and segmentation, which captures irregular and complex tumor patterns. Their model achieved around 96.6% multi-class accuracy on a 15-class MRI dataset, outperforming conventional CNN and ViT baselines. Saliency mapping is another technique, used to visualize and interpret the regions of the image that contribute most to the model’s decision, enhancing transparency and trust in automated systems.

More recent deep models incorporate explicit attention and saliency mechanisms to sharpen tumor-focused representations and improve interpretability. Masoudi et al. (2024) [37] proposed an optimized dual-attention network that uses a ResNet50 backbone followed by a depth-separable channel-attention module and a multi-head spatial-attention block to refine discriminative MRI features, achieving 99.32% accuracy and consistently high per-class performance using the Figshare dataset, indicating that joint channel–spatial attention can substantially boost multi-class tumor recognition. Srivastava et al. (2025) [38] similarly leveraged transformer attention but in a multi-scale relational setting, introducing an Automated Classification and Grading Diagnosis Model (ACGDM) that combines a Multi-Scale Graph Neural Network with a Spatio-Temporal Transformer Attention Mechanism (MSGNN-STTAM) to capture hierarchical spatial dependencies and cross-frame MRI evolution. Tested on BraTS 2018/2019/2020 and Br35H multimodal MRI datasets, the model reported up to 99.8% accuracy for tumor type detection, highlighting the benefit of attention-guided graph reasoning for robust grading and classification.

Tomar et al. (2024) [39] proposed a visual attention-based detection pipeline that builds an on-center saliency map to highlight tumor-relevant regions and then applies superpixel segmentation to preserve boundary structure before extracting the final lesion mask. Their model achieved 99.63% accuracy with strong Jaccard and Dice overlap scores, outperforming prior detection baselines. MA Khan et al. (2023) [40] advanced saliency usage further by introducing an automated multimodal framework that first enhances tumor visibility using deep saliency maps, then fuses deep features, and finally selects an optimal feature subset via an improved dragonfly optimization strategy before classification. Evaluated on three available BraTS datasets, the model achieved accuracies of 95.14%, 94.89%, and 95.94%, respectively. R Khan et al. (2025) [41] proposed X-SCSANet, an explainable stack convolutional self-attention network that improves both discrimination and interpretability by stacking the outputs of parallel CNN and self-attention branches and applying a customized Grad-CAM procedure. Evaluated on a four-class Kaggle MRI dataset, the model achieved 96.44% accuracy with 96.5% precision and 98.83% specificity, while producing saliency heatmaps that highlight tumor-relevant regions to justify predictions.

Beyond performance gains, saliency mapping is increasingly positioned as an explainability tool for validating deep predictors. Keles et al. (2023) [42] presented a focused case study showing how post hoc gradient-based saliency maps on brain MRI reveal that classifiers rely primarily on the tumor core and its surrounding context, especially shape-related cues. Their analysis argues that such visual explanations are essential for identifying model biases and improving trustworthiness in clinical decision support.

3. Methodology

Figure 1 presents an overview of the proposed brain tumor classification pipeline. The framework combines dataset-specific data augmentation, a frozen DINO backbone (DINO v1/v2/v3) for extracting, transforming, and combining features, and lightweight classifiers for the final tumor classification.

3.1. Datasets

This study analyzes the ability of DINOv3 vision transformers to serve as fixed feature extractors for recognizing brain tumors in two distinct experimental configurations: a 15-class classification task and a 4-class classification task. The two datasets differ in label granularity, class distribution, and data organization, which allows the robustness and generalization of the proposed solution to be analyzed at different levels of classification complexity.

3.1.1. Dataset A: 15-Class Brain Tumor MRI Dataset

The primary experimental evaluation was carried out with the publicly accessible Brain Tumor for 14 Classes dataset available on Kaggle [16]. The dataset consists of human brain MRI scans that contain 15 diagnostic classes, which include 14 tumor subtypes and one normal class (non-tumor). Due to the strong visual similarity among the different types of tumors and the highly imbalanced distribution, this dataset represents a challenging benchmark for multi-class brain tumor classification. All images underwent an initial validation process to identify and remove corrupted or unreadable files. After this filtering step, a total of 4456 MRI scans were retained for analysis. The detailed class-wise distribution is reported in Table 1.

The dataset includes the following classes: astrocytoma (AST), carcinoma (CAR), ependymoma (EPE), ganglioglioma (GAN), germinoma (GER), glioblastoma (GLI), granuloma (GRA), medulloblastoma (MED), meningioma (MEN), neurocytoma (NEU), oligodendroglioma (OLI), papilloma (PAP), schwannoma (SCH), tuberculoma (TUB), and normal brain tissue (NOR). The tumor classes exhibit highly similar morphological characteristics, making the dataset a challenging benchmark for multi-class brain tumor classification tasks.

To guarantee reproducibility and preserve the original class distribution, the dataset was split by stratified random sampling with a fixed random seed (random state = 42). A 20% hold-out test split was used, resulting in 3565 training images and 891 test images. As this public dataset lacks patient identifiers and subject-level metadata, the split had to be done at the image level rather than the patient level. A notable characteristic of this dataset is its high level of class imbalance, as a few tumor classes are restricted to a small number of samples (e.g., ganglioglioma, granuloma, and germinoma). Furthermore, several tumor classes share similar properties in MRI images, making inter-class discrimination more complicated. These challenges make it necessary to apply effective data balancing and representation learning methods.
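The stratified, seeded image-level split described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' code: the function name `stratified_split`, the per-class rounding rule, and the toy label counts are assumptions. The "random state = 42" wording suggests a scikit-learn style split, but the text does not say which tool was used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class proportions in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Assumption: round per class and keep at least one test sample for rare classes.
        n_test = max(1, round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, imbalanced toy labels (not the real class counts of Dataset A).
labels = ["MEN"] * 50 + ["GLI"] * 30 + ["GAN"] * 10
train_idx, test_idx = stratified_split(labels)
```

Because the split operates on image indices only, it cannot prevent slices from the same patient from appearing in both parts, which matches the limitation noted in the text.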

3.1.2. Dataset B: 4-Class Brain Tumor MRI Dataset

A second dataset, the Brain Tumor MRI Dataset [17], was used to evaluate the suggested framework under a lower classification complexity. This dataset comprises 7023 brain MRI images, categorized into four clinically relevant groups: glioma, meningioma, pituitary tumor, and no tumor. This dataset is a composite derived from three publicly accessible sources: Figshare, SARTAJ, and Br35H datasets. The Br35H dataset was selected as the source of images related to the no-tumor category only. Due to the inconsistencies found in the glioma class labels of the SARTAJ dataset, a problem previously noted in earlier research, these samples were omitted and substituted with glioma images sourced from Figshare to improve label reliability. The detailed class-wise distribution is reported in Table 2.

In contrast to Dataset A, this dataset is available with a predefined train–test split, separated into training and testing directories, with a total of 5712 images for training and 1311 for testing. Since patient identifiers are not provided in this dataset and it is distributed as an image-level split, patient-level separation cannot be enforced. Although the overall class distribution is more balanced than in the 15-class dataset, a slight imbalance remains, mainly caused by the higher proportion of no-tumor samples. Moreover, the dataset exhibits variability in image resolution and spatial dimensions, reflecting differences in the acquisition protocols of the source datasets. During preprocessing, all images were uniformly transformed to a standard resolution, and all unnecessary margins were eliminated.

3.2. Data Preparation and Preprocessing

To ensure the quality, consistency, and reproducibility of the data for all experiments, a common preprocessing procedure was applied to both datasets before feature extraction. Preprocessing was intended to suppress uninformative background regions, normalize spatial resolution, and provide standardized inputs to the DINOv3 Small, Base, and Large vision transformer backbones used as feature extractors. Because the datasets differ in size, degree of class imbalance, and directory organization, dataset-specific augmentation and balancing methods were implemented. All preprocessing operations were done in Python (version 3.12.12) using the PIL, OpenCV, and NumPy libraries, while feature extraction was done using the Hugging Face Transformers library. The same processing was applied to all experiments to ensure fair comparison among different models and datasets.

3.2.1. Brain Region Cropping

Background and border regions in brain MR images are usually relatively large; they do not contribute to the description of tumors but can introduce undesired variability when extracting features from image data. To address this problem, a dedicated brain-region cropping step was applied to each image before resizing. Each image was first converted to grayscale and smoothed with a Gaussian filter to suppress high-frequency noise. A binary threshold was then applied, followed by morphological operations (erosion and dilation) to remove small artifacts. The largest external contour was then extracted and assumed to be the brain region. The extreme contour coordinates (left, right, top, and bottom) were used to perform cropping. If no valid contour was found, the original image was left unmodified to prevent unintended data loss. Since this process reduces variability caused by the background and highlights meaningful brain structure information in images, it improves the stability of the feature extraction process.

3.2.2. Image Resizing and Normalization

After brain-region cropping, all images were resized to a uniform spatial resolution of $224 \times 224$ pixels to match the input dimension required by the DINOv3 model configurations used. The Lanczos method was chosen for interpolation because it retains fine-grained anatomical detail and texture patterns, which play an important role in the discrimination of tumors. After resizing, the images were normalized with the ImageNet statistics defined within the DINOv3 preprocessing pipeline. Although the DINOv3 models are trained in a self-supervised manner, this normalization guarantees consistency with the input distribution expected by the pretrained backbone.
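The resizing and normalization step can be sketched with PIL and NumPy as follows. This is a sketch under stated assumptions: the function name `to_model_input` is hypothetical, the mean and standard deviation are the standard ImageNet values, and in practice the image processor shipped with the pretrained backbone may apply its own resizing and normalization.

```python
import numpy as np
from PIL import Image

# Standard ImageNet channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Pillow >= 9.1 exposes the filter as Image.Resampling.LANCZOS; older versions as Image.LANCZOS.
LANCZOS = getattr(Image, "Resampling", Image).LANCZOS


def to_model_input(img, size=224):
    """Resize with Lanczos interpolation and normalize to a (3, size, size) float32 array."""
    rgb = img.convert("RGB").resize((size, size), resample=LANCZOS)
    arr = np.asarray(rgb, dtype=np.float32) / 255.0  # scale to [0, 1]
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD       # per-channel normalization
    return arr.transpose(2, 0, 1)                    # HWC -> CHW


# Synthetic grayscale image standing in for a cropped MRI slice.
x = to_model_input(Image.new("L", (300, 180), color=128))
```

Grayscale slices are replicated to three channels by `convert("RGB")`, since the backbones are pretrained on RGB natural images.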

3.2.3. Data Augmentation and Class Balancing

Due to the significant disparities in class imbalance severity between the two datasets, different augmentation and balancing procedures were employed.

Dataset A: GAN-Based Class Balancing

Dataset A is a fine-grained 15-class classification task characterized by high inter-class visual similarity and severe class imbalance. In particular, several tumor categories have fewer than 100 samples. This imbalance can introduce substantial bias in supervised learning and degrade performance on underrepresented classes. To mitigate this issue and to promote balanced representation learning, a class-specific generative data augmentation strategy based on Generative Adversarial Networks (GANs) [43] was employed. Specifically, a Deep Convolutional Generative Adversarial Network (DCGAN) was trained independently for each tumor class, using only the corresponding training samples from that class. Augmentation was performed after splitting the dataset, and the DCGANs trained on the training split were used to generate synthetic images for the training and test subsets independently. The training set was balanced to 800 images per class (12,000 in total), and the test set was balanced to 200 images per class (3000 in total). We note that augmenting the test set changes the evaluation distribution relative to the original real-image test split; therefore, test performance should be interpreted under this balanced evaluation setting. This class-wise training strategy enables each generator to model class-specific anatomical and textural characteristics present in brain MRI scans. All GANs were trained at a fixed spatial resolution of $224 \times 224$ pixels, as illustrated in Figure 2, consistent with the input resolution used in the downstream feature extraction pipeline.

Formally, the DCGAN training process follows the standard adversarial minimax objective:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \tag{1}$$

Here, G represents the generator network, D the discriminator network, x a real MRI image drawn from the empirical data distribution $p_{data}$, and z a latent noise vector sampled from a prior distribution $p_{z}$. In this adversarial setting, the generator is trained to produce realistic MRI images that belong to the target category, and the discriminator is trained to differentiate between real and synthetic images.
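As a brief worked note on Equation (1), the standard analysis of the minimax objective in the original GAN formulation [43] shows what "convergence" means in the ideal case. The symbol $p_g$, denoting the distribution of generated images $G(z)$ with $z \sim p_z$, is introduced here for illustration and does not appear in the text above; the result holds for an idealized discriminator of unlimited capacity and is not a claim about the trained DCGANs.

```latex
% Assumption: p_g is the distribution of G(z) for z ~ p_z (notation added for this note).
\begin{align*}
V(D, G) &= \int_x \Big[ p_{data}(x)\,\log D(x) + p_g(x)\,\log\big(1 - D(x)\big) \Big]\,dx, \\
D^{*}(x) &= \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}
  \quad \text{(pointwise maximizer for a fixed } G\text{)}, \\
p_g = p_{data} &\;\Rightarrow\; D^{*}(x) = \tfrac{1}{2}, \qquad
V(D^{*}, G) = \log\tfrac{1}{2} + \log\tfrac{1}{2} = -\log 4 .
\end{align*}
```

In other words, when the generator matches the class-specific data distribution, the best possible discriminator can do no better than guessing, which is the equilibrium the adversarial training aims for.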

Following convergence, the trained generators were used to create additional samples so that the training set was balanced to 800 images per class, for a total of 12,000 training images. Classes with relatively larger original sample sizes required fewer synthetic samples, whereas severely underrepresented classes, such as ganglioglioma and granuloma, relied more heavily on GAN-generated images to achieve class parity. To achieve consistency in the experiments and a controlled assessment, the test set was likewise balanced to a fixed 200 samples per class, for a total of 3000 test images. Notably, the GANs were never trained on test data. Rather, the synthetic test samples were created using only the class-specific generators trained on the training set. This helps reduce information leakage while allowing a fair and balanced assessment across all tumor classes, thus preventing classification performance from being biased by differences in class distribution.
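The balancing arithmetic can be made concrete with a short sketch. The per-class counts below are hypothetical placeholders, not the values from Table 1, and the helper name `synthetic_needed` is an assumption; only the targets of 800 training and 200 test images per class and the 15 classes come from the text.

```python
def synthetic_needed(real_counts, target_per_class):
    """Number of GAN-generated images required to bring each class up to the target."""
    return {name: max(0, target_per_class - n) for name, n in real_counts.items()}


# Hypothetical real training counts for three of the 15 classes (illustrative only).
real_train = {"MEN": 700, "GLI": 420, "GAN": 48}
to_generate = synthetic_needed(real_train, target_per_class=800)

NUM_CLASSES = 15
balanced_train_total = NUM_CLASSES * 800  # 12,000 training images
balanced_test_total = NUM_CLASSES * 200   # 3000 test images
```

The sketch makes the dependence on synthetic data explicit: under these placeholder counts, a rare class would consist almost entirely of generated images, which is why the balanced evaluation setting should be kept in mind when reading the results.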

Dataset B: Lightweight Training Augmentation

In contrast, Dataset B represents a coarse-grained classification scenario with four classes that have relatively sharp boundaries between them and minimal class imbalance. This setting permits lightweight data augmentation without the risk of over-regularization. The dataset comes with a predefined train/test split; therefore, data augmentation is applied only to the training set to improve sample diversity without altering the test distribution. For each training image x, two additional augmented variants were generated through intensity and geometric transformations designed to maintain anatomical plausibility. Specifically, the applied augmentations consisted of:

Random brightness and contrast adjustments to simulate inter-scanner and protocol-related intensity variations.
Discrete rotations of $90^{\circ}$, $180^{\circ}$, or $270^{\circ}$, which preserve the structural consistency of brain anatomy in MRI scans.

Formally, the augmentation process is defined as

$$A(x) = \{x, T_{cb}(x), T_{rot}(x)\} \qquad (2)$$

where $T_{cb}(\cdot)$ applies contrast scaling with a factor uniformly sampled from $[0.8, 1.2]$, followed by brightness adjustment with a factor uniformly sampled from $[0.9, 1.1]$, and $T_{rot}(\cdot)$ applies a rotation randomly selected from $\{90^{\circ}, 180^{\circ}, 270^{\circ}\}$. This strategy yields a threefold expansion of the training set, while the test set remains unmodified to reflect realistic inference conditions.
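The augmentation set in Equation (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the text specifies only the factor ranges, so scaling contrast around the mean intensity and applying brightness multiplicatively are assumptions, and the function names are hypothetical.

```python
import numpy as np

def contrast_brightness(img, rng):
    """T_cb: contrast scaling followed by brightness adjustment (image in [0, 1])."""
    c = rng.uniform(0.8, 1.2)            # contrast factor
    b = rng.uniform(0.9, 1.1)            # brightness factor
    mean = img.mean()
    out = (img - mean) * c + mean        # assumption: contrast scaled around the mean
    out = out * b                        # assumption: multiplicative brightness
    return np.clip(out, 0.0, 1.0)

def rotate(img, rng):
    """T_rot: rotation by 90, 180, or 270 degrees, chosen at random."""
    k = int(rng.integers(1, 4))          # 1, 2, or 3 quarter turns
    return np.rot90(img, k=k, axes=(0, 1)).copy()

def augment(img, rng):
    """A(x) = {x, T_cb(x), T_rot(x)}: the original plus two augmented variants."""
    return [img, contrast_brightness(img, rng), rotate(img, rng)]

rng = np.random.default_rng(0)
x = rng.random((224, 224))               # stand-in for a normalized MRI slice
views = augment(x, rng)
```

Because rotations are restricted to multiples of 90 degrees, a square image keeps its shape and no interpolation is needed.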

3.3. DINO Feature Extraction

An overview of the extraction pipeline is shown in Figure 3. Feature extraction was performed using DINO-based self-supervised vision transformer (ViT) backbones (DINO (v1), DINOv2, and DINOv3) introduced by Meta AI for learning transferable visual representations at scale [13,14,15]. These self-supervised ViT models are pretrained on large-scale natural-image datasets, producing high-quality dense features suitable for various vision tasks. In this work, all DINO variants were used strictly as frozen feature extractors, meaning that no fine-tuning was performed. This design choice enables robust representation learning while reducing computational cost and mitigating overfitting risks, which are particularly important in medical imaging applications. It is important to mention that backbone availability varies among DINO generations, where DINOv2 and DINOv3 include Small, Base, and Large ViT backbones, while DINO (v1) is publicly accessible only in Small and Base configurations. Accordingly, we evaluated multiple backbone capacities (Small/Base/Large when available), loaded using the Hugging Face Transformers API and executed in evaluation mode with gradients disabled. Images were processed in mini-batches of size eight.

Before feature extraction, each MRI image was converted to RGB format at a spatial resolution of $224 \times 224$ pixels. Each image, whether original or augmented, was then processed with the Hugging Face AutoImageProcessor configured for the corresponding DINO model variant. The processor applies uniform preprocessing, including tensor conversion and channel-wise normalization using the ImageNet statistics (mean $\mu = [0.485, 0.456, 0.406]$, standard deviation $\sigma = [0.229, 0.224, 0.225]$) employed during DINO's self-supervised pretraining. This normalization step aligns MRI intensity distributions with those expected by the pretrained feature extractor [44,45], thereby ensuring stable and consistent feature representations across different DINO variants and backbone capacities.

For a given input image x, the frozen DINO backbone applies several transformer layers to the image, producing a sequence of token embeddings at each layer. Among these embeddings, the [CLS] (classification) token aggregates global context through self-attention, providing an effective representation of the whole image. To obtain complementary semantic abstraction, global image descriptors were extracted at two network depths: the final transformer layer and the penultimate transformer layer. For consistency across all DINO variants, we also evaluate the concatenation of the two descriptors as a combined representation. Formally, let $H^{(\ell)}(x) \in \mathbb{R}^{T \times d}$ denote the hidden-state matrix at transformer layer $\ell$, where T is the number of tokens and d is the embedding dimensionality. The extracted representations are defined as

$$f_{final}(x) = H_{[CLS]}^{(L)}(x), \qquad f_{penult}(x) = H_{[CLS]}^{(L-1)}(x) \qquad (3)$$

where L denotes the total number of transformer layers. The feature dimensionality depends on the backbone configuration; the extracted CLS embeddings have dimensions 384, 768, and 1024 for the Small, Base, and Large backbones, respectively.
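To make Equation (3) concrete, the following sketch selects the [CLS] descriptors from a list of per-layer hidden states. It is an illustration under assumptions, not the authors' code: it assumes the [CLS] token sits at index 0 of the token axis (the convention in Hugging Face ViT-style models), that hidden states arrive in the `output_hidden_states=True` format (embedding output followed by one array per block), and that the combined descriptor places the final-layer features first. Random arrays stand in for real backbone outputs, and the helper name `cls_descriptors` is hypothetical.

```python
import numpy as np

def cls_descriptors(hidden_states):
    """Return final, penultimate, and concatenated [CLS] features.

    hidden_states: sequence of arrays, one per layer, each shaped (batch, T, d).
    """
    f_final = np.asarray(hidden_states[-1])[:, 0, :]     # H^{(L)} at [CLS]
    f_penult = np.asarray(hidden_states[-2])[:, 0, :]    # H^{(L-1)} at [CLS]
    f_combined = np.concatenate([f_final, f_penult], axis=1)
    return f_final, f_penult, f_combined

# Stand-in for a Small backbone: embedding output plus 12 blocks, d = 384,
# a mini-batch of eight images, and an illustrative token count of 197.
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((8, 197, 384)) for _ in range(13)]
f_final, f_penult, f_combined = cls_descriptors(hidden_states)
```

With a Small backbone, each single-layer descriptor has 384 dimensions, so the concatenated representation has 768.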

For better numerical stability and to reduce the dimensionality of the features for the classification task, a two-stage feature preprocessing pipeline was adopted. First, the extracted feature vectors were standardized using z-score normalization, ensuring zero mean and unit variance across all feature dimensions. The normalization parameters were only computed using the original (non-augmented) training set. Principal Component Analysis (PCA) [46] was then used on the standardized features, and 90% of the cumulative variance was retained. Both the normalization transformation and the PCA transformation were only fitted on non-augmented training samples and then directly applied to augmented training features as well as to the test set. This design strictly prevents information leakage from augmented or evaluation data and ensures unbiased and reliable performance assessment.
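The fit-on-original, apply-to-all protocol described above can be sketched with scikit-learn as follows. This is a minimal example assuming scikit-learn is used for both steps (the text does not name the library); the feature matrices are synthetic stand-ins, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 384))                      # shared mixing matrix
X_train_orig = rng.standard_normal((300, 64)) @ W       # original training features
X_train_aug = X_train_orig[:100] + 0.05 * rng.standard_normal((100, 384))
X_test = rng.standard_normal((50, 64)) @ W              # held-out test features

# z-score normalization followed by PCA retaining 90% of the cumulative variance
prep = make_pipeline(StandardScaler(), PCA(n_components=0.90, svd_solver="full"))

# Fit on the original (non-augmented) training features only
prep.fit(X_train_orig)

# Apply the fitted transforms to all training features and to the test set
Z_train = prep.transform(np.vstack([X_train_orig, X_train_aug]))
Z_test = prep.transform(X_test)

pca = prep.named_steps["pca"]
retained = pca.explained_variance_ratio_.sum()
```

Passing a float in (0, 1) as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction.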

3.4. Classification Models

To assess the discriminative quality of the extracted DINO feature representations, two complementary supervised classifiers were used: a support vector machine (SVM) and a multi-layer perceptron (MLP). These classifiers were chosen owing to their widespread use with high-dimensional feature spaces and their known success in a variety of medical image analysis tasks. All experiments were performed using features generated through DINO (v1), DINOv2, and DINOv3 backbones across multiple model capacities (Small/Base/Large), using the final layer, the penultimate layer, and their combination (by concatenation). Because DINO (v1) is publicly available only in Small and Base configurations, results involving the Large backbone are reported only for DINOv2 and DINOv3. Experiments were performed consistently on both the 15-class and 4-class datasets.

Support Vector Machine (SVM): Non-linear decision boundaries in the DINO feature space were captured using a multi-class SVM with a radial basis function (RBF) kernel. The kernel coefficient was set to $\gamma = \text{'scale'}$, and the regularization parameter was fixed at $C = 10$. Since the 15-class dataset was intentionally balanced using DCGAN-based augmentation before feature extraction, no further class weighting was necessary during classifier training. For the 4-class dataset, mild residual class imbalance was addressed by incorporating class weights computed from the training set during SVM training. Model performance was estimated using stratified five-fold cross-validation on the training set, followed by evaluation on the held-out test set.
Multi-Layer Perceptron (MLP): A shallow MLP classifier was used to evaluate the suitability of DINO features for neural network-based classification. The network architecture had a single hidden layer with 100 neurons and ReLU activation, followed by a softmax output layer sized to the number of target classes. The model was optimized with the Adam optimizer with an initial learning rate of $10^{-3}$. Training was performed with early stopping enabled, using a validation split of the training data to prevent overfitting. No explicit class weighting was applied during MLP training, given the balanced nature of the 15-class dataset and the limited imbalance in the 4-class dataset.
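The two classifier configurations listed above map naturally onto scikit-learn estimators. The sketch below is one plausible instantiation and not necessarily the authors' code: the use of scikit-learn, the `class_weight="balanced"` setting (standing in for the class weights used on the 4-class dataset), the shuffled folds, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for PCA-reduced DINO features of a 4-class problem
X, y = make_classification(n_samples=400, n_features=40, n_informative=20,
                           n_classes=4, random_state=0)

# RBF-kernel SVM with C = 10 and gamma = 'scale'
svm = SVC(kernel="rbf", C=10, gamma="scale", class_weight="balanced")

# Single hidden layer of 100 ReLU units, Adam, initial learning rate 1e-3,
# early stopping on an internal validation split
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu", solver="adam",
                    learning_rate_init=1e-3, early_stopping=True, random_state=0)

# Stratified five-fold cross-validation on the training data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
svm_cv = cross_val_score(svm, X, y, cv=cv)

svm.fit(X, y)
mlp.fit(X, y)
svm_pred = svm.predict(X)
mlp_pred = mlp.predict(X)
```

For the balanced 15-class setting, the `class_weight` argument would simply be left at its default of `None`.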

For both classifiers, identical preprocessing pipelines, feature configurations, and training–testing splits were used to ensure fair and reproducible comparison across datasets, DINO variants, and representation depths. Classification performance was evaluated using overall accuracy as well as class-wise precision, recall, and F1-score to provide balanced insight into performance across classes.
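As a small worked example of these metrics, the snippet below computes accuracy, per-class precision, recall, and F1-score, and a confusion matrix for a toy three-class prediction. The labels are invented for illustration, and the choice of macro averaging for the summary value is an assumption, since the averaging scheme is not stated in the text.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Toy ground truth and predictions for three classes (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

accuracy = accuracy_score(y_true, y_pred)

# Class-wise precision, recall, and F1-score (one value per class)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None)

# Assumption: summary scores are macro-averaged over classes
macro_f1 = f1.mean()

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
```

Here one class-1 sample is predicted as class 2, which lowers the recall of class 1 and the precision of class 2 while leaving class 0 unaffected.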

3.5. Experimental Setup

All experiments were performed with Google Colaboratory, which offers access to cloud-based GPU acceleration. The computing environment used an NVIDIA Tesla T4 GPU with 15 GB of VRAM, CUDA version 12.4, and driver version 550.54.15. This setup offered enough GPU memory to run feature extraction with the DINO-based backbones (DINO (v1), DINOv2, and DINOv3) across the available model sizes (Small/Base/Large), as well as the training and evaluation of the downstream classifiers. All models were implemented in Python using the PyTorch framework and the Hugging Face Transformers library. For optimal memory usage and computational efficiency, feature extraction was carried out in inference mode with gradient computation turned off. To keep execution consistent across all model variants, batch sizes were chosen empirically based on GPU memory limitations.

To ensure reproducibility, fixed random seeds were used where applicable, and all experiments were executed under identical software and hardware configurations for both datasets. The complete experimental pipeline, including preprocessing, feature extraction, classification, and evaluation, was executed consistently across all model configurations. Classification performance was evaluated using standard metrics, including accuracy, precision, recall, and F1-score. The obtained results were compared against representative state-of-the-art approaches, including deformable attention-based models and fine-tuned CNN (ResNet50) architectures, as reported in [2,11].

4. Results and Discussion

This section presents a quantitative and qualitative evaluation of the proposed brain tumor classification framework based on DINO-family (DINO (v1), DINOv2, and DINOv3) feature extraction combined with classical machine learning classifiers. Experiments were conducted on two datasets of different complexity: a 15-class brain tumor MRI dataset representing a fine-grained classification task and a 4-class brain tumor MRI dataset corresponding to a clinically more common coarse-grained scenario. Performance was evaluated using classification accuracy on a held-out test set, complemented by five-fold stratified cross-validation scores to assess generalization. Precision, recall, and F1-score were further analyzed for the best-performing configurations to provide insight into class-wise behavior.

4.1. Results on Dataset A

The classification performance on the 15-class brain tumor dataset is summarized in Table 3 (DINO (v1)), Table 4 (DINOv2), and Table 5 (DINOv3). The fine-grained structure of the labels and the significant visual resemblance among tumor subtypes make this task more difficult. After class imbalance was corrected with DCGAN-based augmentation before feature extraction, all investigated configurations achieved good performance, and the small gap between training and test accuracies indicates minimal overfitting.

Table 3 reports the obtained results using DINO (v1) features. Among all assessed configurations, the optimal overall performance on Dataset A was attained by DINO v1-Base, utilizing combined final and penultimate features with an SVM classifier, achieving an accuracy, recall, and F1-score of 98.17% and a precision of 98.20% on the test set. These results demonstrate the efficiency of the original DINO representations in fine-grained brain tumor classification.

As illustrated in Table 4, the best DINOv2 performance is obtained with the Base backbone with combined features, achieving an accuracy, precision, recall, and F1-score of 96.07%, 96.14%, 96.07%, and 96.09%, respectively. Trends similar to those for DINO (v1) are observed: SVM consistently performs better than MLP, and feature fusion improves classification performance. In general, DINOv2 gives comparable results but is marginally inferior to DINO (v1) in this controlled frozen-backbone setting.

As reported in Table 5, across all DINOv3 backbones, support vector machine (SVM) classifiers consistently surpassed multi-layer perceptron (MLP) models in performance. The optimal overall performance was achieved with DINOv3-Small features with fused final- and penultimate-layer embeddings, resulting in an accuracy, precision, recall, and F1-score of 97.47%, 97.57%, 97.47%, and 97.49%, respectively, on the test set. Fused features systematically outperformed single-layer representations, indicating that multi-level transformer features are useful for fine-grained tumor discrimination because they capture complementary semantic and structural information. Increasing backbone capacity from Small to Base and Large yielded competitive results with at most marginal improvements, suggesting diminishing returns for larger models in this setting.

The best-performing configuration (DINO v1-Base + SVM with fused features) was further analyzed in terms of per-class precision, recall, and F1-score, as shown in Table 6. Recall was high for the majority of tumor types, indicating that they were reliably detected. For visually similar tumor subtypes such as astrocytoma and schwannoma, slightly lower precision was noted, which is consistent with known radiological overlap. Overall, the precision, recall, and F1-score were 98.20%, 98.17%, and 98.17%, respectively, reflecting solid and well-balanced performance across all classes.

To further examine the class-specific behavior of the proposed model, Figure 4 shows the confusion matrix of the optimal configuration (DINO v1-Base with combined features and an SVM classifier). The matrix exhibits strong diagonal dominance, demonstrating effective differentiation among the majority of tumor classes. Minor misclassifications occur between visually similar tumor subtypes, including astrocytoma and schwannoma, in line with the per-class precision and recall trends documented in Table 6.

4.2. Results on Dataset B

The four-class brain tumor MRI dataset results are shown in Table 7 (DINO (v1)), Table 8 (DINOv2), and Table 9 (DINOv3). Higher classification performance was seen in all configurations when compared to the 15-class task because there were fewer classes and more distinct inter-class boundaries.

As reported in Table 7, the optimal performance for DINO (v1) was obtained using the Small backbone with penultimate-layer features and an SVM classifier. This configuration achieved an accuracy of 99.08% and a precision, recall, and F1-score of 99.04%, 99.00%, and 99.02%, respectively, on the test data, demonstrating well-balanced performance across all four categories. The confusion between tumor types and the no-tumor class was minimal, confirming the strong transferability of DINO (v1) representations to coarse-grained brain tumor classification.

Table 8 shows that DINOv2 achieves similarly high results across all backbone sizes and feature sets. The highest test performance was achieved with DINOv2-Base with combined features and an SVM classifier, with an accuracy of 98.40% and a precision, recall, and F1-score of 98.32%, 98.25%, and 98.26%, respectively. Feature fusion generally improved performance, and SVM was consistently stronger than MLP, suggesting that the learned DINOv2 representations are well separated under margin-based decision boundaries.

As reported in Table 9, for DINOv3, SVM classifiers consistently produced strong results, with the highest test accuracy of 98.55% obtained using Small backbone features extracted from the final transformer layer, with a precision, recall, and F1-score of 98.49%, 98.42%, and 98.43%, respectively. Differences across feature configurations and backbone sizes were minimal, implying that even compact DINOv3 representations are effective for coarse-grained classification. While DINOv3-Base and DINOv3-Large achieved competitive performance, any gains over the Small variant were marginal. MLP classifiers achieved comparable but consistently lower accuracy than SVM models, demonstrating the robustness of margin-based classifiers in high-dimensional feature spaces.

Per-class evaluation for the best-performing configuration (Small DINO (v1) + SVM with penultimate-layer features), summarized in Table 10, shows near-perfect precision and recall across all four categories. The F1-score reached 99.02%, with minimal confusion between tumor classes and the no-tumor category, highlighting the suitability of the proposed framework for brain tumor classification scenarios.

Figure 5 illustrates the confusion matrix for the optimal configuration (DINO (v1)-Small with penultimate-layer features and an SVM classifier), supplementing the per-class quantitative results. The matrix reveals near-perfect diagonal dominance, demonstrating effective differentiation among the four classes. Moreover, the near absence of confusion between tumor categories and the no-tumor class supports the robustness of the proposed framework for coarse-grained brain tumor classification.

4.3. Discussion

The experimental findings show that using DINO-family backbones (DINO (v1), DINOv2, and DINOv3) as fixed feature extractors produces highly discriminative and transferable representations for brain tumor MRI classification. Strong generalization performance was seen across both datasets, with only slight variations between cross-validation and test accuracies, suggesting that the suggested pipeline successfully reduces overfitting without necessitating end-to-end fine-tuning. Also, the comparative analysis within the DINO generations reveals that the performance does not necessarily increase with model version: although DINOv3 can still deliver good and consistent results, DINO (v1) has the best overall configurations across both datasets, whereas DINOv2 can be competitive but tends to be slightly lower than the other two generations under identical preprocessing, augmentation, and classifier parameters. The key result is related to the benefits achieved by feature fusion between the transformer layers. The final-layer features represent high-level context information, while the penultimate-layer features retain morphological information. The fusion of such representations leads to performance improvement, especially for the challenging 15-class dataset, where subtle inter-class differences are critical. This highlights the importance of multi-level representations in discriminating between tumor types. In general, support vector machines were more stable and robust than multi-layer perceptrons, especially in the high-dimensional feature spaces created by DINO backbones. Although larger backbones can provide richer representations, the gains over smaller ones were often modest, indicating that Small/Base variants offer a favorable trade-off between classification performance and computational cost in this setting. 
While other domain-specific medical self-supervised models (e.g., RadImageNet-pretrained backbones [47] and medical foundation models like REMEDIS [48]) have shown advantages in specific medical imaging tasks, this study intentionally focuses on the DINO family (v1, v2, and v3) to enable a controlled and reproducible comparison under the same experimental conditions. By relying exclusively on publicly available self-supervised models pretrained on natural images, we isolate the effect of architectural and training refinements across DINO generations without confounding factors introduced by heterogeneous medical pretraining datasets or protocols. Notably, our results show that DINO (v1) achieved the highest performance on both Dataset A (98.17%, Base backbone) and Dataset B (99.08%, Small backbone), indicating that newer architectural iterations do not uniformly yield improvements for all medical imaging tasks under a frozen feature paradigm. Nevertheless, all DINO variants achieved competitive or superior performance compared to fine-tuned CNN baselines, demonstrating strong cross-domain transferability and data efficiency. Lastly, the difference in results between the 4-class and 15-class datasets highlights the impact of data properties and classification granularity. Overall, the findings demonstrate that the combination of lightweight classifiers and self-supervised vision transformers creates an effective and practical framework for brain tumor classification, with great potential for incorporation into future computer-aided diagnosis systems.

4.4. Ablation Study

A thorough ablation study was carried out using DINOv3 features extracted at a resolution of 224 × 224 on the challenging 15-class brain tumor MRI dataset (Dataset A) in order to better understand the contribution of each component of the proposed framework. While the four-class brain tumor MRI dataset (Dataset B) primarily serves to validate overall generalization in a simpler, less imbalanced scenario, the ablation study is conducted only on Dataset A because it constitutes a more fine-grained and severely imbalanced setting that is better suited to analyzing the impact of class balancing, feature fusion, and loss design. The ablation experiments address three main aspects: the impact of GAN-based data augmentation for class balancing, the effect of multi-level feature fusion, and the effect of the loss function, assessed by comparing categorical cross-entropy with categorical focal loss.

The classification performance achieved with and without GAN-based augmentation, using various DINOv3 backbones (Small, Base, and Large) and different feature sets and classifiers, is shown in Table 11. In the absence of GAN augmentation, test accuracies are around 89–93%, which reflects the challenges of the highly unbalanced 15-class setting. In contrast, when GAN-based class balancing is enabled, the test accuracy is consistently boosted to about 96–97%, with comparable gains in cross-validation scores. For instance, for the DINOv3-Small backbone with combined features and an SVM classifier, the test accuracy is 92.95% without GAN augmentation and 97.47% with it, demonstrating the extent of the advantage offered by synthetic balancing. These improvements are more evident for the MLP classifier, suggesting that synthetic samples stabilize training and reduce the bias of the model towards dominant classes. Consequently, GAN-based data augmentation is essential for addressing severe class imbalance, and it enhances both generalization and robustness in the fine-grained classification of brain tumors.

Across all backbone variants and classifiers, fusing the final and penultimate transformer layer features consistently outperforms the use of either feature representation alone. For example, feature fusion gives the best cross-validation and test accuracies, with a top performance of 97.47% test accuracy for the DINOv3-Small backbone with an SVM classifier on the combined features (Table 11). This suggests that representations at multiple transformer depths capture complementary information: final-layer features carry more high-level semantic context, penultimate-layer features retain more detailed morphological information, and the combination of the two yields more discriminative and robust feature representations, which are especially useful for separating visually similar tumor subtypes in multi-class tasks.

Although the proposed framework relies on frozen DINO features for efficiency and reduced overfitting, we additionally investigated whether partial fine-tuning could further improve performance. For this comparison, we used a lightweight fine-tuning strategy in which the final transformer block of the DINO backbone was unfrozen along with a small classification head, while the rest of the network was kept frozen. This setting is commonly employed as an intermediate between full fine-tuning and fixed feature extraction. Table 12 gives the results on Dataset A. Fine-tuning gave poorer test performance than the frozen feature method, despite requiring considerably higher training time and a non-negligible number of trainable parameters. The best fine-tuned configuration reached 92.00% test accuracy (final-layer setting), whereas the frozen pipeline achieved up to 97.47% on the same dataset (reported earlier in the main results tables). This indicates that, in this medical imaging setting with limited labeled data, keeping the backbone frozen provides stronger generalization and avoids overfitting effects that may arise during fine-tuning.

To evaluate the importance of further optimization-level methods after addressing class imbalance at the data level, we compare the standard categorical cross-entropy loss with the categorical focal loss for the MLP classifier. As seen in Table 13, focal loss produces slightly better or equal cross-validation scores for the majority of backbones and feature settings, suggesting improved training stability and a greater emphasis on harder-to-classify samples. However, focal loss does not consistently or significantly outperform cross-entropy, and test accuracy differences between the two losses remain minimal and configuration-dependent. After resolving class imbalance via GAN-based data augmentation, test accuracies are strong and stable for both losses. Notably, the best test accuracy (95.63%) was obtained with categorical cross-entropy, suggesting that it can generalize slightly better than categorical focal loss in balanced settings.
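The difference between the two losses can be illustrated numerically. The sketch below uses the standard focal loss form, in which the cross-entropy term is scaled by $(1 - p_t)^{\gamma}$ with $p_t$ the predicted probability of the true class. The focusing parameter $\gamma = 2$ is a commonly used default and an assumption here, since the value used in the study is not stated; class-balancing weights are omitted, and the probabilities are invented for illustration.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Per-sample categorical cross-entropy: -log p_t."""
    p_t = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return -np.log(p_t)

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Per-sample categorical focal loss: -(1 - p_t)^gamma * log p_t."""
    p_t = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

probs = np.array([[0.90, 0.05, 0.05],    # easy sample, confidently correct
                  [0.40, 0.35, 0.25]])   # hard sample, low confidence
labels = np.array([0, 0])

ce = cross_entropy(probs, labels)
fl = focal_loss(probs, labels)
ratio = fl / ce                          # equals (1 - p_t)^gamma per sample
```

The easy sample's loss is scaled by 0.01 and the hard sample's by 0.36, so training focuses relatively more on the hard sample. Setting `gamma=0` recovers cross-entropy exactly.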

The ablation study yields four key observations. First, GAN-based class balancing plays the most crucial role in improving the results for the imbalanced 15-class dataset. Second, multi-level feature fusion enhances discriminative capability for all DINO models. Third, partial fine-tuning (last block + head) does not improve performance and substantially increases training time compared to frozen feature extraction. Finally, categorical focal loss does not offer a consistent or substantial benefit over categorical cross-entropy once class imbalance is mitigated, since both loss functions demonstrate comparable and stable performance.

4.5. Performance Comparison

To provide additional context for the proposed framework, its performance is compared against contemporary state-of-the-art methods reported in the literature for both fine-grained and coarse-grained brain tumor classification tasks. Quantitative comparisons are summarized in Table 14 and Table 15. The proposed method is characterized by a modest total training time: the reported figure comprises a single forward pass for feature extraction using the frozen DINO backbone variants, followed by training of a lightweight classifier, whereas for CNN-based methods it corresponds to end-to-end network optimization. On the 15-class brain tumor dataset (Table 14), the deformable attention-based technique [2] attains robust classification performance by utilizing attention mechanisms specifically designed for tumor morphology. The proposed method, employing frozen DINO (v1), DINOv2, and DINOv3 representations coupled with a lightweight SVM classifier, achieved comparable or marginally superior classification accuracy while preserving balanced precision, recall, and F1-score. This underscores the efficacy of self-supervised transformer features for precise tumor subtype differentiation without requiring backbone fine-tuning.

For the four-class dataset (Table 15), fine-tuned CNN-based models [11] report high accuracy through supervised transfer learning and end-to-end optimization. The proposed DINOv3-based framework achieves similar or marginally higher performance using a frozen feature extractor and a simple classifier. This demonstrates that competitive results can be obtained with reduced training complexity and fewer trainable parameters. Beyond supervised baselines, Table 15 also includes representative self-supervised learning approaches for brain tumor MRI classification. An explainable SSL framework based on SimCLR with an EfficientNetB3 backbone [32] achieves 98.32% test accuracy on the four-class Kaggle dataset. A generative SSL pretraining strategy using the ResViT hybrid residual CNN–ViT architecture [31] reports 98.5% accuracy on the Figshare dataset. In this setting, the proposed frozen feature approach remains competitive with both fine-tuned supervised models and SSL-pretrained MRI-specific pipelines.

Overall, these comparisons suggest that the proposed method provides a practical and effective alternative to existing approaches, achieving competitive performance across datasets of varying complexity while benefiting from the flexibility and efficiency of self-supervised feature representations.

4.6. Computational Efficiency

In comparison to end-to-end deep learning models, the suggested framework greatly lowers the number of trainable parameters and total training complexity by using frozen DINO backbones as fixed feature extractors. Our method just requires training a lightweight classifier (e.g., SVM or MLP) on top of precomputed representations, whereas traditional CNN-based systems require optimizing millions of parameters using backpropagation. This design leads to faster training times, lower memory requirements, and reduced risk of overfitting, particularly in scenarios where labeled medical data are limited. In addition, feature extraction can be performed offline, allowing the classification stage to be trained and updated efficiently without reprocessing the entire dataset. These properties make the proposed approach well-suited for practical deployment in clinical environments where computational resources and training time may be constrained.

4.7. Visualization and Interpretability

To provide qualitative insight into the behavior of the proposed framework, attention rollout maps were generated from the frozen DINOv3 backbone. Figure 6 presents representative brain tumor MRI samples, illustrating the spatial distribution of attention responses across different cases. In the displayed overlays, regions receiving higher attention weights frequently correspond to tumor locations, suggesting that the learned self-supervised representations capture clinically relevant structural patterns. These qualitative observations complement the quantitative findings by adding a degree of interpretability to the pipeline. It is crucial to note that the attention maps are meant to be used as an explanatory tool to highlight image regions that contribute more strongly to the extracted representations rather than for accurate tumor localization or segmentation. Overall, the visualizations offer further support for the reliability and applicability of DINOv3 features for tasks such as brain tumor classification.
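For readers unfamiliar with attention rollout, the following NumPy sketch shows the commonly used formulation: per-layer attention is averaged over heads, mixed with the identity to account for residual connections, re-normalized, and multiplied across layers. The exact variant used to produce Figure 6 is not specified, so head averaging, the 0.5/0.5 residual mixing, and the toy dimensions are assumptions, and random matrices stand in for real attention weights.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer attention maps into a single (T, T) rollout matrix.

    attentions: list of arrays shaped (heads, T, T), ordered from the first to
    the last layer, with each row summing to 1.
    """
    T = attentions[0].shape[-1]
    rollout = np.eye(T)
    for attn in attentions:
        a = attn.mean(axis=0)                     # average over heads
        a = 0.5 * a + 0.5 * np.eye(T)             # model the residual connection
        a = a / a.sum(axis=-1, keepdims=True)     # keep rows normalized
        rollout = a @ rollout                     # propagate through this layer
    return rollout

# Toy example: 4 layers, 3 heads, one [CLS] token plus a 4 x 4 grid of patches
rng = np.random.default_rng(0)
layers, heads, T = 4, 3, 1 + 16
raw = rng.random((layers, heads, T, T))
attentions = [r / r.sum(axis=-1, keepdims=True) for r in raw]

rollout = attention_rollout(attentions)
cls_map = rollout[0, 1:].reshape(4, 4)            # [CLS] attention over patches
```

In practice, `cls_map` would be upsampled to the input resolution and overlaid on the MRI slice. Any additional non-patch tokens used by a given backbone would need to be dropped before reshaping.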

5. Conclusions

This study investigated the applicability of self-supervised DINO-family representations for brain tumor MRI classification using a frozen feature extraction paradigm. By combining DINO (v1), DINOv2, and DINOv3 embeddings with lightweight classifiers, including support vector machines and multi-layer perceptrons, the proposed framework achieved strong and balanced performance across datasets of varying complexity. Experimental results demonstrated that all evaluated DINO features are highly transferable to the medical imaging domain, despite being pretrained on natural images, and can support accurate classification without the need for end-to-end backbone fine-tuning. Across both datasets, DINO (v1) achieved the best overall performance, while DINOv3 remained competitive and DINOv2 showed slightly lower accuracy in most configurations. Feature fusion across transformer depths further enhanced discriminative capacity, particularly in the challenging multi-class setting, while compact backbone variants provided a favorable trade-off between performance and computational efficiency. In addition, attention rollout visualizations offered qualitative insight into the model’s behavior, indicating that the learned representations emphasize clinically relevant tumor regions. Partial fine-tuning of the last transformer block and classification head did not improve accuracy and significantly increased computational cost, which supports the efficiency of the frozen feature extraction paradigm. These findings highlight the potential of self-supervised vision transformers as practical and efficient building blocks for computer-aided diagnosis systems. Overall, the proposed approach provides a flexible and resource-efficient alternative to conventional fully supervised deep learning pipelines for brain tumor classification.
Despite these promising results, this study has several limitations. The experiments relied on a small number of publicly accessible 2D MRI datasets, which may limit generalization to different imaging modalities, acquisition procedures, and clinical settings. Furthermore, the approach operates on slice-level images and does not explicitly model three-dimensional spatial context. Lastly, while class imbalance was mitigated by applying DCGAN-based augmentation, synthetic samples might not accurately reflect the variability of actual clinical data. Future research will focus on extending the framework to three-dimensional magnetic resonance images and on validating the results on external datasets and additional clinical data. Additionally, hybrid models that combine frozen self-supervised models with advanced classifiers or limited fine-tuning will be explored.

Author Contributions

Formal analysis, R.M., W.H. and W.S.; investigation, M.D.C., R.M. and P.C.; resources, R.M.; writing and correcting—original, A.H. and M.L.; supervision, A.H. and M.L.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Italian Ministry for Universities and Research (MUR) under the grant Future Artificial Intelligence Research (FAIR), CUP B53C220036 30006, grant number MUR: PE0000013.

Data Availability Statement

The data presented in this study are available in Kaggle at https://www.kaggle.com/datasets/waseemnagahhenes/brain-tumor-for-14-classes (accessed on 30 September 2025) and https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset (accessed on 20 October 2025). These data were derived from the following resources available in the public domain: Brain Tumor for 14 Classes and Brain Tumor MRI Dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Girardi, F.; Matz, M.; Stiller, C.; You, H.; Marcos Gragera, R.; Valkov, M.Y.; Bulliard, J.L.; De, P.; Morrison, D.; Wanner, M.; et al. Global survival trends for brain tumors, by histology: Analysis of individual records for 556,237 adults diagnosed in 59 countries during 2000–2014 (CONCORD-3). Neuro-Oncol. 2023, 25, 580–592.
2. Zarenia, E.; Far, A.A.; Rezaee, K. Automated multi-class MRI brain tumor classification and segmentation using deformable attention and saliency mapping. Sci. Rep. 2025, 15, 8114.
3. Srinivasan, S.; Francis, D.; Mathivanan, S.K.; Rajadurai, H.; Shivahare, B.D.; Shah, M.A. A hybrid deep CNN model for brain tumor image multi-classification. BMC Med. Imaging 2024, 24, 21.
4. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-conquer: Confluent triple-flow network for RGB-T salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974.
5. Smits, M. Update on neuroimaging in brain tumours. Curr. Opin. Neurol. 2021, 34, 497–504.
6. Yu, L.; Yu, S.; Wang, F.; Zhou, X.; Yang, F.; Cao, D.; Xing, Z. Multiparametric MRI for differential diagnosis of primary central nervous system lymphoma and atypical glioblastoma: An analysis incorporating DWI, DCE-MRI, and contrast agent preload DSC-PWI. BMC Med. Imaging 2025, 25, 345.
7. Pattanaik, B.; Anitha, K.; Rathore, S.; Biswas, P.; Sethy, P.; Behera, S. Brain tumor magnetic resonance images classification based machine learning paradigms. Contemp. Oncol. Onkol. 2022, 26, 268–274.
8. Ali, U. Comparative Evaluation of Machine Learning Classifiers for Brain Tumor Detection. medRxiv 2024.
9. Ali, R.R.; Yaacob, N.M.; Alqaryouti, M.H.; Sadeq, A.E.; Doheir, M.; Iqtait, M.; Rachmawanto, E.H.; Sari, C.A.; Yaacob, S.S. Learning architecture for brain tumor classification based on deep convolutional neural network: Classic and ResNet50. Diagnostics 2025, 15, 624.
10. Chinga, A.; Bendezu, W.; Angulo, A. Comparative Study of CNN Architectures for Brain Tumor Classification Using MRI: Exploring GradCAM for Visualizing CNN Focus. Eng. Proc. 2025, 83, 22.
11. İlgün, E.G.; Dener, M. Brain tumor classification from MRI scans using fine-tuned CNN models. Neural Comput. Appl. 2025, 37, 28779–28801.
12. Missaoui, R.; Hechkel, W.; Saadaoui, W.; Helali, A.; Leo, M. Advanced Deep Learning and Machine Learning Techniques for MRI Brain Tumor Analysis: A Review. Sensors 2025, 25, 2746.
13. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660.
14. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
15. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104.
16. Brain Tumor for 14 Classes. Available online: https://www.kaggle.com/datasets/waseemnagahhenes/brain-tumor-for-14-classes (accessed on 30 September 2025).
17. Brain Tumor MRI Dataset. Available online: https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset (accessed on 20 October 2025).
18. Padmavathy, R.; Kalaiarasi, G.; Durgadevi, G.; Yogitha, R.; Dheepan, G.K. Study of support vector machine for classification of brain tumours. In AIP Conference Proceedings; AIP Publishing LLC: Melville, NY, USA, 2024; Volume 3022, p. 020008.
19. Ghazvini, M.M.; Dehlaghi, V.; Papi, A.; Mansoory, M.S. Diagnosis and Classification of Brain Tumors from MRI Images Using the SVM Algorithm. J. Clin. Res. Paramed. Sci. 2024, 13, e148703.
20. Adamu, M.J.; Kawuwa, H.B.; Qiang, L.; Nyatega, C.O.; Younis, A.; Fahad, M.; Dauya, S.S. Efficient and accurate brain tumor classification using hybrid mobileNetV2–support vector machine for magnetic resonance imaging diagnostics in neoplasms. Brain Sci. 2024, 14, 1178.
21. Alqhtani, S.M.; Soomro, T.A.; Shah, A.A.; Memon, A.A.; Irfan, M.; Rahman, S.; Jalalah, M.; Almawgani, A.H.; Eljak, L.A.B. Improved brain tumor segmentation and classification in brain MRI with FCM-SVM: A diagnostic approach. IEEE Access 2024, 12, 61312–61335.
22. Aggarwal, A.K. Learning texture features from glcm for classification of brain tumor mri images using random forest classifier. Trans. Signal Process. 2022, 18, 60–63.
23. Allahem, H.; El-Ghany, S.A.; Abd El-Aziz, A.; Aldughayfiq, B.; Alshammeri, M.; Alamri, M. A Hybrid Model of Feature Extraction and Dimensionality Reduction Using ViT, PCA, and Random Forest for Multi-Classification of Brain Cancer. Diagnostics 2025, 15, 1392.
24. Abdulla, A.A. A computer-aided diagnosis system for brain tumors in magnetic resonance imaging (MRI). Multimed. Tools Appl. 2025, 84, 24887–24902.
25. Tiwari, P.; Johri, P.; Katiyar, A. Fusion of Convolutional Neural Networks and Random Forests for Brain Tumor Classification in MRI Scans. Int. J. Comput. Exp. Sci. Eng. 2025, 11, 2738–2747.
26. Nurtay, M.; Kissina, M.; Tau, A.; Akhmetov, A.; Alina, G.; Mutovina, N. Brain tumor classification using deep convolutional neural networks. Comput. Opt. 2025, 49, 253–262.
27. Albalawi, E.; Thakur, A.; Dorai, D.R.; Bhatia Khan, S.; Mahesh, T.; Almusharraf, A.; Aurangzeb, K.; Anwar, M.S. Enhancing brain tumor classification in MRI scans with a multi-layer customized convolutional neural network approach. Front. Comput. Neurosci. 2024, 18, 1418546.
28. Alemayehu, N. Light Weight CNN for classification of Brain Tumors from MRI Images. arXiv 2025, arXiv:2504.21188.
29. Prayogo, R.D.; Hamid, N.; Nambo, H. Hybrid CNN-Based Transfer Learning Enhances Brain Tumor Classification on MRI Images. IEEE Access 2025, 13, 116654–116668.
30. Mughal, F.A.; Jahangir, A.; Bibi, K.; Hakeem, S.; Farid, S.; Jabeen, A.; Asghar, A.; Ali, M. A Self-Supervised Learning Model for the Classification of Brain Tumors Using Medical Images: A Review. J. Med. Health Sci. Rev. 2025, 2, 5627–5648.
31. Karagoz, M.A.; Nalbantoglu, O.U.; Fox, G.C. Residual vision transformer (ResViT) based self-supervised learning model for brain tumor classification. arXiv 2024, arXiv:2411.12874.
32. Rudro, M.F.A.; Arman, S.H.; Rahman, N.N.; Nabil, A.H.; Rahman, A.; Ferdus, Z. An Explainable SSL-Based Model for Robust Multi-Class Brain Tumor Classification from MRI Images; Research Square: Durham, NC, USA, 2025.
33. Weber Nunes, D.; Rauber, D.; Palm, C. Self-supervised 3D Vision Transformer Pre-training for Robust Brain Tumor Classification. In BVM Workshop; Springer: Berlin/Heidelberg, Germany, 2025; pp. 298–303.
34. Safwan, M.N.; Rahman, S.; Mahadi, M.H.; Mobin, M.I.; Jabir, T.M.; Aung, Z.; Mridha, M. T3SSLNet: Triple-Method Self-Supervised Learning for Enhanced Brain Tumor Classification in MRI. IEEE Access 2025, 13, 127852–127867.
35. Khaniki, M.A.L.; Mirzaeibonehkhater, M.; Manthouri, M.; Hasani, E. Brain tumor classification using vision transformer with selective cross-attention mechanism and feature calibration. arXiv 2024, arXiv:2406.17670.
36. Wang, J.; Lu, S.Y.; Wang, S.H.; Zhang, Y.D. RanMerFormer: Randomized vision transformer with token merging for brain tumor classification. Neurocomputing 2024, 573, 127216.
37. Masoudi, B. An optimized dual attention-based network for brain tumor classification. Int. J. Syst. Assur. Eng. Manag. 2024, 15, 2868–2879.
38. Srivastava, S.; Jain, P.; Pandey, S.K.; Dubey, G.; Das, N.N. Automated Brain Tumor Classification and Grading Using Multi-scale Graph Neural Network with Spatio-Temporal Transformer Attention Through MRI Scans. In Interdisciplinary Sciences: Computational Life Sciences; Springer: Berlin/Heidelberg, Germany, 2025; pp. 1–29.
39. Tomar, N.; Chandel, S.; Bhatnagar, G. A visual attention-based algorithm for brain tumor detection using an on-center saliency map and a superpixel-based framework. Healthc. Anal. 2024, 5, 100323.
40. Khan, M.A.; Khan, A.; Alhaisoni, M.; Alqahtani, A.; Alsubai, S.; Alharbi, M.; Malik, N.A.; Damaševičius, R. Multimodal brain tumor detection and classification using deep saliency map and improved dragonfly optimization algorithm. Int. J. Imaging Syst. Technol. 2023, 33, 572–587.
41. Khan, R.; Islam, R. X-SCSANet: Explainable Stack Convolutional Self-Attention Network for Brain Tumor Classification. Int. J. Intell. Syst. 2025, 2025, 1444673.
42. Keles, A.; Akcay, O.; Kul, H.; Bendechache, M. Saliency Maps as an Explainable AI Method in Medical Imaging: A Case Study on Brain Tumor Classification. Zenodo 2023.
43. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; NeurIPS Proceedings: San Diego, CA, USA, 2014; Volume 27.
44. Morid, M.A.; Borjali, A.; Del Fiol, G. A scoping review of transfer learning research on medical image analysis using ImageNet. Comput. Biol. Med. 2021, 128, 104115.
45. Gu, C.; Lee, M. Deep transfer learning using real-world image features for medical image classification, with a case study on pneumonia X-ray images. Bioengineering 2024, 11, 406.
46. Greenacre, M.; Groenen, P.J.; Hastie, T.; d’Enza, A.I.; Markos, A.; Tuzhilina, E. Principal component analysis. Nat. Rev. Methods Prim. 2022, 2, 100.
47. Mei, X.; Liu, Z.; Robson, P.M.; Marinelli, B.; Huang, M.; Doshi, A.; Jacobi, A.; Cao, C.; Link, K.E.; Yang, T.; et al. RadImageNet: An open radiologic deep learning research dataset for effective transfer learning. Radiol. Artif. Intell. 2022, 4, e210315.
48. Azizi, S.; Culp, L.; Freyberg, J.; Mustafa, B.; Baur, S.; Kornblith, S.; Chen, T.; Tomasev, N.; Mitrović, J.; Strachan, P.; et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng. 2023, 7, 756–779.

Figure 1. Overview of the proposed framework for brain tumor feature extraction and classification.

Figure 2. DCGAN architecture used for class-specific data augmentation in the 15-class brain tumor dataset.

Figure 3. Architecture of the frozen DINO-based feature extraction module.

Figure 4. Confusion matrix for the 15-class brain tumor classification task using the best-performing model (DINO (v1)-Base with combined features and an SVM classifier).

Figure 5. Confusion matrix for the 4-class brain tumor MRI dataset using the best-performing model (DINO (v1)-Small with penultimate-layer features and an SVM classifier).

Figure 6. Attention rollout visualizations from the frozen DINOv3 backbone for representative brain tumor MRI samples. The red regions indicate the areas most influential for classification.

Table 1. Class-wise distribution of MRI scans in Dataset A (Brain Tumor for 14 Classes).

Tumor Class	Abbreviation	Number of Scans	Percentage (%)
Meningioma	MEN	874	19.6
Astrocytoma	AST	580	13.0
Normal	NOR	522	11.7
Schwannoma	SCH	465	10.4
Neurocytoma	NEU	457	10.3
Carcinoma	CAR	251	5.6
Papilloma	PAP	237	5.3
Oligodendroglioma	OLI	224	5.0
Glioblastoma	GLI	204	4.6
Ependymoma	EPE	150	3.4
Tuberculoma	TUB	145	3.3
Medulloblastoma	MED	131	2.9
Germinoma	GER	100	2.2
Granuloma	GRA	78	1.8
Ganglioglioma	GAN	38	0.9
Total		4456	100.0

Table 2. Class-wise distribution of MRI scans in Dataset B (Brain Tumor MRI Dataset).

Class	Number of Scans	Percentage (%)
Glioma	1621	23.1
Meningioma	1645	23.4
Pituitary	1757	25.0
No Tumor	2000	28.5
Total	7023	100.0

Table 3. Classification performance on the 15-class brain tumor MRI dataset using DINO (v1) features at 224 × 224 resolution.

Backbone	Feature Set	CV Score SVM (%)	CV Score MLP (%)	Test Accuracy SVM (%)	Test Accuracy MLP (%)	Precision SVM (%)	Precision MLP (%)	Recall SVM (%)	Recall MLP (%)	F1-Score SVM (%)	F1-Score MLP (%)
Small	Final	97.35	93.84	97.33	95.03	97.40	95.13	97.33	95.03	97.35	95.07
	Penultimate	96.91	93.93	97.20	94.73	97.27	94.86	97.20	94.73	97.22	94.77
	Combined	97.38	94.25	97.50	95.43	97.55	95.50	97.50	95.43	97.51	95.45
Base	Final	97.41	94.54	98.03	95.20	98.07	95.27	98.03	95.20	98.04	95.23
	Penultimate	97.27	94.62	97.97	95.43	98.00	95.48	97.97	95.43	97.97	95.45
	Combined	97.42	94.73	98.17	95.00	98.20	95.13	98.17	95.00	98.17	95.05

Table 4. Classification performance on the 15-class brain tumor MRI dataset using DINOv2 features at 224 × 224 resolution.

Backbone	Feature Set	CV Score SVM (%)	CV Score MLP (%)	Test Accuracy SVM (%)	Test Accuracy MLP (%)	Precision SVM (%)	Precision MLP (%)	Recall SVM (%)	Recall MLP (%)	F1-Score SVM (%)	F1-Score MLP (%)
Small	Final	95.69	92.37	95.83	92.90	95.90	92.98	95.83	92.90	95.84	92.92
	Penultimate	93.42	90.92	93.93	91.40	94.25	91.49	93.93	91.40	93.95	91.39
	Combined	95.54	92.96	95.83	94.13	95.90	94.18	95.83	94.13	95.83	94.14
Base	Final	95.12	91.02	95.53	92.17	95.64	92.24	95.53	92.17	95.56	92.18
	Penultimate	94.47	90.63	94.77	90.33	94.88	90.46	94.77	90.33	94.80	90.36
	Combined	95.79	92.87	96.07	93.10	96.14	93.15	96.07	93.10	96.09	93.10
Large	Final	92.33	91.08	92.57	92.23	93.24	92.42	92.57	92.23	92.61	92.29
	Penultimate	93.08	90.99	93.37	92.73	93.80	92.83	93.37	92.73	93.41	92.76
	Combined	93.68	91.92	93.80	92.87	94.19	92.97	93.80	92.87	93.83	92.90

Table 5. Classification performance on the 15-class brain tumor MRI dataset using DINOv3 features at 224 × 224 resolution.

Backbone	Feature Set	CV Score SVM (%)	CV Score MLP (%)	Test Accuracy SVM (%)	Test Accuracy MLP (%)	Precision SVM (%)	Precision MLP (%)	Recall SVM (%)	Recall MLP (%)	F1-Score SVM (%)	F1-Score MLP (%)
Small	Final	96.47	92.57	96.57	93.97	96.66	94.03	96.57	93.97	96.58	93.97
	Penultimate	96.04	91.82	96.20	92.93	96.36	93.03	96.20	92.93	96.24	92.96
	Combined	97.13	94.29	97.47	94.40	97.57	94.50	97.47	94.40	97.49	94.43
Base	Final	96.64	93.76	96.97	94.13	97.05	94.20	96.97	94.13	96.99	94.14
	Penultimate	94.94	92.47	95.47	93.07	95.56	93.29	95.47	93.07	95.49	93.14
	Combined	96.67	94.51	97.03	94.57	97.11	94.67	97.03	94.57	97.05	94.59
Large	Final	95.87	93.03	96.40	93.60	96.54	93.65	96.40	93.60	96.42	93.61
	Penultimate	96.76	94.05	97.20	94.90	97.28	94.94	97.20	94.90	97.22	94.90
	Combined	96.69	94.51	97.10	95.63	97.17	95.69	97.10	95.63	97.11	95.64

Table 6. Per-class classification metrics using combined-layer features for Dataset A (DINO (v1)-Base + SVM).

Class	Precision (%)	Recall (%)	F1-Score (%)	Accuracy (%)	Support
AST	93.40	99.00	96.12	99.00	200
CAR	100.00	99.00	99.50	99.00	200
EPE	97.03	98.00	97.51	98.00	200
GAN	100.00	100.00	100.00	100.00	200
GER	100.00	97.50	98.73	97.50	200
GLI	99.01	100.00	99.50	100.00	200
GRA	100.00	97.50	98.73	97.50	200
MED	98.99	98.50	98.75	98.50	200
MEN	94.95	94.00	94.47	94.00	200
NEU	98.48	97.50	97.99	97.50	200
NOR	99.01	100.00	99.50	100.00	200
OLI	99.00	99.50	99.25	99.50	200
PAP	98.99	98.00	98.49	98.00	200
SCH	95.17	98.50	96.81	98.50	200
TUB	98.96	95.50	97.20	95.50	200
Macro avg	98.20	98.17	98.17	98.17	3000
Weighted avg	98.20	98.17	98.17	–	3000
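As a reading aid for Table 6, the following minimal sketch shows how per-class precision, recall, and F1-score, together with their macro averages, can be derived from a confusion matrix. In Table 6 the per-class "Accuracy" column coincides with recall in every row, and with equal support per class the macro-averaged recall equals the overall accuracy; the sketch reproduces that relationship on a small made-up confusion matrix. The function names and the toy matrix are hypothetical and are not taken from the authors' code.

```python
def per_class_metrics(cm):
    """Per-class precision, recall and F1 from a confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(cm)
    rows = []
    for c in range(n):
        tp = cm[c][c]
        support = sum(cm[c])                        # samples of true class c
        predicted = sum(cm[r][c] for r in range(n))  # samples predicted as c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"precision": precision, "recall": recall,
                     "f1": f1, "support": support})
    return rows


def macro_average(rows, key):
    """Unweighted mean of one metric over all classes."""
    return sum(r[key] for r in rows) / len(rows)


def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Toy 3-class confusion matrix with equal support (10 samples per class).
toy_cm = [
    [9, 1, 0],
    [0, 10, 0],
    [2, 0, 8],
]
toy_rows = per_class_metrics(toy_cm)
toy_macro_recall = macro_average(toy_rows, "recall")
toy_accuracy = overall_accuracy(toy_cm)
```

With unequal support, as in Table 10, macro recall and overall accuracy generally differ, which is why both macro and weighted averages are reported there.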

Table 7. Classification performance on the 4-class brain tumor MRI dataset using DINO (v1) features at 224 × 224 resolution.

Backbone	Feature Set	CV Score SVM (%)	CV Score MLP (%)	Accuracy SVM (%)	Accuracy MLP (%)	Precision SVM (%)	Precision MLP (%)	Recall SVM (%)	Recall MLP (%)	F1-Score SVM (%)	F1-Score MLP (%)
Small	Final	97.85	96.22	98.86	97.03	98.80	96.96	98.75	96.77	98.76	96.81
	Penultimate	97.88	96.11	99.08	97.56	99.04	97.43	99.00	97.35	99.02	97.37
	Combined	97.93	96.48	99.08	97.10	99.03	97.08	99.03	96.85	99.01	96.92
Base	Final	97.83	95.75	98.93	97.48	98.89	97.44	98.83	97.26	98.85	97.29
	Penultimate	97.79	95.85	98.86	97.41	98.81	97.28	98.75	97.22	98.76	97.21
	Combined	97.86	96.18	98.86	97.64	98.81	97.56	98.75	97.49	98.76	97.48

Table 8. Classification performance on the 4-class brain tumor MRI dataset using DINOv2 features at 224 × 224 resolution.

Backbone	Feature Set	CV Score SVM (%)	CV Score MLP (%)	Accuracy SVM (%)	Accuracy MLP (%)	Precision SVM (%)	Precision MLP (%)	Recall SVM (%)	Recall MLP (%)	F1-Score SVM (%)	F1-Score MLP (%)
Small	Final	95.57	92.79	97.64	94.43	97.51	94.08	97.42	94.00	97.44	94.01
	Penultimate	95.34	92.66	97.48	94.58	97.41	94.38	97.25	94.11	97.26	94.13
	Combined	96.48	94.68	97.86	95.04	97.80	94.80	97.67	94.62	97.68	94.64
Base	Final	95.97	93.64	98.09	95.35	98.05	95.16	97.92	94.98	97.94	94.98
	Penultimate	95.17	93.05	97.48	95.19	97.36	95.00	97.26	94.87	97.28	94.86
	Combined	96.18	94.38	98.40	95.42	98.32	95.22	98.25	95.05	98.26	95.07
Large	Final	95.85	94.00	98.09	93.90	98.01	93.58	97.92	93.40	97.94	93.43
	Penultimate	96.27	94.78	97.79	95.80	97.67	95.62	97.59	95.44	97.60	95.47
	Combined	96.17	94.89	98.02	96.57	97.91	96.45	97.84	96.26	97.85	96.28

Table 9. Classification performance on the 4-class brain tumor MRI dataset using DINOv3 features at 224 × 224 resolution.

Backbone	Feature Set	CV Score SVM (%)	CV Score MLP (%)	Test Accuracy SVM (%)	Test Accuracy MLP (%)	Precision SVM (%)	Precision MLP (%)	Recall SVM (%)	Recall MLP (%)	F1-Score SVM (%)	F1-Score MLP (%)
Small	Final	97.18	94.87	98.55	95.96	98.49	95.75	98.42	95.63	98.43	95.65
	Penultimate	96.43	93.00	97.71	94.13	97.70	94.03	97.50	93.64	97.52	93.69
	Combined	97.58	95.47	98.09	96.49	98.10	96.42	97.92	96.22	97.93	96.23
Base	Final	97.32	94.73	98.40	96.95	98.34	96.79	98.25	96.71	98.27	96.74
	Penultimate	95.71	94.28	96.87	95.19	96.82	95.05	96.59	94.78	96.61	94.83
	Combined	97.23	95.54	98.25	96.19	98.19	95.99	98.09	95.88	98.10	95.91
Large	Final	96.97	95.76	97.86	94.97	97.83	94.73	97.67	94.53	97.69	94.57
	Penultimate	97.27	95.03	98.09	96.34	98.04	96.25	97.92	96.06	97.94	96.07
	Combined	97.34	95.71	98.17	96.49	98.12	96.33	98.00	96.19	98.02	96.20

Table 10. Per-class classification metrics using penultimate-layer features for Dataset B (DINO (v1)-Small + SVM).

Class	Precision (%)	Recall (%)	F1-Score (%)	Accuracy (%)	Support
Glioma	98.99	97.67	98.32	97.67	300
Meningioma	97.74	99.02	98.38	99.02	306
No tumor	99.75	100.00	99.88	100.00	405
Pituitary	99.67	99.33	99.50	99.33	300
Macro avg	99.04	99.00	99.02	99.08	1311
Weighted avg	99.09	99.08	99.08	99.08	1311

Table 11. Ablation study on the effect of GAN-based augmentation on Dataset A using DINOv3 features at 224 × 224 resolution. Results are reported for both SVM and MLP classifiers across different feature sets.

Backbone	Feature Set	GAN	CV Score SVM (%)	CV Score MLP (%)	Test Accuracy SVM (%)	Test Accuracy MLP (%)
Small	Final	No	90.24	78.61	90.72	81.10
	Final	Yes	96.47	92.57	96.57	93.97
	Penultimate	No	90.24	77.57	91.28	79.19
	Penultimate	Yes	96.04	91.82	96.20	92.93
	Combined	No	91.58	80.79	92.95	84.45
	Combined	Yes	97.13	94.29	97.47	94.40
Base	Final	No	90.19	80.48	91.39	81.77
	Final	Yes	96.64	93.76	96.97	94.13
	Penultimate	No	88.14	78.55	89.04	81.21
	Penultimate	Yes	94.94	92.47	95.47	93.07
	Combined	No	90.44	83.87	91.83	84.56
	Combined	Yes	96.67	94.51	97.03	94.57
Large	Final	No	90.38	80.73	90.60	83.67
	Final	Yes	95.87	93.03	96.40	93.60
	Penultimate	No	91.56	82.27	92.28	83.45
	Penultimate	Yes	96.76	94.05	97.20	94.90
	Combined	No	91.19	84.87	91.61	83.78
	Combined	Yes	96.69	94.51	97.10	95.63

Table 12. Ablation on Dataset A: partial fine-tuning (unfreezing the last transformer block + classification head).

Feature Set	Train Accuracy (%)	Test Accuracy (%)	Train Time (s)	Trainable Params	Unfrozen Part
Penultimate	81.14	80.87	341.12	1,780,623	Last block + head
Final	92.03	92.00	382.49	1,780,623	Last block + head
Combined	92.14	90.60	383.13	1,786,383	Last block + head

Table 13. Ablation study on the effect of categorical focal loss on the 15-class Dataset A using DINOv3 features at 224 × 224 resolution. Results are reported for the MLP classifier with different feature sets.

Backbone	Feature Set	Loss Function	CV Score (%)	Test Accuracy (%)
Small	Final	Cross-Entropy	92.57	93.97
	Final	Focal Loss	93.53	93.23
	Penultimate	Cross-Entropy	91.82	92.93
	Penultimate	Focal Loss	92.54	92.83
	Combined	Cross-Entropy	94.29	94.40
	Combined	Focal Loss	94.79	94.47
Base	Final	Cross-Entropy	93.76	94.13
	Final	Focal Loss	94.15	93.93
	Penultimate	Cross-Entropy	92.47	93.07
	Penultimate	Focal Loss	92.76	92.23
	Combined	Cross-Entropy	94.51	94.57
	Combined	Focal Loss	94.94	94.93
Large	Final	Cross-Entropy	93.03	93.60
	Final	Focal Loss	93.85	94.63
	Penultimate	Cross-Entropy	94.05	94.90
	Penultimate	Focal Loss	94.53	94.73
	Combined	Cross-Entropy	94.51	95.63
	Combined	Focal Loss	94.98	95.27
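For readers unfamiliar with the loss compared in Table 13, the sketch below contrasts cross-entropy with the standard categorical focal loss for a single sample, FL = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The values gamma = 2.0 and alpha = 1.0 are common defaults used here purely for illustration; the settings used in the experiments are not stated in this section, so they should be read as assumptions, as should the function names and example probabilities.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy for one sample: -log of the true-class probability."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Categorical focal loss for one sample.

    probs  : predicted class probabilities (summing to one)
    target : index of the true class
    gamma  : focusing parameter (assumed default; gamma = 0 gives
             plain cross-entropy scaled by alpha)
    alpha  : scalar weighting factor (assumed default)
    """
    p_t = probs[target]
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy sample (confident, correct) and a hard sample (uncertain).
easy = [0.90, 0.05, 0.05]
hard = [0.30, 0.40, 0.30]

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)  # (1 - 0.9) ** 2
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)  # (1 - 0.3) ** 2
```

The easy sample retains about 1% of its cross-entropy loss while the hard sample retains about 49%, so training focuses on hard or under-represented cases. On a dataset already balanced by augmentation this re-weighting has less to correct, which is consistent with the small and mixed differences in Table 13.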

Table 14. Performance comparison on the 15-class brain tumor MRI dataset.

Method	Feature Set	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Total Training Time
Deformable Attention [2]	–	96.50	95.66	97.22	96.43	–
Proposed (DINO (v1)-Base + SVM)	Combined	98.17	98.20	98.17	98.17	6.28 min
Proposed (DINO (v1)-Small + MLP)	Combined	95.43	95.50	95.43	95.45	1.37 min
Proposed (DINOv2-Base + SVM)	Combined	96.07	96.14	96.07	96.09	3.71 min
Proposed (DINOv2-Small + MLP)	Combined	94.13	94.18	94.13	94.14	1.95 min
Proposed (DINOv3-Small + SVM)	Combined	97.47	97.57	97.47	97.49	2.55 min
Proposed (DINOv3-Large + MLP)	Combined	95.63	95.69	95.63	95.64	23.89 min

Table 15. Performance comparison on the 4-class brain tumor MRI dataset.

Method	Feature Set	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Total Training Time
Fine-tuned CNN (ResNet50) [11]	–	98.47	98.55	98.48	98.51	4–6 min
SSL (SimCLR + EfficientNetB3) [32]	–	98.32	98	98	98	40 min
ResViT [31]	–	98.53	98.54	98.54	98.54	–
Proposed (DINO (v1)-Small + SVM)	Penultimate	99.08	99.04	99.00	99.02	8.80 min
Proposed (DINO (v1)-Base + MLP)	Combined	97.64	97.56	97.49	97.48	10.31 min
Proposed (DINOv2-Base + SVM)	Combined	98.40	98.32	98.25	98.26	12.55 min
Proposed (DINOv2-Large + MLP)	Combined	96.57	96.45	96.26	96.28	28.16 min
Proposed (DINOv3-Small + SVM)	Final	98.55	98.60	98.55	98.55	9.1 min
Proposed (DINOv3-Base + MLP)	Final	96.95	96.97	96.95	96.95	11.0 min

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Missaoui, R.; Del Coco, M.; Saadaoui, W.; Hechkel, W.; Helali, A.; Carcagnì, P.; Leo, M. Brain Tumor Classification Using DINO Features and Lightweight Classifiers. Electronics 2026, 15, 952. https://doi.org/10.3390/electronics15050952

