1. Introduction
Brain tumor classification using magnetic resonance imaging (MRI) is a fundamental task in neuro-oncology, supporting diagnosis, treatment planning, and longitudinal monitoring [1]. MRI provides rich soft-tissue contrast and multi-sequence information, making it the imaging modality of choice for brain tumor assessment [2]. In recent years, deep learning, particularly Convolutional Neural Networks (CNNs), has become the dominant approach for automated brain tumor classification, achieving state-of-the-art performance on several public datasets [3,4,5]. Compared with traditional machine learning methods based on handcrafted features, CNN-based models learn hierarchical representations directly from imaging data, which underpins their superior discriminative performance [6,7].
However, growing evidence suggests that strong in-dataset accuracy does not necessarily imply clinical reliability [8]. Deep learning models trained on a single dataset often suffer significant performance degradation when evaluated on unseen data acquired from different institutions, scanners, or imaging protocols [9,10]. This phenomenon, commonly referred to as domain shift, arises from variations in intensity distributions, acquisition parameters, preprocessing pipelines, and patient populations [11]. In brain MRI specifically, intensity non-standardization across scanners and acquisition protocols has been shown to substantially affect model predictions [12,13]. As a result, models optimized solely for predictive accuracy may rely on dataset-specific cues rather than learning robust, generalizable representations [14]. Several strategies have been proposed to mitigate domain shift in medical imaging, including data augmentation [15], Domain Adaptation [16], adversarial learning [17], normalization schemes [18], and architectural modifications [19].
While these approaches can improve model robustness, many require access to target-domain data during training or involve complex optimization procedures that limit their practical deployment [20]. Moreover, most existing methods focus on aligning input distributions or model parameters, without explicitly constraining the stability of internal feature representations under realistic perturbations [21,22]. Recent studies suggest that instability in the latent feature space is a key contributor to poor generalization under domain shift, particularly in medical imaging tasks where intensity variations are prevalent [23].
Motivated by these observations, this work focuses on improving cross-dataset generalization by enforcing stability at the feature representation level rather than at the pixel or parameter level. This manuscript introduces Feature Space Stability Regularization (FSSR), a lightweight, model-agnostic training framework that explicitly encourages consistent latent representations for original MRI scans and their intensity-perturbed counterparts. By targeting feature-level robustness under MRI-safe perturbations, FSSR aims to improve generalization to unseen clinical data without requiring architectural changes, additional annotations, or access to target-domain samples during training. Accordingly, this work makes the following contributions:
- 1. Feature space instability under realistic intensity perturbations is identified as a key factor limiting cross-dataset generalization in brain MRI classification, distinct from pixel-level or parameter-level variations.
- 2. Feature Space Stability Regularization (FSSR) is introduced as a lightweight and model-agnostic training framework that enforces consistency between normalized feature embeddings of original and MRI-safe intensity-perturbed inputs via an auxiliary loss.
- 3. External validation on the unseen BRISC-2025 dataset demonstrates that FSSR substantially improves generalization for deeper CNN architectures, achieving up to an 8.2% absolute accuracy improvement and a 12.5% macro-F1 improvement without retraining or fine-tuning, while revealing architecture-dependent robustness behavior.
- 4. Mechanistic feature space analysis shows that FSSR reduces mean feature deviation and feature variance, establishing a consistent empirical relationship between latent representation stability and cross-domain generalization performance.
3. Methodology
This section presents the proposed Feature Space Stability Regularization (FSSR) framework for robust brain tumor MRI classification under intensity-driven domain shift. We first describe the datasets and the problem formulation, followed by the preprocessing and intensity perturbation strategy. We then detail the network architectures, training objective, and optimization procedure, and conclude with the overall training framework and evaluation setup.
3.1. Datasets
Two publicly available brain MRI datasets with identical four-class taxonomies—glioma, meningioma, pituitary adenoma, and no tumor—were used for the model development and cross-domain evaluation.
Source domain (Kaggle). The source dataset was the Brain MRI Images for Brain Tumor Detection dataset obtained from Kaggle. It consists of T1-weighted contrast-enhanced axial brain MRI slices provided in JPEG format. The images exhibit heterogeneous intensity distributions reflecting multi-institutional acquisition and preprocessing variability. Data are organized into predefined Training and Testing directories [40].
The original training set was split into training and validation subsets using stratified random sampling (80% training; 20% validation) to preserve the class proportions. The original testing directory was retained as an in-domain held-out test set comprising 1311 images, with the following class distribution: glioma (), meningioma (), no tumor (), and pituitary adenoma ().
Target domain (BRISC-2025). The Brain MRI Image Set for Classification 2025 (BRISC-2025) dataset was used exclusively for external validation to assess zero-shot generalization. This dataset contains 1000 images with a balanced class distribution (250 per class), acquired using standardized 3T imaging protocols. Compared to the Kaggle dataset, BRISC-2025 exhibits distinct intensity normalization and preprocessing characteristics. No samples from BRISC-2025 were used during training, fine-tuning, or hyperparameter selection [41].
3.2. Problem Formulation
Let $\mathcal{D}_S = \{(x_i, y_i)\}_{i=1}^{N_S}$ denote the source domain dataset, where $x_i$ represents an MRI image and $y_i$ denotes the corresponding tumor class label, with $y_i \in \{1, 2, 3, 4\}$. The objective is to learn a classifier $f_\theta$ parameterized by $\theta$ that achieves strong predictive performance on $\mathcal{D}_S$ while remaining robust when deployed on an unseen target domain $\mathcal{D}_T$. The source and target domains differ primarily due to intensity-related distribution shifts arising from acquisition and preprocessing variability. Importantly, no target domain samples are available during training, reflecting a realistic clinical deployment scenario.
3.3. Image Preprocessing and Intensity Perturbation
Each MRI image undergoes a standardized preprocessing pipeline to ensure consistent input dimensions while minimizing unnecessary variability. Images are converted to grayscale to remove channel redundancy and resized to a fixed spatial resolution using area-based interpolation, which preserves anatomical structure while performing effective downsampling. No skull stripping, tumor segmentation, or handcrafted region-of-interest extraction is applied, allowing the network to learn discriminative features directly from the full brain image. This minimal preprocessing strategy improves reproducibility and avoids reliance on external segmentation tools.
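As an illustration, the grayscale conversion and area-based downsampling can be sketched in plain numpy. This is a minimal sketch, not the authors' pipeline: the target resolution (224 here) is an assumption, since the exact value is elided in the text, and the block-averaging below reproduces area interpolation only when the input side is an integer multiple of the target size.

```python
import numpy as np

def preprocess(img, size=224):
    """Grayscale conversion + area-style downsampling (block averaging).

    `size` is a hypothetical target resolution. Block averaging mimics
    cv2.resize(..., interpolation=cv2.INTER_AREA) for integer factors.
    """
    if img.ndim == 3:                      # RGB -> grayscale (luma weights)
        img = img @ np.array([0.299, 0.587, 0.114])
    h, w = img.shape
    fh, fw = h // size, w // size          # integer block factors
    img = img[: fh * size, : fw * size]    # crop to an exact multiple
    return img.reshape(size, fh, size, fw).mean(axis=(1, 3))

demo = np.arange(448 * 448, dtype=float).reshape(448, 448)
out = preprocess(demo, size=224)
print(out.shape)  # (224, 224)
```

Because each output pixel is the mean of its source block, the global mean intensity of the image is preserved, which is one reason area interpolation is preferred for downsampling over nearest-neighbor sampling.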
Let $x$ denote the preprocessed image. To emulate realistic domain shifts while preserving anatomical structure, an intensity-perturbed view $\tilde{x}$ is generated using a stochastic, non-geometric transformation operator $T(\cdot)$. The perturbation operator includes the following: (i) global intensity scaling with a factor sampled uniformly from $[0.9, 1.1]$; (ii) additive zero-mean Gaussian noise with standard deviation equal to 1% of the image intensity standard deviation; and (iii) spatial smoothing via an average pooling operation applied with 50% probability. Formally, $\tilde{x} = T(x)$, where $T$ is applied independently to each training sample.
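The three perturbation components can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the scaling range [0.9, 1.1] follows the factor quoted in the feature-stability protocol later in the paper, while the 2×2 pooling kernel and the nearest-neighbor upsampling (used to keep the spatial size fixed) are assumptions.

```python
import numpy as np

def perturb(x, rng):
    """MRI-safe intensity perturbation T(x): a sketch of the three
    components. Kernel size 2x2 is an assumption; the source elides it."""
    # (i) global intensity scaling with a factor from [0.9, 1.1]
    x = x * rng.uniform(0.9, 1.1)
    # (ii) additive zero-mean Gaussian noise, sigma = 1% of the image std
    x = x + rng.normal(0.0, 0.01 * x.std(), size=x.shape)
    # (iii) 2x2 average pooling (then nearest upsampling to restore the
    # spatial size), applied with 50% probability
    if rng.random() < 0.5:
        h, w = x.shape
        pooled = x[: h - h % 2, : w - w % 2]
        pooled = pooled.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        x = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)
    return x

rng = np.random.default_rng(0)
img = rng.random((224, 224))
out = perturb(img, rng)
print(out.shape)  # (224, 224)
```

Note that all three operations act only on intensities, never on geometry, so anatomical structure and label semantics are preserved.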
3.4. Network Architecture and Feature Space Regularization
Each preprocessed image is forwarded through a Convolutional Neural Network (CNN) backbone to obtain a latent feature representation. Three widely used architectures are evaluated: ResNet-18, ResNet-34, and DenseNet-121. All networks are trained from scratch with random initialization and are adapted to accept single-channel grayscale input by modifying the first convolutional layer. The resulting latent feature dimensionality is 512 for the ResNet backbones and 1024 for DenseNet-121.
The central hypothesis of this study is that instability in latent feature representations under realistic intensity variations is a primary contributor to degraded cross-dataset generalization. To explicitly enforce robustness at the representation level, the proposed Feature Space Stability Regularization (FSSR) introduces an auxiliary loss that penalizes discrepancies between the embeddings extracted from the original and intensity-perturbed inputs. Let $z = f_\theta(x)$ and $\tilde{z} = f_\theta(\tilde{x})$ denote the feature embeddings of the original and perturbed images, respectively. The stability regularization loss is defined as
$$\mathcal{L}_{\text{FSSR}} = \left\lVert \frac{z}{\lVert z \rVert_2} - \frac{\tilde{z}}{\lVert \tilde{z} \rVert_2} \right\rVert_2^2,$$
which encourages consistency between normalized feature representations while preserving the discriminative structure.
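A batch-averaged version of the stability loss can be sketched in a few lines of numpy. This is a minimal sketch under the assumption that the embeddings are L2-normalized before the squared-distance is taken, as the text's reference to "normalized feature representations" suggests; the small epsilon is an implementation detail added for numerical safety.

```python
import numpy as np

def fssr_loss(z, z_tilde, eps=1e-8):
    """Stability loss between L2-normalized embeddings of original and
    perturbed views, averaged over the batch. Rows are samples."""
    zn = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    zt = z_tilde / (np.linalg.norm(z_tilde, axis=1, keepdims=True) + eps)
    return np.mean(np.sum((zn - zt) ** 2, axis=1))

z = np.array([[1.0, 0.0], [0.0, 2.0]])
print(fssr_loss(z, z))  # identical embeddings -> 0.0
```

Because only the directions of the embeddings are compared, the loss is invariant to the overall feature magnitude, which leaves the classifier free to scale activations for discriminative purposes.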
For supervised classification, the latent features are passed to a classifier head consisting of a single fully connected layer followed by a softmax activation. The categorical cross-entropy loss is used to optimize the classification performance.
3.5. Total Training Objective
The final training objective jointly optimizes discriminative accuracy and the robustness of feature representations to intensity-driven perturbations. The overall loss function is given by
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{FSSR}},$$
where $\lambda$ controls the relative contribution of the stability regularization term. In all experiments, a fixed value of $\lambda$, chosen through preliminary hyperparameter tuning, is used. Setting $\lambda = 0$ reduces the formulation to the standard cross-entropy objective, enabling direct ablation and assessment of the proposed regularization.
3.6. Overall Training Framework
During each training iteration, paired original and intensity-perturbed images are forwarded through the same CNN backbone with shared parameters. The classification loss is computed using the original images, while the FSSR loss measures the discrepancy between the feature embeddings of the original and perturbed inputs. Gradients from both objectives are backpropagated jointly, encouraging the network to learn intensity-invariant representations without sacrificing the discriminative performance.
Figure 1 provides a schematic overview of the proposed FSSR training framework.
3.7. Backbone Architectures and Rationale
Three Convolutional Neural Network (CNN) architectures with varying depth and connectivity are evaluated to assess the architectural generality and robustness of the proposed Feature Space Stability Regularization (FSSR) framework.
ResNet-18 and ResNet-34. Residual networks employ identity-based skip connections to stabilize optimization and alleviate the vanishing-gradient problem in deep neural networks. A residual block is defined as
$$y = F(x) + x,$$
where $x$ and $y$ denote the input and output feature maps of the block, respectively, and $F(\cdot)$ represents a sequence of convolutional, normalization, and non-linear operations.
ResNet-18, comprising approximately 11.7 million parameters, serves as a lightweight baseline architecture with limited depth. ResNet-34 increases the network depth and representational capacity to approximately 21.8 million parameters while preserving the same residual learning paradigm. Evaluating both variants enables the examination of the proposed regularization across different model capacity regimes.
DenseNet-121. DenseNet architectures adopt dense connectivity, in which each layer receives as input the concatenation of the feature maps of all preceding layers within a dense block. The output of the ℓ-th layer is defined as
$$x_\ell = H_\ell\!\left([x_0, x_1, \ldots, x_{\ell-1}]\right),$$
where $[\cdot]$ denotes feature-map concatenation and $H_\ell(\cdot)$ represents a composite function consisting of batch normalization, convolution, and activation.
This connectivity pattern encourages feature reuse, improves gradient flow, and promotes parameter efficiency. DenseNet-121 contains approximately 8.0 million parameters and produces higher-dimensional latent representations compared to the ResNet backbones. Including DenseNet-121 allows for the evaluation of FSSR under a fundamentally different feature aggregation mechanism.
A summary of the backbone architectures and their key characteristics is provided in Table 2, and a summary of the models’ characteristics is provided in Table 3.
All models are trained end-to-end using identical optimization settings to ensure a fair comparison across architectures and training strategies. Optimization is performed using the AdamW optimizer with a fixed initial learning rate and a batch size of 32 for all experiments. Network parameters are initialized using the default PyTorch 2.1.2 initialization scheme, and the learning rate is kept constant throughout training to maintain consistent experimental conditions across all evaluated models.
Training continues until convergence, with an early stopping strategy applied to prevent overfitting and to ensure that model selection reflects generalization performance rather than training convergence. During training, model performance is continuously monitored on a held-out validation subset. Early stopping is triggered when the validation loss does not improve for five consecutive evaluation cycles. When this criterion is met, training is terminated and the model parameters corresponding to the best validation performance are restored for final evaluation. This strategy prevents unnecessary optimization once the model begins to overfit the training data and ensures that the selected model corresponds to the most generalizable configuration observed during training.
Using validation-based early stopping also stabilizes the optimization process and reduces the sensitivity to training noise. In practice, it avoids prolonged training after convergence while maintaining a consistent training behavior across different architectures. Since the ResNet-18, ResNet-34, and DenseNet-121 backbones differ substantially in relation to their representational capacity and optimization dynamics, early stopping provides a unified and architecture-agnostic mechanism for determining when training should terminate.
Mixed precision training (FP16) is employed to accelerate computation and reduce GPU memory usage while maintaining numerical stability. All experiments are conducted on a single NVIDIA RTX 3090 GPU with 24 GB of memory. Apart from the proposed Feature Space Stability Regularization (FSSR) objective, no architectural modifications, auxiliary networks, or additional supervision signals are introduced, ensuring that the observed performance improvements arise solely from the proposed regularization mechanism.
The regularization weight $\lambda$ is set to a moderate value that balances classification performance and feature space stability. Preliminary ablation experiments indicate that smaller values of $\lambda$ provide insufficient regularization and yield only marginal improvements, whereas excessively large values overly constrain the feature representations and hinder discriminative learning. The same $\lambda$ is therefore adopted for all experiments to maintain a stable trade-off between robustness and predictive accuracy while avoiding architecture-specific hyperparameter tuning.
3.8. Feature Space Stability Measurement Protocol
To quantify representation robustness and provide mechanistic insight into the effect of FSSR, feature space stability is assessed by measuring the deviation between embeddings extracted from original and intensity-perturbed inputs. Specifically, the mean feature deviation across the evaluation set is computed as
$$\bar{d} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert f_\theta(x_i) - f_\theta(\tilde{x}_i) \right\rVert_2,$$
where $f_\theta(x_i)$ and $f_\theta(\tilde{x}_i)$ denote the latent feature representations of the original and perturbed inputs, respectively, and $N$ is the number of samples. Lower values of $\bar{d}$ indicate increased stability of the learned representations under intensity perturbations, suggesting improved robustness to domain shift.
3.9. External Generalization Evaluation
The generalization performance is evaluated via zero-shot external testing on the unseen BRISC-2025 dataset, which was acquired using different scanner hardware and imaging protocols than the source training data. No retraining, fine-tuning, or Domain Adaptation is performed at any stage. The models trained on the source domain are directly applied to the external dataset without modification.
The classification performance is assessed using the overall accuracy and macro-averaged F1-score to account for potential class imbalance. In addition, the per-class precision and recall are reported to identify systematic misclassification patterns and class-specific generalization behavior.
4. Results
This section presents a comprehensive evaluation of the proposed Feature Space Stability Regularization (FSSR) framework for brain tumor MRI classification. The experimental results are reported across multiple dimensions: (i) source domain classification performance on the Kaggle Brain MRI test set (1311 images); (ii) zero-shot external generalization under domain shift using the BRISC-2025 dataset (1000 images); (iii) model calibration quality assessed using Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Brier score, and Negative Log-Likelihood (NLL); (iv) statistical significance analysis via bootstrap resampling with 95% confidence intervals; and (v) feature space stability analysis providing mechanistic insight into the effectiveness of FSSR.
All experiments are conducted using three backbone architectures—ResNet-18, ResNet-34, and DenseNet-121—trained from scratch without ImageNet pretraining. Unless otherwise stated, FSSR denotes the Feature Space Stability Regularization framework with the stability regularization term enabled. ECE denotes the Expected Calibration Error and NLL denotes the Negative Log-Likelihood.
4.1. Source Domain Classification Performance
Table 4 summarizes the classification and calibration performance on the Kaggle Brain MRI test set. Overall, Feature Space Stability Regularization (FSSR) yields consistent performance gains for deeper architectures while maintaining comparable performance for shallower networks.
Among all the evaluated models, ResNet-34 with FSSR achieves the highest classification accuracy of 97.71% and a macro-averaged F1-score of 97.55%, representing an absolute improvement of 1.98 percentage points over its baseline counterpart (95.73%). DenseNet-121 similarly benefits from the proposed regularization, with its accuracy improving from 96.03% to 97.64% (+1.61%) and the macro-F1 increasing from 95.75% to 97.47%. In addition to accuracy gains, both architectures exhibit substantial improvements in calibration metrics, including a reduced Expected Calibration Error (ECE), Brier score, and Negative Log-Likelihood (NLL).
In contrast, ResNet-18 demonstrates a marginal reduction in accuracy under FSSR (95.50% to 95.27%). This suggests that shallow architectures with limited representational capacity may not fully benefit from explicit feature space stability constraints. This architecture-dependent behavior is further examined through feature stability analysis in Section 4.7.
A detailed breakdown of the misclassification patterns is provided through the confusion matrices shown in Figure 2, Figure 3 and Figure 4. Across all architectures, the dominant source of error corresponds to confusion between the glioma and meningioma classes, a clinically relevant distinction due to differing treatment strategies. For ResNet-34, FSSR reduces glioma-to-meningioma misclassifications from 30 cases under baseline training to 11 cases, corresponding to a 63% reduction. Meningioma-to-glioma errors remain stable at eight cases for both methods.
Notably, all FSSR models achieve a perfect classification performance for the no tumor class (405/405 correct predictions), which is critical for minimizing false positive diagnoses in healthy patients.
4.2. Receiver Operating Characteristic Analysis
The receiver operating characteristic (ROC) analysis presented in Figure 5, Figure 6 and Figure 7 demonstrates excellent discriminative performance across all model configurations. ResNet-34 with Feature Space Stability Regularization (FSSR) achieves perfect or near-perfect per-class AUC values on the source domain, including glioma (0.998), meningioma (0.997), no_tumor (1.000), and pituitary (1.000), resulting in a macro-AUC of 0.998. DenseNet-121 FSSR exhibits comparable performance, achieving the same macro-AUC value. Comprehensive classification metrics for all the evaluated architectures on the primary Kaggle Brain MRI test set are summarized in Table 4.
4.3. External Generalization Under Domain Shift
The central hypothesis of this work is that Feature Space Stability Regularization improves cross-dataset generalization without requiring target domain data during training. To evaluate this, all the trained models were tested on the BRISC-2025 external validation dataset (N = 1000 images), which differs substantially from the source domain in terms of the scanner hardware, imaging protocols, and patient demographics. Importantly, no retraining or Domain Adaptation was performed—models trained exclusively on Kaggle data were applied directly in a zero-shot transfer setting.
As shown in Table 5, the most notable finding is the substantial improvement for ResNet-34: its accuracy increases from 85.50% (baseline) to 93.70% (FSSR), representing a gain of +8.20 percentage points (p < 0.001) and a 56% error reduction. The macro-F1 improvement is even larger, rising from 80.12% to 92.62% (+12.50 points), demonstrating that the FSSR benefits extend to class-balanced performance.
DenseNet-121 with FSSR achieves the best absolute external performance at 96.70% accuracy and 96.87% macro-F1. The domain gap decreases from 1.83% (baseline) to 0.94%—a 49% relative reduction—indicating that dense connectivity combined with feature space regularization yields inherently robust representations. ResNet-18 presents an exception: both the baseline and FSSR achieve identical 90.50% accuracy, with only marginal macro-F1 improvement (89.38% → 89.49%), suggesting that shallow networks may lack the representational capacity to exploit feature space constraints effectively.
These architecture-dependent patterns provide key insights into the interaction between network capacity and regularization. The ResNet-34 baseline exhibits the largest domain gap (10.23%) despite strong source performance (95.73%), indicating that deeper residual networks are prone to overfitting domain-specific features without explicit regularization. FSSR reduces this gap to 4.01%, a 61% relative improvement. In contrast, the DenseNet-121 baseline already generalizes effectively (1.83% gap), likely due to dense connectivity promoting feature reuse; nevertheless, FSSR still provides measurable gains.
4.4. Statistical Significance Analysis
To confirm that the observed improvements are statistically meaningful rather than due to sampling variability, we performed bootstrap hypothesis testing. For each model comparison, 1000 bootstrap samples were generated from test set predictions, and 95% confidence intervals (CI) were computed for accuracy differences between FSSR and baseline. The results were considered statistically significant when the CI excluded zero.
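The paired bootstrap described above can be sketched as follows. This is an illustrative sketch under the stated protocol (1000 resamples, 95% percentile intervals); the seed and the per-sample correctness encoding are assumptions.

```python
import numpy as np

def bootstrap_ci(correct_a, correct_b, n_boot=1000, seed=42):
    """Paired bootstrap over test-set predictions: resample indices with
    replacement, recompute the accuracy difference (A - B), and return
    the 2.5/97.5 percentiles of the resampled differences."""
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample with replacement
        diffs.append(correct_a[idx].mean() - correct_b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi

# toy example: model A correct on ~95% of items, model B on ~85%
rng = np.random.default_rng(0)
a = (rng.random(1000) < 0.95).astype(float)
b = (rng.random(1000) < 0.85).astype(float)
lo, hi = bootstrap_ci(a, b)
print(lo > 0)  # CI excludes zero -> the difference is significant
```

Resampling indices jointly for both models preserves the pairing between predictions on the same test images, which is what makes the interval a valid test of the accuracy difference.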
Table 6 summarizes the complete statistical analysis. On the source domain (Kaggle), FSSR yields significant improvements for ResNet-34 (Δ = +1.99%; 95% CI: [0.99%, 3.13%]) and DenseNet-121 (Δ = +1.59%; 95% CI: [0.46%, 2.75%]). ResNet-18 shows no significant difference (Δ = +0.23%; 95% CI: [−0.84%, 1.37%]).
Under domain shift (BRISC-2025), the statistical evidence strengthens considerably. ResNet-34 FSSR achieves the largest effect size observed in this study (+8.25%; 95% CI: [6.10%, 10.30%]), with the narrow confidence interval indicating high statistical confidence. DenseNet-121 FSSR also demonstrates significant improvement (+2.49%; 95% CI: [1.00%, 4.10%]), while ResNet-18 remains non-significant.
4.5. Model Calibration Under Domain Shift
Beyond discriminative accuracy, well-calibrated probability estimates are essential for clinical decision support systems where prediction confidence directly influences diagnostic workflows [42]. A perfectly calibrated model produces predicted probabilities that match empirical accuracy: when predicting 80% confidence, the model should be correct approximately 80% of the time. In medical imaging applications, miscalibrated predictions can lead to inappropriate clinical decisions, making calibration assessment a critical component of model evaluation [43].
To quantify calibration quality, we evaluate four widely adopted metrics: the Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Brier score, and Negative Log-Likelihood (NLL) [44,45].
The ECE measures the average discrepancy between predicted confidence and empirical accuracy across $M$ confidence bins and is defined as
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $B_m$ denotes the set of samples whose predicted confidence falls within bin $m$, $\mathrm{acc}(B_m)$ is the empirical accuracy of that bin, and $\mathrm{conf}(B_m)$ is the mean predicted probability [44].
The Maximum Calibration Error (MCE) captures the worst-case deviation between confidence and accuracy:
$$\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$$
The Brier score evaluates the mean squared difference between the predicted probabilities $\hat{p}_{i,k}$ and the one-hot true labels $y_{i,k}$ [46]:
$$\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left( \hat{p}_{i,k} - y_{i,k} \right)^2.$$
Finally, the Negative Log-Likelihood (NLL) measures the log probability assigned to the correct class:
$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{i, y_i}.$$
Lower values indicate better-calibrated and more reliable probability estimates. These metrics are widely used to evaluate predictive uncertainty in deep learning models, particularly in safety-critical medical applications [43,44,45].
On the source domain, all models exhibit acceptable calibration with low ECE values. However, under domain shift, baseline models show substantial calibration degradation. Most notably, the ResNet-34 baseline exhibits an ECE of approximately 0.117 on BRISC-2025, indicating that the predicted confidence systematically overestimates true accuracy by approximately 11.7 percentage points, a potentially critical issue in clinical settings where overconfident incorrect predictions may mislead clinicians.
As summarized in Table 7, the proposed FSSR substantially improves calibration under domain shift. For example, ResNet-34 with FSSR reduces the ECE by 66% relative to its baseline. Similarly, the Brier score improves by 60%. DenseNet-121 combined with FSSR achieves the best calibration overall, with an ECE 49% lower than its baseline. These results indicate that enforcing feature space stability not only improves cross-domain classification performance but also produces probability estimates that more faithfully reflect model uncertainty, which is critical for reliable deployment in medical decision support systems [42].
4.6. Comparative Baselines
To contextualize the effectiveness of the proposed Feature Space Stability Regularization (FSSR), we compare it against representative approaches designed to improve robustness and generalization under domain shift. Specifically, we consider two widely used strategies that operate at different stages of the learning process: feature distribution mixing and contrastive representation learning.
MixStyle. MixStyle is a Domain Generalization technique that improves robustness by mixing instance-level feature statistics during training [47]. Given a feature map $F$ extracted from an intermediate network layer, MixStyle perturbs the feature distribution by randomly mixing the channel-wise mean and standard deviation between two samples within a mini-batch.
Let $\mu(F)$ and $\sigma(F)$ denote the channel-wise mean and standard deviation of the feature map. MixStyle generates a new feature representation as
$$\tilde{F} = \sigma_{\mathrm{mix}} \, \frac{F - \mu(F)}{\sigma(F)} + \mu_{\mathrm{mix}},$$
where $\mu_{\mathrm{mix}} = \lambda_{\mathrm{mix}} \mu(F) + (1 - \lambda_{\mathrm{mix}})\,\mu(F')$ and $\sigma_{\mathrm{mix}} = \lambda_{\mathrm{mix}} \sigma(F) + (1 - \lambda_{\mathrm{mix}})\,\sigma(F')$ are the statistics mixed with those of a second sample $F'$, and $\lambda_{\mathrm{mix}} \sim \mathrm{Beta}(\alpha, \alpha)$ controls the mixing strength. This operation encourages the model to learn representations that are less sensitive to domain-specific feature statistics.
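The statistics-mixing operation can be sketched for a single pair of feature maps. This is a simplified sketch of the idea, not the reference implementation: the Beta concentration α = 0.1 follows the original MixStyle paper and is an assumption here, and the operation is shown per sample rather than vectorized over a mini-batch.

```python
import numpy as np

def mixstyle(f, f2, alpha=0.1, rng=None):
    """Mix channel-wise feature statistics of f with those of a second
    sample f2. f, f2: (C, H, W) feature maps."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # mixing coefficient
    mu, sig = f.mean((1, 2), keepdims=True), f.std((1, 2), keepdims=True) + 1e-6
    mu2, sig2 = f2.mean((1, 2), keepdims=True), f2.std((1, 2), keepdims=True) + 1e-6
    mu_mix = lam * mu + (1 - lam) * mu2        # mixed channel-wise mean
    sig_mix = lam * sig + (1 - lam) * sig2     # mixed channel-wise std
    return sig_mix * (f - mu) / sig + mu_mix   # re-style the normalized map

rng = np.random.default_rng(0)
f, f2 = rng.random((64, 8, 8)), rng.random((64, 8, 8))
out = mixstyle(f, f2, rng=rng)
print(out.shape)  # (64, 8, 8)
```

Note the contrast with FSSR: MixStyle perturbs the feature statistics themselves, whereas FSSR perturbs the input intensities and constrains the resulting features to stay put.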
SimCLR-based Representation Learning. SimCLR is a contrastive self-supervised learning framework that learns invariant feature representations by maximizing the agreement between augmented views of the same image [48]. Given two augmented views $x_i$ and $x_j$ of the same image, their corresponding feature embeddings $z_i$ and $z_j$ are encouraged to be similar in the representation space while remaining distinct from the other samples in the batch.
The contrastive loss used in SimCLR is defined as
$$\mathcal{L}_{i,j} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)},$$
where $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a temperature scaling parameter, and $B$ is the mini-batch size.
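The NT-Xent loss can be sketched in numpy for a batch of paired views. This is an illustrative sketch, not the comparison baseline's code; the temperature τ = 0.5 and the layout (views stacked so that row i pairs with row i + B) are assumptions.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss. z1, z2: (B, D) embeddings of two views;
    matching rows are positive pairs."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = (np.arange(n) + n // 2) % n                  # index of each positive
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
loss_aligned = nt_xent(z1, z1)                   # perfectly matching views
loss_random = nt_xent(z1, rng.normal(size=(8, 32)))
print(loss_aligned < loss_random)  # aligned views give lower loss
```

Unlike the FSSR loss, which only pulls paired embeddings together, NT-Xent also pushes apart all other samples in the batch, which is why it typically requires larger batches and a pretraining stage.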
FSSR (Proposed Method). The proposed Feature Space Stability Regularization explicitly enforces consistency between the feature representations extracted from the original and intensity-perturbed MRI images. Given an input image $x$ and its perturbed version $\tilde{x} = T(x)$, the stability constraint is defined as
$$\mathcal{L}_{\text{FSSR}} = \left\lVert \frac{f_\theta(x)}{\lVert f_\theta(x) \rVert_2} - \frac{f_\theta(\tilde{x})}{\lVert f_\theta(\tilde{x}) \rVert_2} \right\rVert_2^2.$$
The overall training objective becomes
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{FSSR}}.$$
Unlike MixStyle or SimCLR, which modify the feature distributions or require contrastive pretraining, FSSR directly constrains representation stability during supervised training without introducing additional networks or training stages.
Experimental Setup for Fair Comparison. To ensure a rigorous comparison, all the evaluated methods were trained using identical experimental settings. Specifically, all models used the same random seed (seed = 42), identical training–validation splits, preprocessing pipelines, optimizer configurations (AdamW with a fixed learning rate and batch size 32), and early stopping criteria. This controlled setup ensures that any differences in performance arise from the regularization strategies rather than confounding factors such as initialization or data ordering.
The resulting performance across architectures is summarized in Table 8. The results report both the source domain (Kaggle) performance and zero-shot external domain (BRISC-2025) performance, allowing for the direct comparison of cross-domain robustness.
Analysis of Results. Architecture-dependent effectiveness. FSSR demonstrates the strongest improvements on DenseNet-121, achieving 97.64% source domain accuracy and 96.70% external domain accuracy, corresponding to a domain gap of only 0.94%. This substantially outperforms SimCLR (2.65% gap) and MixStyle (3.27% gap).
Performance on ResNet architectures. For ResNet-18, FSSR maintains a comparable source domain accuracy (95.50% baseline vs 95.27% FSSR) while achieving an identical external accuracy of 90.50% on BRISC-2025. The domain gap shows only a marginal reduction from 5.00% to 4.77%, consistent with the observation that shallow architectures derive limited benefit from feature space stability constraints.
For ResNet-34, the improvements are substantially more pronounced. FSSR increases the source accuracy from 95.73% to 97.71% and external accuracy from 85.50% to 93.70%, reducing the domain gap from 10.23% to 4.01%. These results indicate that deeper architectures benefit substantially from Feature Space Stability Regularization.
SimCLR limitations. Although SimCLR slightly improves the source domain accuracy for some architectures, it does not consistently reduce the domain shift. For example, on ResNet-18, the domain gap increases from 5.00% to 8.88%, suggesting that contrastive representation learning alone may not adequately address the intensity-driven distribution shifts common in medical imaging.
MixStyle trade-offs. MixStyle generally reduces the domain gap compared to baseline models, but often at the expense of reduced source domain performance. For instance, on ResNet-18 the source accuracy drops from 95.50% to 91.30%, indicating that distribution mixing may compromise discriminative performance while improving robustness.
Overall comparison. Across all architectures, FSSR consistently achieves the best balance between source domain accuracy and cross-domain generalization. The method reduces domain gaps while maintaining a strong classification performance, and does so without requiring architectural modifications or additional training stages. This simplicity makes FSSR a practical and effective approach for improving robustness in medical imaging applications subject to domain shift.
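The domain gap used throughout this comparison is simply the difference between source and zero-shot external accuracy. A minimal helper (hypothetical names; the FSSR figures are taken from the DenseNet-121 row of Table 8, the SimCLR/MixStyle gaps are those quoted above) illustrates the selection of the most robust method:

```python
def domain_gap(source_acc: float, external_acc: float) -> float:
    """Domain gap = source accuracy minus zero-shot external accuracy (percentage points)."""
    return source_acc - external_acc

# DenseNet-121 comparison: FSSR gap computed from Table 8 accuracies;
# SimCLR and MixStyle gaps are quoted directly in the text.
results = {
    "FSSR": domain_gap(97.64, 96.70),
    "SimCLR": 2.65,
    "MixStyle": 3.27,
}
best = min(results, key=results.get)  # method with the smallest domain gap
print(best, round(results["FSSR"], 2))
```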
4.7. Feature Space Stability Analysis
To validate the mechanistic hypothesis underlying FSSR—that regularizing feature space stability leads to more robust representations—we analyzed embedding variability under input perturbations. For each validation sample x, we computed the ℓ2 distance between the feature vectors of the original image and its intensity-perturbed counterpart:

Δ(x) = ‖f(x) − f(x̃)‖₂,

where x̃ represents an intensity-scaled version of x (scale factor sampled from [0.9, 1.1]) and f(·) denotes the learned 512-dimensional (ResNet) or 1024-dimensional (DenseNet) feature extractor.
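As a concrete illustration, this per-sample embedding deviation can be computed from extracted features. The NumPy sketch below mocks the trained feature extractor with a random linear projection (an assumption made only to keep the example self-contained); the intensity perturbation follows the scale range stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(batch: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Stand-in for the learned extractor f(.): flatten each image + linear projection."""
    return batch.reshape(batch.shape[0], -1) @ proj

def feature_deviation(batch: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """L2 distance between embeddings of each image and its intensity-scaled copy."""
    scales = rng.uniform(0.9, 1.1, size=(batch.shape[0], 1, 1, 1))
    perturbed = batch * scales  # global intensity scaling, as described in the text
    return np.linalg.norm(
        extract_features(batch, proj) - extract_features(perturbed, proj), axis=1
    )

images = rng.random((8, 1, 32, 32))          # 8 toy grayscale validation samples
proj = rng.standard_normal((32 * 32, 512))   # mock 512-dim embedding (ResNet-like)
dev = feature_deviation(images, proj)
print(dev.shape)  # one deviation value per sample
```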
As shown in
Table 9, FSSR consistently reduces both the mean and variance of feature deviations across all architectures. ResNet-18 shows the mean deviation decreasing from 4.513 to 3.602 (−20.2%) with the standard deviation reduced from 5.182 to 2.467 (−52.4%). ResNet-34 exhibits similar improvement: the mean reduces from 4.506 to 3.499 (−22.3%) and the standard deviation from 5.136 to 2.313 (−55.0%). DenseNet-121 achieves the largest relative reduction: the mean reduces from 4.565 to 3.474 (−23.9%) and the standard deviation from 5.157 to 2.168 (−58.0%).
The histogram analysis in
Figure 8,
Figure 9 and
Figure 10 reveals qualitative differences in feature stability distributions. Baseline models exhibit long-tailed distributions with deviations extending beyond 40–50 units, indicating dramatic representation instability for certain inputs, likely the most vulnerable samples under domain shift. FSSR effectively eliminates these high-deviation outliers, reducing the 99th percentile deviation by approximately 60% across all the architectures.
This stabilization provides a plausible explanation for the improved generalization observed in deeper architectures. Representations that are invariant to intensity perturbations during training, which simulate inter-scanner variability, are inherently more robust when encountering novel acquisition protocols at test time. For ResNet-34 and DenseNet-121, the correlation between stability improvements and generalization gains is consistent and substantial. However, ResNet-18 presents a notable exception: despite comparable reductions in feature deviation (−20.2%) and variance (−52.4%), no meaningful generalization improvement is observed. This suggests that feature space stability is a necessary but not sufficient condition for cross-domain robustness, and that sufficient representational capacity is required for stability gains to translate into improved classification performance. Overall, these results support a strong empirical association between latent representation stability and cross-domain generalization in architectures with adequate depth, rather than a universal causal relationship.
4.8. Ablation Study: Regularization Weight
To investigate the sensitivity of FSSR to the regularization weight λ, we conducted a systematic ablation study across all three backbone architectures. The models were trained over a range of λ values, where λ = 0 corresponds to standard cross-entropy training without feature space regularization. All other training parameters remained identical to ensure a fair comparison.
Table 10,
Table 11 and
Table 12 present the complete results of the
λ sensitivity analysis. Several key observations emerge from this experiment:
Effect of λ on cross-domain generalization. For deeper architectures, introducing Feature Space Stability Regularization (λ > 0) substantially improves generalization to the external BRISC-2025 dataset compared to baseline training (λ = 0). For ResNet-34, BRISC accuracy increases from 85.50% (λ = 0) to 93.70% at the best setting, representing an 8.20 percentage point improvement and a 56% reduction in the domain gap. DenseNet-121 shows similar trends, with the domain gap decreasing from 1.83% to 0.94% at the optimal value. However, for ResNet-18, FSSR does not yield consistent generalization gains: most λ values result in marginally lower BRISC accuracy than the baseline (λ = 0), with only the best-performing setting matching the baseline accuracy of 90.50%. This architecture-dependent behavior suggests that shallow networks with limited representational capacity may be unable to simultaneously satisfy the classification and stability objectives, and that the benefits of FSSR are contingent on sufficient model depth.
Optimal λ range. The results indicate that moderate values of λ (0.02–0.05) consistently yield the best trade-off between source domain accuracy and cross-domain generalization. Very small values provide insufficient regularization, while excessively large values over-constrain the feature space and reduce discriminative capacity.
Architecture-dependent sensitivity. Deeper architectures exhibit greater sensitivity to λ and derive larger benefits from feature space regularization. ResNet-18 shows only modest improvement across all λ values (domain gap reduction from 5.00% to 4.77%), whereas ResNet-34 demonstrates a dramatic improvement (10.23% to 4.01%). This observation is consistent with the hypothesis that deeper networks possess the representational capacity to simultaneously optimize classification accuracy and feature stability constraints.
Based on these results, a value of λ within the optimal range was selected for all main experiments, as it provides a stable balance between classification performance and cross-domain robustness across architectures and avoids the need for architecture-specific hyperparameter tuning.
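The objective studied in this ablation—cross-entropy plus a λ-weighted feature-stability penalty—can be sketched as follows. The squared-distance form of the penalty and the value λ = 0.05 are illustrative assumptions (the results above only establish that moderate λ in 0.02–0.05 works best):

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy over a batch (labels are integer class indices)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def fssr_loss(logits, labels, feats_clean, feats_perturbed, lam=0.05):
    """Total loss = CE + lam * mean squared feature deviation (stability term)."""
    stability = float(((feats_clean - feats_perturbed) ** 2).sum(axis=1).mean())
    return cross_entropy(logits, labels) + lam * stability

rng = np.random.default_rng(1)
logits = rng.standard_normal((4, 4))  # 4 samples, 4 tumor classes
labels = np.array([0, 1, 2, 3])
f_clean = rng.standard_normal((4, 512))
loss0 = fssr_loss(logits, labels, f_clean, f_clean)        # identical features
loss1 = fssr_loss(logits, labels, f_clean, f_clean + 0.1)  # perturbed features
print(loss1 > loss0)  # the stability penalty only ever increases the loss
```

At λ = 0 the second term vanishes and the objective reduces to the standard cross-entropy baseline, matching the ablation's reference configuration.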
To investigate the contribution of each perturbation component to the overall effectiveness of FSSR, we conducted a systematic ablation study examining all possible combinations of the three perturbation types: intensity scaling (S), Gaussian noise injection (N), and spatial smoothing (B, for blur). The models were trained with the selected regularization weight λ
using each perturbation configuration, and evaluated on both the source domain (Kaggle) and external domain (BRISC-2025). The “None” configuration corresponds to standard cross-entropy training without any perturbations (equivalent to the baseline reported in
Table 4 and
Table 5), while “Full” denotes the complete FSSR pipeline combining all three components.
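The three perturbation components can be sketched in NumPy as follows. Only the intensity-scale range [0.9, 1.1] is stated in the text; the noise level and the box-blur stand-in for spatial smoothing are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def intensity_scale(img: np.ndarray) -> np.ndarray:
    """(S) Global intensity scaling with a factor drawn from [0.9, 1.1]."""
    return img * rng.uniform(0.9, 1.1)

def add_noise(img: np.ndarray, sigma: float = 0.02) -> np.ndarray:
    """(N) Additive Gaussian noise; sigma is an assumed value."""
    return img + rng.normal(0.0, sigma, size=img.shape)

def smooth(img: np.ndarray) -> np.ndarray:
    """(B) Spatial smoothing via a simple 3x3 box blur (stand-in for Gaussian blur)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + img.shape[0],
                          1 + dx : 1 + dx + img.shape[1]]
    return out / 9.0

def full_fssr_perturbation(img: np.ndarray) -> np.ndarray:
    """'Full' configuration: Scale + Noise + Smooth applied in sequence."""
    return smooth(add_noise(intensity_scale(img)))

img = rng.random((32, 32))
out = full_fssr_perturbation(img)
print(out.shape == img.shape)
```

Dropping any one call from `full_fssr_perturbation` yields the corresponding two-component configuration evaluated in the ablation.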
Key Findings. The ablation study reveals several important insights regarding the contribution of each perturbation component:
(1) Full FSSR consistently achieves the best cross-domain generalization. Across all three architectures, the complete perturbation pipeline (Scale + Noise + Smooth) yields the highest external accuracy and the lowest domain gap. For ResNet-34, Full FSSR achieves 93.70% BRISC accuracy with only a 4.01% domain gap, substantially outperforming the best single-component configuration (Noise: 90.80%; 5.38% gap) and the best two-component configuration (Noise+Smooth: 92.10%; 4.47% gap). For DenseNet-121, Full FSSR attains 96.70% external accuracy with the lowest domain gap of 0.94%, improving over the best partial combination (Noise + Smooth: 96.30%; 1.03% gap). For ResNet-18, the baseline already achieves 90.50% external accuracy, and Full FSSR is the only configuration that matches this performance while reducing the domain gap from 5.00% to 4.77%, whereas all partial configurations yield a lower BRISC accuracy than the baseline.
(2) Individual components provide partial but incomplete robustness. Among the single-component configurations, Gaussian noise injection provides the largest improvement in cross-domain generalization, particularly for ResNet-34 where it reduces the domain gap from 10.23% to 5.38%. However, noise alone does not match the comprehensive robustness achieved by the full pipeline. Intensity scaling improves the performance modestly across architectures, while spatial smoothing alone can paradoxically increase the domain gap for DenseNet-121 (2.12% vs 1.83% baseline), likely due to loss of fine-grained discriminative features.
(3) Synergistic effects emerge from component combinations. Two-component configurations consistently outperform individual components, and the full pipeline achieves a performance that exceeds all partial combinations. For ResNet-34, Scale+Noise achieves 91.40% BRISC accuracy and Noise+Smooth reaches 92.10%, yet Full FSSR attains 93.70%, demonstrating that the combination of all three perturbations provides complementary regularization effects that address different aspects of intensity-related domain shift. This progressive improvement from single to double to full configurations is consistent across all the architectures, supporting the interpretation that each perturbation component addresses a distinct source of intensity variability.
(4) Full FSSR provides architecture-agnostic robustness. Unlike partial configurations where optimal single or double combinations vary by architecture (e.g., Noise + Smooth for ResNet-34; Scale+Smooth for DenseNet-121), the complete FSSR pipeline consistently achieves the best performance across all evaluated backbones. This architecture-agnostic behavior eliminates the need for component selection and supports deployment in diverse clinical settings.
Table 16 summarizes the improvement achieved by Full FSSR over the baseline and the best single-component configuration.
These results demonstrate that the full FSSR pipeline, combining intensity scaling, Gaussian noise, and spatial smoothing, provides superior cross-domain generalization compared to any single component or partial combination across all architectures. The comprehensive perturbation strategy ensures robustness against diverse sources of intensity-related domain shift, including global intensity variations, local intensity fluctuations, and resolution differences. The consistent superiority of Full FSSR supports its adoption as the default configuration for clinical deployment where cross-scanner robustness is essential. For shallower architectures such as ResNet-18, the baseline already generalizes effectively, and FSSR primarily contributes through an improved domain gap rather than absolute accuracy gains.
4.9. Summary of Key Findings
The comprehensive experimental evaluation supports the following principal conclusions regarding the effectiveness of Feature Space Stability Regularization (FSSR) for brain tumor MRI classification:
(1) Source domain performance: FSSR improves classification performance for sufficiently deep architectures. ResNet-34 achieves the best overall results (97.71% accuracy; 97.55% macro-F1), with statistically significant gains for both ResNet-34 (+1.99%) and DenseNet-121 (+1.59%). The performance gains are primarily driven by reduced glioma–meningioma confusion.
(2) Cross-domain generalization: FSSR substantially enhances zero-shot transfer to the external BRISC-2025 dataset. ResNet-34 shows an 8.20% accuracy improvement (85.50% to 93.70%), corresponding to a 56% reduction in error rate. DenseNet-121 achieves the highest external accuracy (96.70%) with a minimal domain gap of 0.94%, without any target domain exposure during training.
(3) Calibration reliability: FSSR produces markedly better-calibrated predictions under domain shift. ResNet-34 reduces the Expected Calibration Error (ECE) from 0.1166 to 0.0400 (66% reduction), while DenseNet-121 achieves the best calibration overall (ECE = 0.0166).
(4) Architecture dependence: The benefits of FSSR scale with the model’s depth. Shallow networks (ResNet-18) show no significant improvement, whereas deeper architectures (ResNet-34 and DenseNet-121) consistently benefit, indicating that sufficient representational capacity is required to exploit feature space constraints.
(5) Mechanistic and clinical relevance: FSSR reduces feature embedding variance by 52–58% under intensity perturbations across all architectures. For ResNet-34 and DenseNet-121, these stability gains are accompanied by substantial generalization improvements, supporting a strong empirical association between the representation stability and cross-domain performance. However, the absence of generalization gains for ResNet-18 despite similar stability improvements indicates that this relationship is contingent on sufficient model capacity. The perfect no_tumor classification on the Kaggle test set (405/405) highlights the method’s potential for safe clinical deployment.
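The Expected Calibration Error quoted in finding (3) bins predictions by confidence and averages the |accuracy − confidence| gap, weighted by bin occupancy. A minimal sketch (the 15-bin setting is an assumed default, not stated in the text):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 15) -> float:
    """ECE: occupancy-weighted mean |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

# Toy case: the same 95%-confident predictions, first mostly right, then only half right.
ece_good = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 1])
ece_bad = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [0, 0, 1, 1])
print(ece_bad > ece_good)  # overconfident predictions produce a larger ECE
```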