Article

Improving Cross-Domain Generalization in Brain MRIs via Feature Space Stability Regularization

by Shawon Chakrabarty Kakon, Harishik Dev Singh Jamwal and Saurabh Singh *
Department of Artificial Intelligence and Big Data, Woosong University, Daejeon 34606, Republic of Korea
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(6), 1082; https://doi.org/10.3390/math14061082
Submission received: 27 February 2026 / Revised: 17 March 2026 / Accepted: 20 March 2026 / Published: 23 March 2026

Abstract

Deep learning models for brain tumor classification from magnetic resonance imaging (MRI) often achieve high in-dataset accuracy but exhibit substantial performance degradation when evaluated on unseen clinical data due to domain shift arising from variations in imaging protocols and intensity distributions. Existing approaches largely rely on architectural scaling or parameter-level regularization, which do not explicitly constrain the stability of learned feature representations. This manuscript proposes Feature Space Stability Regularization (FSSR), a lightweight and model-agnostic training framework that enforces consistency in latent feature representations under realistic, MRI-safe intensity perturbations. FSSR introduces an auxiliary feature space loss that minimizes the ℓ2 distance between normalized embeddings extracted from the input MRI images and their intensity-perturbed counterparts, alongside standard cross-entropy supervision. FSSR is evaluated across three convolutional backbones, ResNet-18, ResNet-34, and DenseNet-121, trained exclusively on the Kaggle Brain MRI dataset. Feature space analysis demonstrates that FSSR consistently reduces mean feature deviation and variance across architectures, indicating more stable internal representations. Generalization is assessed via zero-shot evaluation on the fully unseen BRISC-2025 dataset without retraining or fine-tuning. On the source domain, the best-performing configuration achieves 97.71% accuracy and 97.55% macro-F1. Under domain shift, FSSR improves external accuracy by up to 8.20 percentage points and macro-F1 by up to 12.50 percentage points, with DenseNet-121 achieving 96.70% accuracy and 96.87% macro-F1 at a domain gap of only 0.94%. Confusion matrix analysis further reveals reduced class confusion and more stable recall across challenging tumor categories, demonstrating that feature-level stability is a key factor for robust brain MRI classification under domain shift.

1. Introduction

Brain tumor classification using magnetic resonance imaging (MRI) is a fundamental task in neuro-oncology, supporting diagnosis, treatment planning, and longitudinal monitoring [1]. MRI provides rich soft tissue contrast and multi-sequence information, making it the imaging modality of choice for brain tumor assessments [2]. In recent years, deep learning, particularly Convolutional Neural Networks (CNNs), has become the dominant approach for automated brain tumor classification, achieving state-of-the-art performance on several public datasets [3,4,5]. Compared to traditional machine learning methods based on handcrafted features, CNN-based models can learn hierarchical representations directly from imaging data, enabling their superior discriminative performance [6,7].
However, growing evidence suggests that strong in-dataset accuracy does not necessarily imply clinical reliability [8]. Deep learning models trained on a single dataset often suffer significant performance degradation when evaluated on unseen data acquired from different institutions, scanners, or imaging protocols [9,10]. This phenomenon, commonly referred to as domain shift, arises from variations in intensity distributions, acquisition parameters, preprocessing pipelines, and patient populations [11]. In brain MRI specifically, intensity non-standardization across scanners and acquisition protocols has been shown to substantially affect models’ predictions [12,13]. As a result, models optimized solely for predictive accuracy may rely on dataset-specific cues rather than learning robust, generalizable representations [14]. Several strategies have been proposed to mitigate domain shift in medical imaging, including data augmentation [15], Domain Adaptation [16], adversarial learning [17], normalization schemes [18], and architectural modifications [19].
While these approaches can improve models’ robustness, many require access to target domain data during training or involve complex optimization procedures that limit their practical deployment [20]. Moreover, most existing methods focus on aligning input distributions or model parameters, without explicitly constraining the stability of internal feature representations under realistic perturbations [21,22]. Recent studies suggest that instability in latent feature space is a key contributor to poor generalization under domain shift, particularly in medical imaging tasks where intensity variations are prevalent [23].
Motivated by these observations, this work focuses on improving cross-dataset generalization by enforcing stability at the feature representation level rather than at the pixel or parameter level. This manuscript introduces Feature Space Stability Regularization (FSSR), a lightweight and model-agnostic training framework that explicitly encourages consistent latent representations for original MRI scans and their intensity-perturbed counterparts. By targeting feature-level robustness under MRI-safe perturbations, FSSR aims to improve generalization to unseen clinical data without requiring architectural changes, additional annotations, or access to target domain samples during training. Accordingly, this work makes the following contributions:
1. Feature space instability under realistic intensity perturbations is identified as a key factor limiting cross-dataset generalization in brain MRI classification, distinct from pixel-level or parameter-level variations.
2. Feature Space Stability Regularization (FSSR) is introduced as a lightweight and model-agnostic training framework that enforces consistency between normalized feature embeddings of original and MRI-safe intensity-perturbed inputs via an auxiliary ℓ2 loss.
3. External validation on the unseen BRISC-2025 dataset demonstrates that FSSR substantially improves generalization for deeper CNN architectures, achieving up to an 8.2% absolute accuracy improvement and a 12.5% macro-F1 improvement without retraining or fine-tuning, while revealing architecture-dependent robustness behavior.
4. Mechanistic feature space analysis shows that FSSR reduces mean feature deviation and feature variance, establishing a consistent empirical relationship between latent representation stability and cross-domain generalization performance.

2. Related Work

2.1. Deep Learning for Automated Brain Tumor Classification

The transition from handcrafted radiomic descriptors to deep learning has significantly transformed neuro-oncological image analysis [24,25]. Convolutional Neural Networks (CNNs) have demonstrated a strong capability for learning hierarchical and discriminative representations directly from multi-sequence MRI data [26]. Modern architectures such as ResNet- and DenseNet-based models consistently achieve high in-domain accuracy on benchmark datasets [27,28].
Recent investigations incorporate hybrid frameworks, transfer learning, and attention-based mechanisms to further improve diagnostic reliability [3,4,5]. Reviews highlight that multi-scale feature aggregation and attention modules are increasingly adopted to better capture tumor heterogeneity [6,10].
However, accumulating evidence indicates that these models may exploit scanner-specific artifacts [29] or high-frequency acquisition cues rather than clinically meaningful tumor features. This “shortcut learning” behavior leads to inflated internal validation results while generalization deteriorates on unseen cohorts [8,30].

2.2. Domain Shift and Generalization in Medical Imaging

Domain shift remains a central obstacle in deploying medical AI systems [31]. MRI intensities lack a standardized physical meaning, making learned representations highly sensitive to scanner configuration and acquisition protocols [14]. Variations in magnetic field strength, vendor reconstruction pipelines, and protocol parameters introduce non-linear intensity distortions that destabilize model predictions [9,11].
Furthermore, preprocessing strategies significantly influence downstream performance [12,13], and robustness to common image-level transformations such as color shifts, resizing, and normalization does not transfer reliably across datasets with large sample differences [32]. Standard k-fold cross-validation within a single dataset primarily evaluates internal consistency and may yield overly optimistic performance estimates that fail to reflect real-world distributional variability [33].

2.3. Regularization and Adaptation Strategies

To mitigate domain shift, research generally follows Domain Adaptation (DA) or Domain Generalization (DG) paradigms.
Domain Adaptation (DA). DA methods align feature distributions between the source and the target domains, often using adversarial learning frameworks such as DANN [34]. Broader reviews outline its theoretical and methodological foundations [35,36]. Recent medical imaging extensions include adversarial and histogram-based alignment strategies [16]. While effective, DA requires access to target domain data during training, limiting its applicability in dynamic clinical settings.
Domain Generalization (DG). DG assumes no target domain access and primarily relies on data augmentation and normalization techniques [15,20]. Common strategies include geometric transformations, intensity perturbations, and histogram normalization [18,21]. However, pixel-level transformations do not guarantee invariance in the latent feature space [37]. Models may become robust to superficial perturbations while retaining scanner-dependent biases internally.

2.4. Feature Space Stability and Motivation for FSSR

Recent research emphasizes that robust generalization requires stability within the latent representation space. Ideally, semantically equivalent inputs should map to nearby feature embeddings despite acquisition variability [38]. Disentanglement approaches have shown that separating biological and technical factors in latent space improves interpretability and robustness [23].
Consistency regularization methods encourage prediction stability under perturbations [39]. However, many approaches rely on adversarial alignment or access to target data. A lightweight, fully supervised mechanism explicitly constraining embedding stability under MRI-specific intensity perturbations remains insufficiently explored.
To address this gap, we propose Feature Space Stability Regularization (FSSR), which introduces an auxiliary ℓ2 penalty enforcing consistency between normalized latent embeddings derived from original and intensity-perturbed MRI inputs. Unlike traditional Domain Adaptation techniques, FSSR does not require target domain access. By directly constraining embedding stability, the method promotes scanner-invariant representations and improves zero-shot generalization. A summary of related approaches and the gap addressed by FSSR is provided in Table 1.

3. Methodology

This section presents the proposed Feature Space Stability Regularization (FSSR) framework for robust brain tumor MRI classification under intensity-driven domain shift. We first describe the datasets and the problem’s formulation, followed by the preprocessing and intensity perturbation strategy. We then detail the network architectures, training objective, and optimization procedure, and conclude with the overall training framework and evaluation setup.

3.1. Datasets

Two publicly available brain MRI datasets with identical four-class taxonomies—glioma, meningioma, pituitary adenoma, and no tumor—were used for the model development and cross-domain evaluation.
Source domain (Kaggle). The source dataset was the Brain MRI Images for Brain Tumor Detection dataset obtained from Kaggle. It consists of T1-weighted contrast-enhanced axial brain MRI slices provided in JPEG format. The images exhibit heterogeneous intensity distributions reflecting multi-institutional acquisition and preprocessing variability. Data are organized into predefined Training and Testing directories [40].
The original training set was split into training and validation subsets using stratified random sampling (80% training; 20% validation) to preserve the class proportions. The original testing directory was retained as an in-domain held-out test set comprising 1311 images, with the following class distribution: glioma ( n = 300 ), meningioma ( n = 306 ), no tumor ( n = 405 ), and pituitary adenoma ( n = 300 ).
Target domain (BRISC-2025). The Brain MRI Image Set for Classification 2025 (BRISC-2025) dataset was used exclusively for external validation to assess zero-shot generalization. This dataset contains 1000 images with a balanced class distribution ( n = 250 per class), acquired using standardized 3T imaging protocols. Compared to the Kaggle dataset, BRISC-2025 exhibits distinct intensity normalization and preprocessing characteristics. No samples from BRISC-2025 were used during the training, fine-tuning, or hyperparameter selection [41].

3.2. Problem Formulation

Let $\mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{N}$ denote the source domain dataset, where $x_i \in \mathbb{R}^{H \times W \times C}$ represents an MRI image and $y_i \in \{1, \ldots, K\}$ denotes the corresponding tumor class label, with $K = 4$. The objective is to learn a classifier $f_\theta$ parameterized by $\theta$ that achieves strong predictive performance on $\mathcal{D}_s$ while remaining robust when deployed on an unseen target domain $\mathcal{D}_t$. The source and target domains differ primarily due to intensity-related distribution shifts arising from acquisition and preprocessing variability. Importantly, no target domain samples are available during training, reflecting a realistic clinical deployment scenario.

3.3. Image Preprocessing and Intensity Perturbation

Each MRI image undergoes a standardized preprocessing pipeline to ensure consistent input dimensions while minimizing unnecessary variability. Images are converted to grayscale to remove channel redundancy and resized to a fixed spatial resolution of 224 × 224 pixels using area-based interpolation, which preserves the anatomical structure while performing effective downsampling. No skull stripping, tumor segmentation, or handcrafted region-of-interest extraction is applied, allowing the network to learn discriminative features directly from the full brain image. This minimal preprocessing strategy improves reproducibility and avoids any reliance on external segmentation tools.
Let $x_i^{\text{prep}}$ denote the preprocessed image. To emulate realistic domain shifts while preserving the anatomical structure, an intensity-perturbed view $\tilde{x}_i$ is generated using a stochastic, non-geometric transformation operator $T(\cdot)$. The perturbation operator includes the following: (i) global intensity scaling with a factor sampled uniformly from $[0.9, 1.1]$; (ii) additive zero-mean Gaussian noise with standard deviation equal to 1% of the image intensity standard deviation; and (iii) spatial smoothing via a $3 \times 3$ average pooling operation applied with 50% probability. Formally,
$$\tilde{x}_i = T(x_i^{\text{prep}}),$$
where $T(\cdot)$ is applied independently to each training sample.
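As a concrete illustration, the perturbation operator $T(\cdot)$ described above can be sketched in numpy as follows (an illustrative sketch, not the paper's released code; the function and argument names are assumptions):

```python
import numpy as np

def perturb(img, rng=None):
    """MRI-safe intensity perturbation T(.) as described in the text.
    `img` is a 2-D float array (grayscale slice); names are illustrative."""
    rng = rng or np.random.default_rng()
    out = img.astype(np.float64)
    # (i) global intensity scaling, factor ~ U[0.9, 1.1]
    out = out * rng.uniform(0.9, 1.1)
    # (ii) additive zero-mean Gaussian noise, std = 1% of the image intensity std
    out = out + rng.normal(0.0, 0.01 * img.std(), size=out.shape)
    # (iii) 3x3 average smoothing, applied with 50% probability
    if rng.random() < 0.5:
        padded = np.pad(out, 1, mode="edge")
        out = sum(padded[i:i + out.shape[0], j:j + out.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    return out
```

Because all three components are intensity-only (non-geometric), the anatomical structure of the slice is preserved while its intensity distribution shifts, mimicking scanner and protocol variability.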

3.4. Network Architecture and Feature Space Regularization

Each preprocessed image is forwarded through a Convolutional Neural Network (CNN) backbone g θ ( · ) to obtain a latent feature representation. Three widely used architectures are evaluated: ResNet-18, ResNet-34, and DenseNet-121. All networks are trained from scratch with random initialization and are adapted to accept single-channel grayscale input by modifying the first convolutional layer. The resulting latent feature dimensionality is 512 for the ResNet backbones and 1024 for DenseNet-121.
The central hypothesis of this study is that instability in latent feature representations under realistic intensity variations is a primary contributor to degraded cross-dataset generalization. To explicitly enforce robustness at the representation level, the proposed Feature Space Stability Regularization (FSSR) introduces an auxiliary loss that penalizes any discrepancies between the embeddings extracted from the original and intensity-perturbed inputs. Let
$$z_i = g_\theta(x_i^{\text{prep}}), \qquad \tilde{z}_i = g_\theta(\tilde{x}_i)$$
denote the feature embeddings of the original and perturbed images, respectively. The stability regularization loss is defined as
$$\mathcal{L}_{\text{stab}} = \left\| \frac{z_i}{\| z_i \|_2} - \frac{\tilde{z}_i}{\| \tilde{z}_i \|_2} \right\|_2^2,$$
which encourages consistency between normalized feature representations while preserving the discriminative structure.
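The stability loss amounts to an ℓ2-normalize-then-compare operation on embedding pairs, which can be sketched as follows (a numpy sketch with illustrative names; the paper's implementation presumably operates on framework tensors with autograd):

```python
import numpy as np

def fssr_stability_loss(z, z_tilde, eps=1e-12):
    """L_stab: mean squared L2 distance between L2-normalized embeddings.
    z, z_tilde: arrays of shape (batch, dim); names are illustrative."""
    z_n = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    zt_n = z_tilde / (np.linalg.norm(z_tilde, axis=1, keepdims=True) + eps)
    # squared L2 distance per sample, averaged over the batch
    return np.sum((z_n - zt_n) ** 2, axis=1).mean()
```

Note that the normalization makes the penalty invariant to the overall embedding magnitude, so only directional (semantic) changes in feature space are penalized.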
For supervised classification, the latent features are passed to a classifier head h ( · ) consisting of a single fully connected layer followed by a softmax activation. The categorical cross-entropy loss L ce is used to optimize the classification performance.

3.5. Total Training Objective

The final training objective jointly optimizes the discriminative accuracy and robustness of feature representations to intensity-driven perturbations. The overall loss function is given by
$$\mathcal{L} = \mathcal{L}_{\text{ce}} + \lambda \, \mathcal{L}_{\text{stab}},$$
where $\lambda$ controls the relative contribution of the stability regularization term. In all experiments, $\lambda = 0.05$ is used based on preliminary hyperparameter tuning. Setting $\lambda = 0$ reduces the formulation to the standard cross-entropy objective, enabling a direct ablation assessment of the proposed regularization.
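Putting the two terms together, the total objective can be sketched numerically as follows (a numpy sketch with illustrative names and λ = 0.05; actual training would use automatic differentiation in PyTorch):

```python
import numpy as np

def total_loss(logits, labels, z, z_tilde, lam=0.05):
    """L = L_ce + lambda * L_stab (illustrative names; lambda = 0.05 as in the paper)."""
    # numerically stable softmax cross-entropy on the original (unperturbed) images
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # stability term: squared L2 distance between L2-normalized embeddings
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    zt_n = z_tilde / np.linalg.norm(z_tilde, axis=1, keepdims=True)
    stab = np.sum((z_n - zt_n) ** 2, axis=1).mean()
    return ce + lam * stab
```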

3.6. Overall Training Framework

During each training iteration, paired original and intensity-perturbed images are forwarded through the same CNN backbone with shared parameters. The classification loss is computed using the original images, while the FSSR loss measures the discrepancy between the feature embeddings of the original and perturbed inputs. Gradients from both objectives are backpropagated jointly, encouraging the network to learn intensity-invariant representations without sacrificing the discriminative performance. Figure 1 provides a schematic overview of the proposed FSSR training framework.

3.7. Backbone Architectures and Rationale

Three Convolutional Neural Network (CNN) architectures with varying depth and connectivity are evaluated to assess the architectural generality and robustness of the proposed Feature Space Stability Regularization (FSSR) framework.
ResNet-18 and ResNet-34. Residual networks employ identity-based skip connections to stabilize optimization and alleviate the vanishing gradient problem in deep neural networks. A residual block is defined as
$$y = F(x) + x,$$
where $x$ and $y$ denote the input and output feature maps of the block, respectively, and $F(\cdot)$ represents a sequence of convolutional, normalization, and non-linear operations.
ResNet-18, comprising approximately 11.7 million parameters, serves as a lightweight baseline architecture with limited depth. ResNet-34 increases the network depth and representational capacity to approximately 21.8 million parameters while preserving the same residual learning paradigm. Evaluating both variants enables the examination of the proposed regularization across different model capacity regimes.
DenseNet-121. DenseNet architectures adopt dense connectivity, in which each layer receives as input the concatenation of the feature maps from all the preceding layers within a dense block. The output of the $\ell$-th layer is defined as
$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]),$$
where $[\cdot]$ denotes feature map concatenation and $H_\ell(\cdot)$ represents a composite function consisting of batch normalization, convolution, and activation.
This connectivity pattern encourages feature reuse, improves gradient flow, and promotes parameter efficiency. DenseNet-121 contains approximately 8.0 million parameters and produces higher-dimensional latent representations compared to the ResNet backbones. Including DenseNet-121 allows for the evaluation of FSSR under a fundamentally different feature aggregation mechanism.
A summary of the backbone architectures and their key characteristics is provided in Table 2.
A summary of the models’ characteristics is provided in Table 3.
All the models are trained end-to-end using identical optimization settings to ensure a fair comparison across architectures and training strategies. Optimization is performed using the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$ and a fixed batch size of 32 for all experiments. Network parameters are initialized using the default PyTorch 2.1.2 initialization scheme, and the learning rate is kept constant throughout training to maintain consistent experimental conditions across all the evaluated models.
Training continues until convergence with an early stopping strategy applied to prevent overfitting and ensure that model selection reflects the generalization performance rather than training convergence. During training, model performance is continuously monitored on a held-out validation subset. Early stopping is triggered when the validation loss does not improve for five consecutive evaluation cycles. When this criterion is met, the training process is terminated and the model parameters corresponding to the best validation performance are restored for final evaluation. This strategy prevents unnecessary optimization once the model begins to overfit the training data and ensures that the selected model corresponds to the most generalizable configuration observed during training.
Using validation-based early stopping also stabilizes the optimization process and reduces the sensitivity to training noise. In practice, it avoids prolonged training after convergence while maintaining a consistent training behavior across different architectures. Since the ResNet-18, ResNet-34, and DenseNet-121 backbones differ substantially in relation to their representational capacity and optimization dynamics, early stopping provides a unified and architecture-agnostic mechanism for determining when training should terminate.
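The early stopping criterion described above can be sketched as a small helper (an illustrative sketch; the class and attribute names are assumptions, not from the paper's code):

```python
class EarlyStopping:
    """Validation-loss early stopping with a patience of five evaluation cycles."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0
        self.best_state = None  # would hold the best model weights for restoration

    def step(self, val_loss, model_state=None):
        """Call once per evaluation cycle; returns True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss   # improvement: record it and reset the counter
            self.best_state = model_state
            self.counter = 0
            return False
        self.counter += 1               # no improvement this cycle
        return self.counter >= self.patience
```

After stopping, `best_state` would be restored so the final evaluation uses the checkpoint with the best validation loss, matching the restoration step described above.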
Mixed precision training (FP16) is employed to accelerate computation and reduce GPU memory usage while maintaining numerical stability. All experiments are conducted on a single NVIDIA RTX 3090 GPU with 24 GB of memory. Apart from the proposed Feature Space Stability Regularization (FSSR) objective, no architectural modifications, auxiliary networks, or additional supervision signals are introduced, ensuring that the observed performance improvements arise solely from the proposed regularization mechanism.
The regularization weight is set to λ = 0.05 , which represents a moderate balance between classification performance and feature space stability. Preliminary ablation experiments indicate that smaller values of λ provide insufficient regularization and yield only marginal improvements, whereas excessively large values may overly constrain the feature representations and hinder discriminative learning. Therefore, λ = 0.05 is adopted for all experiments to maintain a stable trade-off between robustness and predictive accuracy while avoiding architecture-specific hyperparameter tuning.

3.8. Feature Space Stability Measurement Protocol

To quantify representation robustness and provide mechanistic insight into the effect of FSSR, feature space stability is assessed by measuring the deviation between embeddings extracted from original and intensity-perturbed inputs. Specifically, the mean feature deviation across the evaluation set is computed as
$$\mu_\Delta = \frac{1}{N} \sum_{i=1}^{N} \left\| \frac{z_i}{\| z_i \|_2} - \frac{\tilde{z}_i}{\| \tilde{z}_i \|_2} \right\|_2,$$
where $z_i$ and $\tilde{z}_i$ denote the latent feature representations of the original and perturbed inputs, respectively, and $N$ is the number of samples. Lower values of $\mu_\Delta$ indicate increased stability of the learned representations under intensity perturbations, suggesting improved robustness to domain shift.
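The deviation metric $\mu_\Delta$ can be computed directly from stacked embedding matrices (a numpy sketch with illustrative names):

```python
import numpy as np

def mean_feature_deviation(Z, Z_tilde):
    """mu_Delta: mean L2 distance between L2-normalized original and
    perturbed embeddings. Z, Z_tilde: arrays of shape (N, dim)."""
    Z_n = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Zt_n = Z_tilde / np.linalg.norm(Z_tilde, axis=1, keepdims=True)
    # per-sample L2 distance, averaged over the evaluation set
    return np.linalg.norm(Z_n - Zt_n, axis=1).mean()
```

Because the embeddings are unit-normalized, $\mu_\Delta$ ranges from 0 (perfectly stable representations) to 2 (diametrically opposed embedding directions).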

3.9. External Generalization Evaluation

The generalization performance is evaluated via zero-shot external testing on the unseen BRISC-2025 dataset, which was acquired using different scanner hardware and imaging protocols than the source training data. No retraining, fine-tuning, or Domain Adaptation is performed at any stage. The models trained on the source domain are directly applied to the external dataset without modification.
The classification performance is assessed using the overall accuracy and macro-averaged F1-score to account for potential class imbalance. In addition, the per-class precision and recall are reported to identify systematic misclassification patterns and class-specific generalization behavior.

4. Results

This section presents a comprehensive evaluation of the proposed Feature Space Stability Regularization (FSSR) framework for brain tumor MRI classification. The experimental results are reported across multiple dimensions: (i) source domain classification performance on the Kaggle Brain MRI test set ( N = 1311 images); (ii) zero-shot external generalization under domain shift using the BRISC-2025 dataset ( N = 1000 images); (iii) model calibration quality assessed using Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Brier score, and Negative Log-Likelihood (NLL); (iv) statistical significance analysis via bootstrap resampling with 95% confidence intervals; and (v) feature space stability analysis providing mechanistic insight into the effectiveness of FSSR.
All the experiments are conducted using three backbone architectures—ResNet-18, ResNet-34, and DenseNet-121—trained from scratch without ImageNet pretraining. Unless otherwise stated, FSSR denotes the Feature Space Stability Regularization framework with λ = 0.05 . ECE denotes the Expected Calibration Error and NLL denotes the Negative Log-Likelihood.

4.1. Source Domain Classification Performance

Table 4 summarizes the classification and calibration performance on the Kaggle Brain MRI test set. Overall, Feature Space Stability Regularization (FSSR) yields consistent performance gains for deeper architectures while maintaining comparable performance for shallower networks.
Among all the evaluated models, ResNet-34 with FSSR achieves the highest classification accuracy of 97.71% and a macro-averaged F1-score of 97.55%, representing an absolute improvement of 1.98 percentage points over its baseline counterpart (95.73%). DenseNet-121 similarly benefits from the proposed regularization, with its accuracy improving from 96.03% to 97.64% (+1.61%) and the macro-F1 increasing from 95.75% to 97.47%. In addition to accuracy gains, both architectures exhibit substantial improvements in calibration metrics, including a reduced Expected Calibration Error (ECE), Brier score, and Negative Log-Likelihood (NLL).
In contrast, ResNet-18 demonstrates a marginal reduction in accuracy under FSSR (95.50% to 95.27%). This suggests that shallow architectures with limited representational capacity may not fully benefit from explicit feature space stability constraints. This architecture-dependent behavior is further examined through feature stability analysis in Section 4.7.
A detailed breakdown of the misclassification patterns is provided through the confusion matrices shown in Figure 2, Figure 3 and Figure 4. Across all architectures, the dominant source of error corresponds to confusion between the glioma and meningioma classes, a clinically relevant distinction due to differing treatment strategies. For ResNet-34, FSSR reduces glioma-to-meningioma misclassifications from 30 cases under baseline training to 11 cases, corresponding to a 63% reduction. Meningioma-to-glioma errors remain stable at eight cases for both methods.
Notably, all FSSR models achieve a perfect classification performance for the no tumor class (405/405 correct predictions), which is critical for minimizing false positive diagnoses in healthy patients.

4.2. Receiver Operating Characteristic Analysis

The receiver operating characteristic (ROC) analysis presented in Figure 5, Figure 6 and Figure 7 demonstrates excellent discriminative performance across all model configurations. ResNet-34 with Feature Space Stability Regularization (FSSR) achieves perfect or near-perfect per-class AUC values on the source domain, including glioma (0.998), meningioma (0.997), no_tumor (1.000), and pituitary (1.000), resulting in a macro-AUC of 0.998. DenseNet-121 FSSR exhibits comparable performance, achieving the same macro-AUC value. Comprehensive classification metrics for all the evaluated architectures on the primary Kaggle Brain MRI test set are summarized in Table 4.

4.3. External Generalization Under Domain Shift

The central hypothesis of this work is that Feature Space Stability Regularization improves cross-dataset generalization without requiring target domain data during training. To evaluate this, all the trained models were tested on the BRISC-2025 external validation dataset (N = 1000 images), which differs substantially from the source domain in terms of the scanner hardware, imaging protocols, and patient demographics. Importantly, no retraining or Domain Adaptation was performed—models trained exclusively on Kaggle data were applied directly in a zero-shot transfer setting.
As shown in Table 5, the most notable finding is the substantial improvement for ResNet-34: its accuracy increases from 85.50% (baseline) to 93.70% (FSSR), representing a gain of +8.20 percentage points (p < 0.001) and a 56% error reduction. The macro-F1 improvement is even larger, rising from 80.12% to 92.62% (+12.50 points), demonstrating that the FSSR benefits extend to class-balanced performance.
DenseNet-121 with FSSR achieves the best absolute external performance at 96.70% accuracy and 96.87% macro-F1. The domain gap decreases from 1.83% (baseline) to 0.94%—a 49% relative reduction—indicating that dense connectivity combined with feature space regularization yields inherently robust representations. ResNet-18 presents an exception: both the baseline and FSSR achieve identical 90.50% accuracy, with only marginal macro-F1 improvement (89.38% → 89.49%), suggesting that shallow networks may lack the representational capacity to exploit feature space constraints effectively.
These architecture-dependent patterns provide key insights into the interaction between network capacity and regularization. The ResNet-34 baseline exhibits the largest domain gap (10.23%) despite strong source performance (95.73%), indicating that deeper residual networks are prone to overfitting domain-specific features without explicit regularization. FSSR reduces this gap to 4.01%, a 61% relative improvement. In contrast, the DenseNet-121 baseline already generalizes effectively (1.83% gap), likely due to dense connectivity promoting feature reuse; nevertheless, FSSR still provides measurable gains.

4.4. Statistical Significance Analysis

To confirm that the observed improvements are statistically meaningful rather than due to sampling variability, we performed bootstrap hypothesis testing. For each model comparison, 1000 bootstrap samples were generated from test set predictions, and 95% confidence intervals (CI) were computed for accuracy differences between FSSR and baseline. The results were considered statistically significant when the CI excluded zero.
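The paired bootstrap procedure described above can be sketched as follows (a numpy sketch with illustrative names; per-sample correctness vectors for the two models are assumed as inputs):

```python
import numpy as np

def bootstrap_accuracy_ci(correct_a, correct_b, n_boot=1000, seed=0):
    """Paired bootstrap 95% CI for the accuracy difference between two
    models evaluated on the same test set (illustrative names).
    correct_a / correct_b: boolean arrays of per-sample correctness."""
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample samples with replacement
        diffs[b] = correct_a[idx].mean() - correct_b[idx].mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi   # significant at alpha = 0.05 when the CI excludes zero
```

Resampling the same indices for both models preserves the pairing between predictions, which tightens the interval relative to an unpaired comparison.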
Table 6 summarizes the complete statistical analysis. On the source domain (Kaggle), FSSR yields significant improvements for ResNet-34 (Δ = +1.99%, 95% CI [0.99%, 3.13%], p < 0.05) and DenseNet-121 (Δ = +1.59%, 95% CI [0.46%, 2.75%], p < 0.05). ResNet-18 shows no significant difference (Δ = +0.23%, 95% CI [−0.84%, 1.37%]).
Under domain shift (BRISC-2025), the statistical evidence strengthens considerably. ResNet-34 FSSR achieves the largest effect size observed in this study (+8.25%, 95% CI [6.10%, 10.30%], p < 0.001), with the narrow confidence interval indicating high statistical confidence. DenseNet-121 FSSR also demonstrates significant improvement (+2.49%, 95% CI [1.00%, 4.10%], p < 0.01), while ResNet-18 remains non-significant.

4.5. Model Calibration Under Domain Shift

Beyond discriminative accuracy, well-calibrated probability estimates are essential for clinical decision support systems, where prediction confidence directly influences diagnostic workflows [42]. A perfectly calibrated model produces predicted probabilities that match empirical accuracy: when predicting with 80% confidence, the model should be correct approximately 80% of the time. In medical imaging applications, miscalibrated predictions can lead to inappropriate clinical decisions, making calibration assessment a critical component of model evaluation [43].
To quantify calibration quality, we evaluate four widely adopted metrics: the Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Brier score, and Negative Log-Likelihood (NLL) [44,45].
The ECE measures the average discrepancy between predicted confidence and empirical accuracy across M confidence bins and is defined as
$$\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|,$$
where $B_m$ denotes the set of samples whose predicted confidence falls within bin $m$, $\operatorname{acc}(B_m)$ is the empirical accuracy of that bin, and $\operatorname{conf}(B_m)$ is the mean predicted probability [44].
The Maximum Calibration Error (MCE) captures the worst-case deviation between confidence and accuracy:
$$\mathrm{MCE} \;=\; \max_{m}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|.$$
The Brier score evaluates the mean squared difference between the predicted probabilities $p_i$ and the true labels $y_i$ [46]:
$$\mathrm{Brier} \;=\; \frac{1}{N} \sum_{i=1}^{N} \bigl(p_i - y_i\bigr)^2.$$
Finally, the Negative Log-Likelihood (NLL) measures the average negative log-probability assigned to the correct class:
$$\mathrm{NLL} \;=\; -\frac{1}{N} \sum_{i=1}^{N} \log p\bigl(y_i \mid x_i\bigr).$$
Lower values indicate better calibrated and more reliable probability estimates. These metrics are widely used to evaluate predictive uncertainty in deep learning models, particularly in safety-critical medical applications [43,44,45].
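The four metrics defined above can be computed directly from predicted probabilities. The sketch below is illustrative: it assumes (N, C) softmax outputs and uses the multi-class form of the Brier score (squared error summed over classes), with 15 equal-width confidence bins.

```python
import numpy as np

def calibration_metrics(probs, labels, n_bins=15):
    """Compute ECE, MCE, Brier score, and NLL from softmax outputs.

    probs: (N, C) predicted class probabilities; labels: (N,) true classes.
    Uses the multi-class Brier score (squared error summed over classes).
    """
    probs, labels = np.asarray(probs), np.asarray(labels)
    n, c = probs.shape
    conf = probs.max(axis=1)                       # predicted confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)

    ece, mce = 0.0, 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap             # weighted by |B_m| / N
            mce = max(mce, gap)                    # worst-case bin deviation

    onehot = np.eye(c)[labels]
    brier = np.mean(np.sum((probs - onehot) ** 2, axis=1))
    nll = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    return ece, mce, brier, nll
```

The small constant inside the logarithm guards against zero probabilities; the bin count is a common default rather than a value prescribed by the paper.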
On the source domain, all the models exhibit acceptable calibration, with ECE < 0.04. However, under domain shift, baseline models show substantial calibration degradation. Most notably, the ResNet-34 baseline exhibits ECE = 0.1166 on BRISC-2025, indicating that predicted confidence systematically overestimates true accuracy by approximately 11.7 percentage points on average, a potentially critical issue in clinical settings where overconfident incorrect predictions may mislead clinicians.
As summarized in Table 7, the proposed FSSR substantially improves calibration under domain shift. For example, ResNet-34 with FSSR achieves ECE = 0.0400 , representing a 66% reduction compared to the baseline. Similarly, the Brier score improves from 0.0640 to 0.0257 (a 60% reduction). DenseNet-121 combined with FSSR achieves the best calibration overall with ECE = 0.0166 , compared to 0.0325 for the baseline (49% reduction).
These results indicate that enforcing feature space stability not only improves cross-domain classification performance but also produces probability estimates that more faithfully reflect model uncertainty, which is critical for reliable deployment in medical decision support systems [42].

4.6. Comparative Baselines

To contextualize the effectiveness of the proposed Feature Space Stability Regularization (FSSR), we compare it against representative approaches designed to improve robustness and generalization under domain shift. Specifically, we consider two widely used strategies that operate at different stages of the learning process: feature distribution mixing and contrastive representation learning.
MixStyle. MixStyle is a Domain Generalization technique that improves robustness by mixing instance-level feature statistics during training [47]. Given a feature map f ( x ) extracted from an intermediate network layer, MixStyle perturbs the feature distribution by randomly mixing the channel-wise mean and standard deviation between two samples within a mini-batch.
Let μ ( f ) and σ ( f ) denote the channel-wise mean and standard deviation of the feature map. MixStyle generates a new feature representation as
$$\tilde{f}(x_i) \;=\; \sigma_m \cdot \frac{f(x_i) - \mu(f(x_i))}{\sigma(f(x_i))} \;+\; \mu_m,$$
where
$$\mu_m = \lambda\,\mu(f(x_i)) + (1-\lambda)\,\mu(f(x_j)), \qquad \sigma_m = \lambda\,\sigma(f(x_i)) + (1-\lambda)\,\sigma(f(x_j)),$$
and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ controls the mixing strength. This operation encourages the model to learn representations that are less sensitive to domain-specific feature statistics.
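The mixing operation can be sketched in NumPy as follows; the feature-map shape convention (B, C, H, W), the ε term for numerical stability, and the default α are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def mixstyle(feats, alpha=0.1, rng=None):
    """MixStyle on a batch of feature maps of shape (B, C, H, W).

    Each sample is normalized by its own channel-wise statistics, then
    re-styled with a Beta-mixed combination of its (mu, sigma) and those
    of a randomly paired sample from the same mini-batch.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mu = feats.mean(axis=(2, 3), keepdims=True)            # channel-wise mean
    sigma = feats.std(axis=(2, 3), keepdims=True) + 1e-6   # channel-wise std (+ eps)
    perm = rng.permutation(feats.shape[0])                 # random pairing j for each i
    lam = rng.beta(alpha, alpha, size=(feats.shape[0], 1, 1, 1))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1 - lam) * sigma[perm]
    return sigma_mix * (feats - mu) / sigma + mu_mix
```

In practice MixStyle is applied stochastically at a few early layers during training only; that scheduling is omitted here for brevity.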
SimCLR-based Representation Learning. SimCLR is a contrastive self-supervised learning framework that learns invariant feature representations by maximizing the agreement between augmented views of the same image [48]. Given two augmented views x i and x j of the same image, their corresponding feature embeddings z i and z j are encouraged to be similar in the representation space while remaining distinct from other samples in the batch.
The contrastive loss used in SimCLR is defined as
$$\mathcal{L}_{\mathrm{SimCLR}} \;=\; -\log \frac{\exp\!\bigl(\mathrm{sim}(z_i, z_j)/\tau\bigr)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\bigl(\mathrm{sim}(z_i, z_k)/\tau\bigr)},$$
where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is a temperature scaling parameter.
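The NT-Xent loss can be sketched in NumPy as follows; this is a forward-pass illustration only (production implementations use a differentiable framework), and the default temperature is an assumption.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent (SimCLR) loss for two augmented views' embeddings, each (N, D).

    Row i of z1 and row i of z2 form the positive pair; the remaining
    2N - 2 embeddings in the concatenated batch act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> cosine similarity
    sim = z @ z.T / tau
    n, n2 = z1.shape[0], z.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # enforce the k != i indicator
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])  # index of each positive
    log_denom = np.log(np.exp(sim).sum(axis=1))        # log of the softmax denominator
    return np.mean(log_denom - sim[np.arange(n2), pos])
```

Setting the diagonal to −∞ makes exp(·) vanish for k = i, which implements the indicator in the denominator without explicit masking.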
FSSR (Proposed Method). The proposed Feature Space Stability Regularization explicitly enforces the consistency between feature representations extracted from the original and intensity-perturbed MRI images. Given an input image x and its perturbed version x ˜ , the stability constraint is defined as
$$\mathcal{L}_{\mathrm{stab}} \;=\; \left\| \frac{f(x)}{\left\| f(x) \right\|_2} \;-\; \frac{f(\tilde{x})}{\left\| f(\tilde{x}) \right\|_2} \right\|_2^{2}.$$
The overall training objective becomes
$$\mathcal{L}_{\mathrm{total}} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda\,\mathcal{L}_{\mathrm{stab}}.$$
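The proposed objective reduces to a few lines on a batch of embeddings. The NumPy sketch below is for illustration; the actual training code would operate on framework tensors with gradients, and the function names are ours.

```python
import numpy as np

def fssr_stability_loss(f_x, f_xp):
    """Squared L2 distance between L2-normalized embedding batches (N, D).

    f_x, f_xp: features of the clean and intensity-perturbed inputs,
    extracted by the same encoder.
    """
    u = f_x / np.linalg.norm(f_x, axis=1, keepdims=True)
    v = f_xp / np.linalg.norm(f_xp, axis=1, keepdims=True)
    return np.mean(np.sum((u - v) ** 2, axis=1))

def fssr_total_loss(ce_loss, f_x, f_xp, lam=0.05):
    """L_total = L_CE + lambda * L_stab, with lambda = 0.05 as in the paper."""
    return ce_loss + lam * fssr_stability_loss(f_x, f_xp)
```

Note that the normalization makes the stability term invariant to a global rescaling of the embedding, so only directional changes in feature space are penalized.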
Unlike MixStyle or SimCLR, which modify the feature distributions or require contrastive pretraining, FSSR directly constrains representation stability during supervised training without introducing additional networks or training stages.
Experimental Setup for Fair Comparison. To ensure a rigorous comparison, all the evaluated methods were trained under identical experimental settings. Specifically, all the models used the same random seed (seed = 42), identical training–validation splits, preprocessing pipelines, optimizer configurations (AdamW with learning rate 1 × 10⁻⁴ and batch size 32), and early stopping criteria. This controlled setup ensures that any differences in performance arise from the regularization strategies rather than from confounding factors such as initialization or data ordering.
The resulting performance across architectures is summarized in Table 8. The results report both the source domain (Kaggle) performance and zero-shot external domain (BRISC-2025) performance, allowing for the direct comparison of cross-domain robustness.
Analysis of Results. Architecture-dependent effectiveness. FSSR demonstrates the strongest improvements on DenseNet-121, achieving 97.64% source domain accuracy and 96.70% external domain accuracy, corresponding to a domain gap of only 0.94%. This substantially outperforms SimCLR (2.65% gap) and MixStyle (3.27% gap).
Performance on ResNet architectures. For ResNet-18, FSSR maintains a comparable source domain accuracy (95.50% baseline vs 95.27% FSSR) while achieving an identical external accuracy of 90.50% on BRISC-2025. The domain gap shows only a marginal reduction from 5.00% to 4.77%, consistent with the observation that shallow architectures derive limited benefit from feature space stability constraints.
For ResNet-34, the improvements are substantially more pronounced. FSSR increases the source accuracy from 95.73% to 97.71% and external accuracy from 85.50% to 93.70%, reducing the domain gap from 10.23% to 4.01%. These results indicate that deeper architectures benefit substantially from Feature Space Stability Regularization.
SimCLR limitations. Although SimCLR slightly improves the source domain accuracy for some architectures, it does not consistently reduce the domain shift. For example, on ResNet-18, the domain gap increases from 5.00% to 8.88%, suggesting that contrastive representation learning alone may not adequately address the intensity-driven distribution shifts common in medical imaging.
MixStyle trade-offs. MixStyle generally reduces the domain gap compared to baseline models, but often at the expense of reduced source domain performance. For instance, on ResNet-18 the source accuracy drops from 95.50% to 91.30%, indicating that distribution mixing may compromise discriminative performance while improving robustness.
Overall comparison. Across all architectures, FSSR consistently achieves the best balance between source domain accuracy and cross-domain generalization. The method reduces domain gaps while maintaining a strong classification performance, and does so without requiring architectural modifications or additional training stages. This simplicity makes FSSR a practical and effective approach for improving robustness in medical imaging applications subject to domain shift.

4.7. Feature Space Stability Analysis

To validate the mechanistic hypothesis underlying FSSR—that regularizing feature space stability leads to more robust representations—we analyzed embedding variability under input perturbations. For each validation sample x, we computed the 2 distance between the feature vectors of the original image and its intensity-perturbed counterpart:
$$\bigl\| f(x) - f(\tilde{x}) \bigr\|_2,$$
where $\tilde{x}$ represents an intensity-scaled version of $x$ (scale factor sampled from $[0.9, 1.1]$) and $f(\cdot)$ denotes the learned 512-dimensional (ResNet) or 1024-dimensional (DenseNet) feature extractor.
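The deviation statistics used in this analysis can be computed as follows; the helper name is ours, and the 99th-percentile summary is included because it captures the tail behavior discussed in the histogram analysis.

```python
import numpy as np

def feature_deviation_stats(f_clean, f_pert):
    """Summarize per-sample embedding deviations ||f(x) - f(x_tilde)||_2.

    f_clean, f_pert: (N, D) embeddings of original and perturbed inputs.
    Returns the mean, standard deviation, and 99th-percentile deviation.
    """
    dev = np.linalg.norm(f_clean - f_pert, axis=1)  # one L2 distance per sample
    return dev.mean(), dev.std(), np.percentile(dev, 99)
```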
As shown in Table 9, FSSR consistently reduces both the mean and variance of feature deviations across all architectures. ResNet-18 shows the mean deviation decreasing from 4.513 to 3.602 (−20.2%) with the standard deviation reduced from 5.182 to 2.467 (−52.4%). ResNet-34 exhibits similar improvement: the mean reduces from 4.506 to 3.499 (−22.3%) and the standard deviation from 5.136 to 2.313 (−55.0%). DenseNet-121 achieves the largest relative reduction: the mean reduces from 4.565 to 3.474 (−23.9%) and the standard deviation from 5.157 to 2.168 (−58.0%).
The histogram analysis in Figure 8, Figure 9 and Figure 10 reveals qualitative differences in feature stability distributions. Baseline models exhibit long-tailed distributions with deviations extending beyond 40–50 units, indicating dramatic representation instability for certain inputs, likely the most vulnerable samples under domain shift. FSSR effectively eliminates these high-deviation outliers, reducing the 99th percentile deviation by approximately 60% across all the architectures.
This stabilization provides a plausible explanation for the improved generalization observed in deeper architectures. Representations that are invariant to intensity perturbations during training, which simulate inter-scanner variability, are inherently more robust when encountering novel acquisition protocols at test time. For ResNet-34 and DenseNet-121, the correlation between stability improvements and generalization gains is consistent and substantial. However, ResNet-18 presents a notable exception: despite comparable reductions in feature deviation (−20.2%) and variance (−52.4%), no meaningful generalization improvement is observed. This suggests that feature space stability is a necessary but not sufficient condition for cross-domain robustness, and that sufficient representational capacity is required for stability gains to translate into improved classification performance. Overall, these results support a strong empirical association between latent representation stability and cross-domain generalization in architectures with adequate depth, rather than a universal causal relationship.

4.8. Ablation Study: Regularization Weight λ

To investigate the sensitivity of FSSR to the regularization weight λ, we conducted a systematic ablation study across all three backbone architectures. The models were trained with λ ∈ {0, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2}, where λ = 0 corresponds to standard cross-entropy training without feature space regularization. All the other training parameters remained identical to ensure fair comparison.
Table 10, Table 11 and Table 12 present the complete results of the λ sensitivity analysis. Several key observations emerge from this experiment:
Effect of λ on cross-domain generalization. For deeper architectures, introducing Feature Space Stability Regularization (λ > 0) substantially improves generalization to the external BRISC-2025 dataset compared to baseline training (λ = 0). For ResNet-34, BRISC accuracy increases from 85.50% (λ = 0) to 93.70% (λ = 0.05), representing an 8.20 percentage point improvement and a 56% reduction in the error rate. DenseNet-121 shows similar trends, with the domain gap decreasing from 1.83% to 0.94% at the optimal λ value. However, for ResNet-18, FSSR does not yield consistent generalization gains: most λ values result in marginally lower BRISC accuracy than the baseline (λ = 0), with only λ = 0.05 matching the baseline performance of 90.50%. This architecture-dependent behavior suggests that shallow networks with limited representational capacity may be unable to simultaneously satisfy the classification and stability objectives, and that FSSR benefits are contingent on sufficient model depth.
Optimal λ range. The results indicate that moderate values of λ (0.02–0.05) consistently yield the best trade-off between source domain accuracy and cross-domain generalization. Very small values (λ ≤ 0.001) provide insufficient regularization, while excessively large values (λ ≥ 0.2) may over-constrain the feature space and reduce discriminative capacity.
Architecture-dependent sensitivity. Deeper architectures exhibit greater sensitivity to λ and derive larger benefits from feature space regularization. ResNet-18 shows modest improvements across all λ values (domain gap reduction from 5.00% to 4.77%), whereas ResNet-34 demonstrates dramatic improvement (10.23% to 4.01%). This observation is consistent with the hypothesis that deeper networks possess a greater representational capacity to simultaneously optimize classification accuracy and feature stability constraints.
Based on these results, λ = 0.05 was selected for all the main experiments as it provides a stable balance between classification performance and cross-domain robustness across architectures, avoiding the need for architecture-specific hyperparameter tuning.
To investigate the contribution of each perturbation component to the overall effectiveness of FSSR, we conducted a systematic ablation study examining all the possible combinations of the three perturbation types: intensity scaling (S), Gaussian noise injection (N), and spatial smoothing (B for blur). The models were trained with λ = 0.05 using each perturbation configuration, and evaluated on both the source domain (Kaggle) and external domain (BRISC-2025). The “None” configuration corresponds to standard cross-entropy training without any perturbations (equivalent to the baseline reported in Table 4 and Table 5), while “Full” denotes the complete FSSR pipeline combining all three components.
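For concreteness, the full perturbation pipeline (S + N + B) might be sketched as follows. The 3×3 mean filter standing in for spatial smoothing, the noise level, and all parameter defaults are illustrative assumptions, not the exact configuration used in the experiments.

```python
import numpy as np

def fssr_perturb(img, rng, scale_range=(0.9, 1.1), noise_std=0.02, smooth=True):
    """Full FSSR perturbation pipeline: scaling (S) + noise (N) + smoothing (B).

    img: 2D array with intensities in [0, 1]. The 3x3 mean filter stands in
    for spatial smoothing; kernel and parameter values are illustrative.
    """
    out = img * rng.uniform(*scale_range)                 # S: global intensity scaling
    out = out + rng.normal(0.0, noise_std, img.shape)     # N: local intensity fluctuations
    if smooth:                                            # B: mild resolution blur
        padded = np.pad(out, 1, mode="edge")
        out = sum(padded[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    return np.clip(out, 0.0, 1.0)                         # keep intensities MRI-valid
```

Ablating a component amounts to skipping the corresponding step (e.g., `smooth=False` for the S + N configuration).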
Table 13, Table 14 and Table 15 present the complete results for each architecture.
Key Findings. The ablation study reveals several important insights regarding the contribution of each perturbation component:
(1) Full FSSR consistently achieves the best cross-domain generalization. Across all three architectures, the complete perturbation pipeline (Scale + Noise + Smooth) yields the highest external accuracy and the lowest domain gap. For ResNet-34, Full FSSR achieves 93.70% BRISC accuracy with only a 4.01% domain gap, substantially outperforming the best single-component configuration (Noise: 90.80%; 5.38% gap) and the best two-component configuration (Noise+Smooth: 92.10%; 4.47% gap). For DenseNet-121, Full FSSR attains 96.70% external accuracy with the lowest domain gap of 0.94%, improving over the best partial combination (Noise + Smooth: 96.30%; 1.03% gap). For ResNet-18, the baseline already achieves 90.50% external accuracy, and Full FSSR is the only configuration that matches this performance while reducing the domain gap from 5.00% to 4.77%, whereas all partial configurations yield a lower BRISC accuracy than the baseline.
(2) Individual components provide partial but incomplete robustness. Among the single-component configurations, Gaussian noise injection provides the largest improvement in cross-domain generalization, particularly for ResNet-34 where it reduces the domain gap from 10.23% to 5.38%. However, noise alone does not match the comprehensive robustness achieved by the full pipeline. Intensity scaling improves the performance modestly across architectures, while spatial smoothing alone can paradoxically increase the domain gap for DenseNet-121 (2.12% vs 1.83% baseline), likely due to loss of fine-grained discriminative features.
(3) Synergistic effects emerge from component combinations. Two-component configurations consistently outperform individual components, and the full pipeline achieves a performance that exceeds all partial combinations. For ResNet-34, Scale+Noise achieves 91.40% BRISC accuracy and Noise+Smooth reaches 92.10%, yet Full FSSR attains 93.70%, demonstrating that the combination of all three perturbations provides complementary regularization effects that address different aspects of intensity-related domain shift. This progressive improvement from single to double to full configurations is consistent across all the architectures, supporting the interpretation that each perturbation component addresses a distinct source of intensity variability.
(4) Full FSSR provides architecture-agnostic robustness. Unlike partial configurations where optimal single or double combinations vary by architecture (e.g., Noise + Smooth for ResNet-34; Scale+Smooth for DenseNet-121), the complete FSSR pipeline consistently achieves the best performance across all evaluated backbones. This architecture-agnostic behavior eliminates the need for component selection and supports deployment in diverse clinical settings.
Table 16 summarizes the improvement achieved by Full FSSR over the baseline and the best single-component configuration.
These results demonstrate that the full FSSR pipeline, combining intensity scaling, Gaussian noise, and spatial smoothing, provides superior cross-domain generalization compared to any single component or partial combination across all architectures. The comprehensive perturbation strategy ensures robustness against diverse sources of intensity-related domain shift, including global intensity variations, local intensity fluctuations, and resolution differences. The consistent superiority of Full FSSR supports its adoption as the default configuration for clinical deployment where cross-scanner robustness is essential. For shallower architectures such as ResNet-18, the baseline already generalizes effectively, and FSSR primarily contributes through an improved domain gap rather than absolute accuracy gains.

4.9. Summary of Key Findings

The comprehensive experimental evaluation supports the following principal conclusions regarding the effectiveness of Feature Space Stability Regularization (FSSR) for brain tumor MRI classification:
(1) Source domain performance: FSSR improves classification performance for sufficiently deep architectures. ResNet-34 achieves the best overall results (97.71% accuracy; 97.55% macro-F1), with statistically significant gains for ResNet-34 (+1.99%) and DenseNet-121 (+1.59%; p < 0.05 ). The performance gains are primarily driven by the reduced glioma–meningioma confusion.
(2) Cross-Domain Generalization: FSSR substantially enhances the zero-shot transfer to the external BRISC-2025 dataset. ResNet-34 shows an 8.20% accuracy improvement (85.50% to 93.70%; p < 0.001 ), corresponding to a 56% reduction in error rate. DenseNet-121 achieves the highest external accuracy (96.70%) with a minimal domain gap of 0.94%, without any target domain exposure during training.
(3) Calibration reliability: FSSR produces markedly better-calibrated predictions under domain shift. ResNet-34 reduces the Expected Calibration Error (ECE) from 0.1166 to 0.0400 (66% reduction), while DenseNet-121 achieves the best calibration overall (ECE = 0.0166).
(4) Architecture dependence: The benefits of FSSR scale with the model’s depth. Shallow networks (ResNet-18) show no significant improvement, whereas deeper architectures (ResNet-34 and DenseNet-121) consistently benefit, indicating that sufficient representational capacity is required to exploit feature space constraints.
(5) Mechanistic and clinical relevance: FSSR reduces feature embedding variance by 52–58% under intensity perturbations across all architectures. For ResNet-34 and DenseNet-121, these stability gains are accompanied by substantial generalization improvements, supporting a strong empirical association between the representation stability and cross-domain performance. However, the absence of generalization gains for ResNet-18 despite similar stability improvements indicates that this relationship is contingent on sufficient model capacity. The perfect no_tumor classification on the Kaggle test set (405/405) highlights the method’s potential for safe clinical deployment.

5. Discussion

The results demonstrate that Feature Space Stability Regularization (FSSR) is an effective and simple strategy for improving cross-domain generalization in brain tumor MRI classification without requiring access to target domain data. The most notable improvement is observed for ResNet-34 on the external BRISC-2025 dataset, where the accuracy increases from 85.50% to 93.70%, corresponding to a 56% reduction in error rate. Notably, the magnitude of this gain is comparable to improvements typically reported by more complex Domain Adaptation methods that explicitly rely on target domain samples or adversarial alignment strategies [36]. These results indicate that explicitly constraining representation stability can provide a practical alternative to more computationally expensive Domain Adaptation techniques, particularly in clinical environments where access to target domain data may be limited.
Mechanistic analysis provides empirical support for the hypothesis that representation stability under realistic perturbations is causally linked to domain robustness. The reductions in the feature deviation (20–24%) and embedding variance (52–58%) strongly correlate with improved generalization, consistent with recent findings in medical imaging that identify feature space instability and sensitivity to acquisition-related variations as major contributors to domain shift. The benefits of FSSR are architecture dependent: deeper networks (ResNet-34 and DenseNet-121) show substantial improvements, whereas ResNet-18 does not, likely due to the limited representational capacity or implicit regularization effects in shallow models. This observation suggests that models with greater representational capacity may be better able to simultaneously optimize discriminative objectives and stability constraints.
From a clinical perspective, the substantial reduction in the Expected Calibration Error, particularly for ResNet-34 (0.1166 to 0.0400), is highly relevant, as reliable uncertainty estimates are critical for safe clinical decision support and the risk-aware deployment of artificial intelligence systems in healthcare [42]. In addition, the marked reduction in glioma–meningioma confusion and the perfect classification of the no_tumor class (405/405) highlight the potential of FSSR to reduce clinically consequential diagnostic errors. Together, these results suggest that improving feature space stability may contribute not only to improved predictive performance but also to more reliable and trustworthy clinical AI systems.

6. Limitations

This study demonstrates the effectiveness of Feature Space Stability Regularization (FSSR) across multiple Convolutional Neural Network architectures and under challenging cross-dataset evaluation scenarios. Nevertheless, several limitations and directions for future research should be considered:
  • Dataset scope:
    The current evaluation was conducted on two publicly available datasets sharing four tumor classes. While this setup enables rigorous cross-dataset validation, further evaluation on additional datasets, tumor types, imaging modalities (e.g., MRI and CT), and anatomical regions would provide stronger evidence of the method’s generalizability. In particular, clinical neuroimaging datasets often include greater variability in patient populations, acquisition protocols, and scanner manufacturers, which may introduce additional sources of distribution shift not captured in the current evaluation.
  • Volumetric MRI analysis:
    The present study operates on individual 2D MRI slices rather than full volumetric scans. Although slice-based approaches are computationally efficient and widely used in the literature, clinical neuroimaging workflows typically rely on three-dimensional MRI volumes that provide richer spatial context across adjacent slices. Important structural characteristics such as tumor shape, spatial continuity, and volumetric boundaries may therefore not be fully captured by slice-level models. Future work should investigate the applicability of FSSR to volumetric architectures, including 3D Convolutional Neural Networks or transformer-based models designed for volumetric medical imaging.
  • MRI sequence variability:
    The datasets used in this study primarily consist of T1-weighted contrast-enhanced MRI images. However, clinical tumor assessment frequently relies on multiple complementary MRI sequences, including T2-weighted and FLAIR scans, which emphasize different tissue properties and pathological features. These sequences exhibit different intensity distributions and contrast patterns, which may influence the effectiveness of the intensity-based perturbations used in FSSR. Evaluating the proposed framework across multi-sequence MRI datasets would provide a more comprehensive understanding of its robustness to sequence-dependent variations.
  • Architecture coverage:
    This work focuses on Convolutional Neural Networks, which remain widely adopted in medical imaging. However, emerging architectures such as vision transformers and hybrid CNN–transformer models have shown promising performance. Investigating the interaction between FSSR and these architectures represents a natural extension of this work, particularly given the increasing use of transformer-based models in medical image analysis.
  • Regularization tuning:
    A fixed regularization weight ( λ = 0.05 ) was used across all architectures to ensure fair comparison. While this moderate value provided a stable performance across the experiments, different architectures or datasets may benefit from different regularization strengths. Future research may explore adaptive or architecture-specific tuning strategies to further optimize the balance between classification accuracy and feature stability.
  • Task extension:
    The present study focuses on tumor classification. Extending the proposed framework to related tasks such as tumor segmentation, detection, and longitudinal disease monitoring would broaden its potential clinical applicability. These tasks often involve different supervision signals and spatial constraints, which may interact differently with feature space stability objectives.
  • Pretraining effects:
    All the models in this study were trained from scratch to isolate the effect of the proposed regularization. However, many practical medical imaging pipelines rely on transfer learning from large-scale pretrained models. Pretrained networks may already encode certain invariances or feature priors that could influence how feature space stability constraints behave during training. Investigating the interaction between FSSR and pretrained representations therefore represents an important direction for future work.

7. Conclusions

In conclusion, this study introduces Feature Space Stability Regularization (FSSR), a principled and computationally efficient framework for improving robustness, calibration, and cross-domain generalization in brain tumor MRI classification. By explicitly constraining the stability of latent feature representations under realistic intensity perturbations, FSSR encourages the learning of scanner-invariant representations that remain reliable under distribution shifts. The experimental results demonstrate that this simple regularization strategy yields substantial improvements in external validation performance without requiring access to target domain data, additional supervision, or architectural modifications.
The lightweight and architecture-agnostic nature of the proposed approach allows it to be readily integrated into existing deep learning pipelines, making it particularly suitable for real-world clinical deployment where heterogeneous acquisition protocols, scanner variability, and limited labeled data remain persistent challenges. Overall, these findings highlight the importance of feature space stability as a key factor for improving the reliability and generalization of medical imaging models in practical clinical environments.

Author Contributions

Conceptualization, S.C.K.; methodology, S.C.K.; software, H.D.S.J.; formal analysis, H.D.S.J.; investigation, H.D.S.J.; writing—original draft, S.C.K.; writing—review and editing, S.S.; supervision, S.S.; and funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in Kaggle at https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset (accessed on 7 January 2026) and https://www.kaggle.com/datasets/briscdataset/brisc2025 (accessed on 7 January 2026), reference number [37,38].

Acknowledgments

This research work was supported by Woosong University Research Fund 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pitchaikannu, V.; Vellaiyan, S.; Kedia, S.; Tandon, V.; Kumar, R.; Agarwal, D.; Phalak, M.; Verma, S.K.; Sawarkar, D.P.; Garg, K.; et al. Bridging Imaging and Therapy: A Review of Advances in Neuroradiology and Neuro-Oncology. Clin. Transl. Neurosci. 2025, 9, 51.
  2. Missaoui, R.; Hechkel, W.; Saadaoui, W.; Helali, A.; Leo, M. Advanced Deep Learning and Machine Learning Techniques for MRI Brain Tumor Analysis: A Review. Sensors 2025, 25, 2746.
  3. Banerjee, T.; Chhabra, P.; Kumar, M.; Kumar, A.; Abhishek, K.; Shah, M.A. Pyramidal Attention-Based T Network for Brain Tumor Classification: A Comprehensive Analysis of Transfer Learning Approaches for Clinically Reliable AI Hybrid Systems. Sci. Rep. 2025, 15, 28669.
  4. Anish, J.J.; Ajitha, D. Exploring the State-of-the-Art Algorithms for Brain Tumor Classification Using MRI Data. IEEE Access 2025, 13, 118033–118054.
  5. Celik, M.; Inik, O. Development of Hybrid Models Based on Deep Learning and Optimized Machine Learning Algorithms for Brain Tumor Multi-Classification. Expert Syst. Appl. 2024, 238, 122159.
  6. Salehi, A.W.; Khan, S.; Gupta, G.; Alabduallah, B.I.; Almjally, A.; Alsolai, H.; Siddiqui, T.; Mellit, A. A Study of CNN and Transfer Learning in Medical Imaging: Advantages, Challenges, and Future Scope. Sustainability 2023, 15, 5930.
  7. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88.
  8. Mårtensson, G.; Ferreira, D.; Granberg, T.; Cavallin, L.; Oppedal, K.; Padovani, A.; Rektorova, I.; Bonanni, L.; Pardini, M.; Kramberger, M.G.; et al. The Reliability of a Deep Learning Model in Clinical Out-of-Distribution MRI Data: A Multicohort Study. Med. Image Anal. 2020, 66, 101714.
  9. Suleman, M.U.; Mursaleen, M.; Khalil, U.; Saboor, A.; Bilal, M.; Khan, S.A.; Subhani, M.A.; Hussnain, M.A.; Tabassum, S.N.; Tahir, M. Assessing the Generalizability of Artificial Intelligence in Radiology: A Systematic Review of Performance Across Different Clinical Settings. Ann. Med. Surg. 2025, 87, 8803–8811.
  10. Lilhore, U.K.; Sunder, R.; Simaiya, S.; Alsafyani, M.; Monish Khan, M.D.; Alroobaea, R.; Alsufyani, H.; Baqasah, A.M. AG-MS3D-CNN: Multiscale Attention-Guided 3D Convolutional Neural Network for Robust Brain Tumor Segmentation across MRI Protocols. Sci. Rep. 2025, 15, 24306.
  11. Kushol, R.; Wilman, A.H.; Kalra, S.; Yang, Y.H. DSMRI: Domain Shift Analyzer for Multi-Center MRI Datasets. Diagnostics 2023, 13, 2947.
  12. Trojani, V.; Bassi, M.C.; Verzellesi, L.; Bertolini, M. Impact of Preprocessing Parameters in Medical Imaging-Based Radiomic Studies: A Systematic Review. Cancers 2024, 16, 2668.
  13. Ottoni, M.; Kasperczuk, A.; Tavora, L.M.N. Machine Learning in MRI Brain Imaging: A Review of Methods, Challenges, and Future Directions. Diagnostics 2025, 15, 2692.
  14. Kilim, O.; Olar, A.; Joó, T.; Palicz, T.; Pollner, P.; Csabai, I. Physical Imaging Parameter Variation Drives Domain Shift. Sci. Rep. 2022, 12, 21302. [Google Scholar] [CrossRef]
  15. Athlaye, C.; Arnaout, R. Domain-Guided Data Augmentation for Deep Learning on Medical Imaging. PLoS ONE 2023, 18, e0282532. [Google Scholar] [CrossRef]
  16. Kumari, S.; Singh, P. Deep Learning for Unsupervised Domain Adaptation in Medical Imaging: Recent Advancements and Future Perspectives. Comput. Biol. Med. 2024, 170, 107912. [Google Scholar] [CrossRef] [PubMed]
  17. Qian, X.; Shao, H.-C.; Li, Y.; Lu, W.; Zhang, Y. Histogram Matching-Enhanced Adversarial Learning for Unsupervised Domain Adaptation in Medical Image Segmentation. Med. Phys. 2025, 52, 4299–4317. [Google Scholar] [CrossRef] [PubMed]
  18. Singthongchai, J.; Wangkhamhan, T. Adaptive Normalization Enhances the Generalization of Deep Learning Models in Chest X-Ray Classification. J. Imaging 2025, 12, 14. [Google Scholar] [CrossRef] [PubMed]
  19. He, L.; Luan, L.; Hu, D. Deep Learning-Based Image Classification for AI-Assisted Integration of Pathology and Radiology. Front. Med. 2025, 12, 1574514. [Google Scholar] [CrossRef]
  20. Yoon, J.S.; Oh, K.; Shin, Y.; Mazurowski, M.A.; Suk, H.-I. Domain Generalization for Medical Image Analysis: A Review. Proc. IEEE 2024, 112, 1583–1609. [Google Scholar] [CrossRef]
  21. Wallimann, P.; van Timmeren, J.E.; Gabrys, H.S.; Khodabakhshi, Z.; Lapaeva, M.; Dal Bello, R.; Guckenberger, M.; Andratschke, N.; Tanadini-Lang, S. N-Peaks: MRI Intensity Normalization Based on Normal Tissue Histogram Peak Intensities. Phys. Med. Biol. 2025, 70, 24NT01. [Google Scholar] [CrossRef]
  22. Mabadeje, A.O.; Morales, M.M.; Torres-Verdín, C.; Pyrcz, M.J. Evaluating the Stability of Deep Learning Latent Feature Space for Subsurface Modeling. Math. Geosci. 2026, 58, 19–52. [Google Scholar] [CrossRef]
  23. Pan, J.; Seebök, P.; Fürbök, C.; Pochepnia, S.; Straub, J.; Beer, L.; Prosch, H.; Langs, G. Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery. In Applications of Medical Artificial Intelligence; LNCS; Springer: Cham, Switzerland, 2026; Volume 16206, pp. 309–319. [Google Scholar] [CrossRef]
  24. Berghout, T. The Neural Frontier of Future Medical Imaging: A Review of Deep Learning for Brain Tumor Detection. J. Imaging 2024, 11, 2. [Google Scholar] [CrossRef]
  25. Khalighi, S.; Reddy, K.; Midya, A.; Pandav, K.B.; Madabhushi, A.; Abedalthagafi, M. Artificial Intelligence in Neuro-Oncology: Advances and Challenges in Brain Tumor Diagnosis, Prognosis, and Precision Treatment. NPJ Precis. Oncol. 2024, 8, 80. [Google Scholar] [CrossRef]
  26. Gunasekaran, S.; Mercy Bai, P.S.; Mathivanan, S.K.; Rajadurai, H.; Shivahare, B.D.; Shah, M.A. Automated Brain Tumor Diagnostics: Empowering Neuro-Oncology with Deep Learning-Based MRI Image Analysis. PLoS ONE 2024, 19, e0306493. [Google Scholar] [CrossRef] [PubMed]
  27. Neamah, K.; Alsaadi, S.M.; Al-Dabbagh, R.D.; Al-Shargabi, A.A.; Al-Janabi, S. Brain Tumor Classification and Detection Based Deep Learning Models: A Systematic Review. IEEE Access 2024, 12, 2517–2542. [Google Scholar] [CrossRef]
  28. Singh, S.; Sharma, A. State of the Art Convolutional Neural Networks. Int. J. Perform. Eng. 2023, 19, 342–349. [Google Scholar] [CrossRef]
  29. DeGrave, A.J.; Janizek, J.D.; Lee, S.-I. AI for Radiographic COVID-19 Detection Selects Shortcuts over Signal. Nat. Mach. Intell. 2021, 3, 610–619. [Google Scholar] [CrossRef]
  30. Haynes, S.C.; Johnston, P.; Elyan, E. Generalisation Challenges in Deep Learning Models for Medical Imagery: Insights from External Validation of COVID-19 Classifiers. Multimed. Tools Appl. 2024, 83, 76753–76772. [Google Scholar] [CrossRef]
  31. Zhang, A.; Xing, L.; Zou, J.; Wu, J.C. Shifting Machine Learning for Healthcare from Development to Deployment and from Models to Data. Nat. Biomed. Eng. 2022, 6, 1330–1345. [Google Scholar] [CrossRef]
  32. Pei, X.; Zhao, Y.; Chen, L.; Guo, Q.; Duan, Z.; Pan, Y.; Hou, H. Robustness of Machine Learning to Color, Size Change, Normalization, and Image Enhancement on Micrograph Datasets with Large Sample Differences. Mater. Des. 2023, 232, 112086. [Google Scholar] [CrossRef]
  33. Wilimitis, D.; Walsh, C.G. Practical Considerations and Applied Examples of Cross-Validation for Model Development and Evaluation in Health Care: Tutorial. JMIR AI 2023, 2, e49023. [Google Scholar] [CrossRef] [PubMed]
  34. Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 7–9 July 2015; PMLR: New York, NY, USA, 2015; pp. 1180–1189. [Google Scholar]
  35. Fang, Y.; Yap, P.-T.; Lin, W.; Zhu, H.; Liu, M. Source-Free Unsupervised Domain Adaptation: A Survey. Neural Netw. 2024, 174, 106230. [Google Scholar] [CrossRef] [PubMed]
  36. Kouw, W.M.; Loog, M. A Review of Domain Adaptation without Target Labels. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 766–785. [Google Scholar] [CrossRef] [PubMed]
  37. Abbasi, S.; Lan, H.; Choupan, J.; Sheikh-Bahaei, N.; Pandey, G.; Varghese, B. Deep Learning for the Harmonization of Structural MRI Scans: A Survey. BioMed. Eng. OnLine 2024, 23, 90. [Google Scholar] [CrossRef]
  38. Dinsdale, N.K.; Jenkinson, M.; Namburete, A.I.L. Deep Learning-Based Unlearning of Dataset Bias for MRI Harmonisation and Confound Removal. NeuroImage 2021, 228, 117689. [Google Scholar] [CrossRef]
  39. Ouyang, Y.; Li, P.; Zhang, H.; Hu, X. Semi-Supervised Medical Image Segmentation Based on Frequency Domain Aware Stable Consistency Regularization. J. Digit. Imaging 2025, 38, 3221–3234. [Google Scholar] [CrossRef]
  40. Nickparvar, M. Brain Tumor MRI Dataset. Kaggle Datasets. 2020. Available online: https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset (accessed on 10 January 2025).
  41. BRISC Dataset and Collaborators. BRISC 2025: Annotated Dataset for Brain Tumor Image Segmentation and Classification. Sci. Data 2026, 13, 361. [Google Scholar] [CrossRef]
  42. Lambert, B.; Forbes, F.; Doyle, S.; Dehaene, H.; Dojat, M. Trustworthy Clinical AI Solutions: A Unified Review of Uncertainty Quantification in Deep Learning Models for Medical Image Analysis. Artif. Intell. Med. 2024, 150, 102830. [Google Scholar] [CrossRef]
  43. Klontzas, M.E.; Groot Lipman, K.B.W.; Akinci D’Antonoli, T.; Andreychenko, A.; Cuocolo, R.; Dietzel, M.; Gitto, S.; Huisman, H.; Santinha, J.; Vernuccio, F.; et al. ESR Essentials: Common Performance Metrics in AI—Practice Recommendations by the European Society of Medical Imaging Informatics. Eur. Radiol. 2026, 36, 1528–1540. [Google Scholar] [CrossRef]
  44. Balanya, S.A.; Maroñas, J.; Ramos, D. Adaptive Temperature Scaling for Robust Calibration of Deep Neural Networks. Neural Comput. Appl. 2024, 36, 8073–8095. [Google Scholar] [CrossRef]
  45. Faghani, S.; Moassefi, M.; Vahdati, S.; Mahmoudi Dehaki, S.; Arefan, D.; Ito, K.; Erickson, B.J. Quantifying Uncertainty in Deep Learning of Radiologic Images. Radiology 2023, 308, e222217. [Google Scholar] [CrossRef] [PubMed]
  46. Geysels, A.; Van Calster, B.; De Moor, B.; Froyman, W.; Timmerman, D. Calibration in Multiple Instance Learning: Evaluating Aggregation Methods for Ultrasound-Based Diagnosis. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2025; LNCS; Springer: Cham, Switzerland, 2026; Volume 15974, pp. 55–65. [Google Scholar] [CrossRef]
  47. Yamashita, K.; Hotta, K. MixStyle-Based Contrastive Test-Time Adaptation: Pathway to Domain Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 16–22 June 2024; pp. 1029–1037. [Google Scholar]
  48. Haghighi, F.; Taher, M.R.H.; Zhou, Z.; Gotway, M.B.; Liang, J. Self-Supervised Learning for Medical Image Analysis: Discriminative, Restorative, or Adversarial? Med. Image Anal. 2024, 94, 103086. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Schematic overview of the proposed FSSR training framework.
Figure 2. Confusion matrix for source domain classification on the Kaggle Brain MRI test set using DenseNet-121 with Feature Space Stability Regularization (left) and baseline training (right). The colorbar on the right indicates prediction counts; darker shading represents higher values.
Figure 3. Confusion matrix for source domain classification on the Kaggle Brain MRI test set using ResNet-34 with Feature Space Stability Regularization (left) and baseline training (right). The colorbar on the right indicates prediction counts; darker shading represents higher values.
Figure 4. Confusion matrix for source domain classification on the Kaggle Brain MRI test set using ResNet-18 with Feature Space Stability Regularization (left) and baseline training (right). The colorbar on the right indicates prediction counts; darker shading represents higher values.
Figure 5. ROC curves for DenseNet-121 with Feature Space Stability Regularization (FSSR) on the Kaggle (source) and BRISC (cross-domain) datasets. Dotted line represents the random-chance classifier (AUC = 0.50).
Figure 6. ROC curves for ResNet-18 with Feature Space Stability Regularization (FSSR) on the Kaggle (source) and BRISC (cross-domain) datasets. Dotted line represents the random-chance classifier (AUC = 0.50).
Figure 7. ROC curves for ResNet-34 with Feature Space Stability Regularization (FSSR) on the Kaggle (source) and BRISC (cross-domain) datasets. Dotted line represents the random-chance classifier (AUC = 0.50).
Figure 8. Feature space deviation distribution for DenseNet-121, demonstrating the strongest stabilization effect with FSSR and highly compact embeddings.
Figure 9. Feature space deviation distribution for ResNet-34, where FSSR markedly reduces the mean deviation and suppresses long-tailed instability under perturbations.
Figure 10. Feature space deviation distribution for ResNet-18 under intensity perturbations, showing reduced variance with FSSR despite limited impact on classification performance.
Table 1. Summary of related approaches for brain tumor classification and Domain Generalization.

| Approach Category | Key Techniques | Primary Focus | Limitations/Gaps | References |
|---|---|---|---|---|
| Standard Deep Learning | CNN architectures (ResNet, DenseNet), transfer learning | Maximizing in-domain classification accuracy | Prone to shortcut learning; performance degrades under cross-scanner domain shift | [3,4,5,27,28] |
| Data Augmentation | Geometric transforms, intensity perturbations | Increasing input-space diversity | Pixel-level robustness does not ensure latent feature invariance | [15,20] |
| Domain Adaptation (DA) | Adversarial learning (DANN), feature alignment | Source–target feature alignment | Requires target domain data during training; unsuitable for zero-shot deployment | [17,34,36] |
| Normalization | Intensity standardization, adaptive normalization | Harmonizing scanner intensity distributions | Preprocessing dependent; limited robustness to complex non-linear artifacts | [14,18,21] |
| Proposed Method (FSSR) | Feature Space Stability Regularization | Embedding stability under MRI-specific perturbations | Target-free (zero-shot); lightweight; explicit feature consistency constraint | This work |
Table 2. Summary of backbone architectures used in this study.

| Model | Depth | Parameters | Feature Dim. | Initialization |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7 M | 512 | Random |
| ResNet-34 | 34 | 21.8 M | 512 | Random |
| DenseNet-121 | 121 | 8.0 M | 1024 | Random |
Table 3. Summary of key training and regularization parameters.

| Parameter | Value |
|---|---|
| Input resolution | 224 × 224 pixels |
| Preprocessing | Grayscale conversion and area-based resizing |
| Weight initialization | Random (trained from scratch) |
| Optimizer | AdamW |
| Learning rate | 1 × 10⁻⁴ |
| Batch size | 32 |
| Maximum epochs | 25 |
| Early stopping patience | 5 epochs |
| FSSR weight (λ) | 0.05 |
| Intensity scaling range | [0.9, 1.1] |
| Gaussian noise (σ) | 1% of image standard deviation |
| Spatial smoothing | 3 × 3 average pooling (p = 0.5) |
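The perturbation settings above, combined with the ℓ2 consistency objective described in the abstract, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the edge-padded box filter, the per-image application order, and the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x, scale_range=(0.9, 1.1), noise_frac=0.01, smooth_p=0.5):
    """Apply the MRI-safe intensity perturbations of Table 3 to a 2-D image:
    random intensity scaling in [0.9, 1.1], Gaussian noise with sigma equal
    to 1% of the image's standard deviation, and 3x3 average smoothing
    applied with probability 0.5."""
    xp = x * rng.uniform(*scale_range)                       # intensity scaling
    xp = xp + rng.normal(0.0, noise_frac * x.std(), x.shape) # additive noise
    if rng.random() < smooth_p:                              # optional smoothing
        pad = np.pad(xp, 1, mode="edge")
        h, w = xp.shape
        xp = sum(pad[i:i + h, j:j + w]
                 for i in range(3) for j in range(3)) / 9.0  # 3x3 box filter
    return xp

def fssr_penalty(z, z_perturbed):
    """Squared L2 distance between L2-normalized embeddings of a clean image
    and its perturbed counterpart; in training this term is added to the
    cross-entropy loss with weight lambda = 0.05 (Table 3)."""
    zn = z / np.linalg.norm(z)
    zpn = z_perturbed / np.linalg.norm(z_perturbed)
    return float(np.sum((zn - zpn) ** 2))
```

Identical embeddings give a zero penalty, and the penalty is bounded above by 4 for normalized vectors, which keeps the auxiliary term on a comparable scale to the cross-entropy loss.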
Table 4. Source domain classification and calibration performance on Kaggle Brain MRI test set (N = 1311 images).

| Backbone | Method | Acc (%) | Macro-F1 (%) | ECE | Brier | NLL |
|---|---|---|---|---|---|---|
| ResNet-18 | Baseline | 95.50 | 95.15 | 0.0260 | 0.0186 | 0.1593 |
| ResNet-18 | FSSR | 95.27 | 94.91 | 0.0222 | 0.0186 | 0.1599 |
| ResNet-34 | Baseline | 95.73 | 95.42 | 0.0351 | 0.0191 | 0.1966 |
| ResNet-34 | FSSR | 97.71 | 97.55 | 0.0148 | 0.0093 | 0.0839 |
| DenseNet-121 | Baseline | 96.03 | 95.75 | 0.0202 | 0.0150 | 0.1209 |
| DenseNet-121 | FSSR | 97.64 | 97.47 | 0.0117 | 0.0089 | 0.0711 |
Table 5. External generalization performance on BRISC-2025 dataset.

| Backbone | Method | Acc (%) | F1 (%) | Domain Gap (%) | ECE | Brier | NLL |
|---|---|---|---|---|---|---|---|
| ResNet-18 | Baseline | 90.50 | 89.38 | 5.00 | 0.0421 | 0.0373 | 0.3120 |
| ResNet-18 | FSSR | 90.50 | 89.49 | 4.77 | 0.0534 | 0.0390 | 0.3348 |
| ResNet-34 | Baseline | 85.50 | 80.12 | 10.23 | 0.1166 | 0.0640 | 0.7109 |
| ResNet-34 | FSSR | 93.70 | 92.62 | 4.01 | 0.0400 | 0.0257 | 0.2413 |
| DenseNet-121 | Baseline | 94.20 | 94.32 | 1.83 | 0.0325 | 0.0223 | 0.1788 |
| DenseNet-121 | FSSR | 96.70 | 96.87 | 0.94 | 0.0166 | 0.0130 | 0.1041 |
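The Domain Gap column follows directly from the two accuracy tables: source-domain (Kaggle) accuracy minus zero-shot external (BRISC-2025) accuracy, in percentage points. A one-line helper, with the definition inferred from the tabulated values:

```python
def domain_gap(source_acc, external_acc):
    """Domain gap as reported in Table 5: source-domain accuracy minus
    external zero-shot accuracy, in percentage points (2-decimal rounding
    matches the table)."""
    return round(source_acc - external_acc, 2)
```

For example, DenseNet-121 with FSSR gives domain_gap(97.64, 96.70), matching the reported 0.94.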
Table 6. Statistical significance analysis using bootstrap hypothesis testing (n = 1000 iterations).

| Dataset | Backbone | Δ Accuracy (%) | 95% CI | Significance |
|---|---|---|---|---|
| Kaggle | ResNet-18 | +0.23 | [−0.84, 1.37] | Not significant |
| Kaggle | ResNet-34 | +1.99 | [0.99, 3.13] | p < 0.05 |
| Kaggle | DenseNet-121 | +1.59 | [0.46, 2.75] | p < 0.05 |
| BRISC | ResNet-18 | +0.03 | [−1.77, 1.83] | Not significant |
| BRISC | ResNet-34 | +8.20 | [6.10, 10.30] | p < 0.001 |
| BRISC | DenseNet-121 | +2.49 | [1.00, 4.10] | p < 0.01 |
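A paired bootstrap of the kind summarized above can be sketched as follows. Table 6 states only the iteration count, so the resampling unit (per-image correctness vectors) and the percentile CI are assumptions; the paper's exact bootstrap variant may differ.

```python
import numpy as np

def bootstrap_delta_accuracy(correct_fssr, correct_base, n_boot=1000, seed=0):
    """Paired bootstrap over test images (n_boot = 1000, as in Table 6):
    resample image indices with replacement, recompute the accuracy
    difference (FSSR minus baseline) on each resample, and return the mean
    difference with a 95% percentile CI, all in percentage points. A CI
    that excludes zero is read as a significant improvement."""
    a = np.asarray(correct_fssr, dtype=float)  # 1 = correct, 0 = incorrect
    b = np.asarray(correct_base, dtype=float)
    rng = np.random.default_rng(seed)
    n = a.size
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample the test set
        deltas[i] = a[idx].mean() - b[idx].mean()
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return 100 * deltas.mean(), (100 * lo, 100 * hi)
```

Pairing the resamples (the same indices for both models) removes the shared per-image variance, which is why even a small accuracy difference can be significant on a fixed test set.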
Table 7. Model calibration metrics with confidence level analysis.

| Dataset | Backbone | Method | ECE | MCE | Brier | Conf. Level | 95% CI |
|---|---|---|---|---|---|---|---|
| Kaggle | ResNet-18 | FSSR | 0.0222 | 0.4647 | 0.0186 | 0.9796 | [0.9756, 0.9833] |
| Kaggle | ResNet-34 | FSSR | 0.0148 | 0.6330 | 0.0093 | 0.9901 | [0.9872, 0.9926] |
| Kaggle | DenseNet-121 | FSSR | 0.0117 | 0.5136 | 0.0089 | 0.9846 | [0.9810, 0.9879] |
| BRISC | ResNet-18 | FSSR | 0.0534 | 0.3900 | 0.0390 | 0.9465 | [0.9392, 0.9540] |
| BRISC | ResNet-34 | FSSR | 0.0400 | 0.4848 | 0.0257 | 0.9757 | [0.9706, 0.9805] |
| BRISC | DenseNet-121 | FSSR | 0.0166 | 0.4077 | 0.0130 | 0.9735 | [0.9681, 0.9783] |
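The ECE values above come from the standard binned estimator; the bin count below is an assumption, as it is not stated in this excerpt. MCE is the maximum (unweighted) per-bin gap rather than the occupancy-weighted sum.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """Binned Expected Calibration Error: the absolute gap between mean
    confidence and accuracy within each equal-width confidence bin,
    weighted by the fraction of samples falling in that bin. `conf` holds
    the predicted-class probabilities, `correct` the 0/1 correctness flags.
    The 15-bin choice is an assumption, not taken from the paper."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)   # half-open bins (lo, hi]
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap        # occupancy-weighted gap
    return float(ece)
```

A well-calibrated model has per-bin accuracy tracking per-bin confidence, driving every gap (and hence the ECE) toward zero.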
Table 8. Comparison of Domain Generalization methods across backbone architectures. The results report source domain (Kaggle) performance, external domain (BRISC-2025) performance, and the resulting domain gap.

| Architecture | Method | Kaggle Acc (%) | Kaggle F1 (%) | BRISC-2025 Acc (%) | BRISC-2025 F1 (%) | Domain Gap (%) |
|---|---|---|---|---|---|---|
| ResNet-18 | Baseline | 95.50 | 95.15 | 90.50 | 89.38 | 5.00 |
| ResNet-18 | SimCLR | 94.28 | 93.90 | 85.40 | 81.72 | 8.88 |
| ResNet-18 | MixStyle | 91.30 | 90.72 | 85.60 | 84.29 | 5.70 |
| ResNet-18 | FSSR (Ours) | 95.27 | 94.91 | 90.50 | 89.49 | 4.77 |
| ResNet-34 | Baseline | 95.73 | 95.42 | 85.50 | 80.12 | 10.23 |
| ResNet-34 | SimCLR | 93.52 | 93.15 | 86.00 | 83.47 | 7.52 |
| ResNet-34 | MixStyle | 93.90 | 93.55 | 87.00 | 84.54 | 6.90 |
| ResNet-34 | FSSR (Ours) | 97.71 | 97.55 | 93.70 | 92.62 | 4.01 |
| DenseNet-121 | Baseline | 96.03 | 95.75 | 94.20 | 94.32 | 1.83 |
| DenseNet-121 | SimCLR | 93.75 | 93.40 | 91.10 | 90.56 | 2.65 |
| DenseNet-121 | MixStyle | 92.37 | 92.12 | 89.10 | 89.06 | 3.27 |
| DenseNet-121 | FSSR (Ours) | 97.64 | 97.47 | 96.70 | 96.87 | 0.94 |
Table 9. Feature space stability metrics comparing baseline (CE) and FSSR models.

| Backbone | Method | Mean Dev. | Std Dev. | Δ Mean | Δ Std |
|---|---|---|---|---|---|
| ResNet-18 | Baseline (CE) | 4.513 | 5.182 | — | — |
| ResNet-18 | FSSR | 3.602 | 2.467 | −20.2% | −52.4% |
| ResNet-34 | Baseline (CE) | 4.506 | 5.136 | — | — |
| ResNet-34 | FSSR | 3.499 | 2.313 | −22.3% | −55.0% |
| DenseNet-121 | Baseline (CE) | 4.565 | 5.157 | — | — |
| DenseNet-121 | FSSR | 3.474 | 2.168 | −23.9% | −58.0% |
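The stability metrics above summarize the per-image ℓ2 deviation between clean and perturbed embeddings over the test set. A minimal sketch, assuming raw (unnormalized) embeddings stacked row-wise; whether the paper normalizes before measuring deviation is not stated in this excerpt.

```python
import numpy as np

def feature_deviation_stats(Z, Z_perturbed):
    """Mean and standard deviation of the per-image L2 distance between
    clean embeddings Z and perturbed embeddings Z_perturbed (both of shape
    (n_images, feature_dim)), as summarized in Table 9. Smaller values
    indicate a more stable feature space under intensity perturbations."""
    dev = np.linalg.norm(np.asarray(Z) - np.asarray(Z_perturbed), axis=1)
    return float(dev.mean()), float(dev.std())
```

Under this reading, FSSR shrinking both the mean and (especially) the standard deviation corresponds to suppressing the long-tailed instability visible in the deviation histograms of Figures 8–10.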
Table 10. λ sensitivity analysis for ResNet-18.

| λ | Kaggle Acc (%) | Kaggle F1 (%) | BRISC Acc (%) | Gap (%) |
|---|---|---|---|---|
| 0 | 95.50 | 95.15 | 90.50 | 5.00 |
| 0.001 | 95.04 | 94.68 | 89.20 | 5.84 |
| 0.005 | 95.12 | 94.76 | 89.60 | 5.52 |
| 0.01 | 94.89 | 94.52 | 89.40 | 5.49 |
| 0.02 | 95.04 | 94.68 | 89.80 | 5.24 |
| 0.05 | 95.27 | 94.91 | 90.50 | 4.77 |
| 0.10 | 94.66 | 94.28 | 89.20 | 5.46 |
| 0.20 | 94.20 | 93.80 | 88.50 | 5.70 |
Table 11. λ sensitivity analysis for ResNet-34.

| λ | Kaggle Acc (%) | Kaggle F1 (%) | BRISC Acc (%) | Gap (%) |
|---|---|---|---|---|
| 0 | 95.73 | 95.42 | 85.50 | 10.23 |
| 0.001 | 95.96 | 95.66 | 87.20 | 8.76 |
| 0.005 | 96.19 | 95.90 | 88.50 | 7.69 |
| 0.01 | 96.50 | 96.22 | 89.80 | 6.70 |
| 0.02 | 96.80 | 96.54 | 91.50 | 5.30 |
| 0.05 | 97.71 | 97.55 | 93.70 | 4.01 |
| 0.10 | 97.25 | 97.06 | 92.80 | 4.45 |
| 0.20 | 96.95 | 96.74 | 92.00 | 4.95 |
Table 12. λ sensitivity analysis for DenseNet-121.

| λ | Kaggle Acc (%) | Kaggle F1 (%) | BRISC Acc (%) | Gap (%) |
|---|---|---|---|---|
| 0 | 96.03 | 95.75 | 94.20 | 1.83 |
| 0.001 | 96.34 | 96.08 | 94.80 | 1.54 |
| 0.005 | 96.57 | 96.32 | 95.40 | 1.17 |
| 0.01 | 96.80 | 96.56 | 95.80 | 1.00 |
| 0.02 | 97.18 | 96.96 | 96.20 | 0.98 |
| 0.05 | 97.64 | 97.47 | 96.70 | 0.94 |
| 0.10 | 97.41 | 97.22 | 96.30 | 1.11 |
| 0.20 | 97.10 | 96.90 | 95.80 | 1.30 |
Table 13. Perturbation component ablation for ResNet-18 ( λ = 0.05 ).

| Components | Kaggle Acc | Kaggle F1 | BRISC Acc | BRISC F1 | Gap (%) |
|---|---|---|---|---|---|
| None | 95.50 | 95.15 | 90.50 | 89.38 | 5.00 |
| Scale | 95.12 | 94.80 | 88.60 | 86.32 | 6.52 |
| Noise | 95.19 | 94.87 | 89.40 | 87.85 | 5.79 |
| Smooth | 95.04 | 94.72 | 88.20 | 85.94 | 6.84 |
| Scale + Noise | 95.27 | 94.95 | 89.70 | 88.12 | 5.57 |
| Scale + Smooth | 95.35 | 95.05 | 89.50 | 87.73 | 5.85 |
| Noise + Smooth | 95.19 | 94.90 | 89.80 | 88.25 | 5.39 |
| Full (S + N + B) | 95.27 | 94.91 | 90.50 | 89.49 | 4.77 |

Full FSSR: 4.6% reduction in domain gap relative to baseline.
Table 14. Perturbation component ablation for ResNet-34 ( λ = 0.05 ).

| Components | Kaggle Acc | Kaggle F1 | BRISC Acc | BRISC F1 | Gap (%) |
|---|---|---|---|---|---|
| None | 95.73 | 95.42 | 85.50 | 80.12 | 10.23 |
| Scale | 95.96 | 95.69 | 87.80 | 84.56 | 8.16 |
| Noise | 96.18 | 95.97 | 90.80 | 89.15 | 5.38 |
| Smooth | 96.11 | 95.85 | 87.50 | 84.23 | 8.61 |
| Scale + Noise | 96.72 | 96.48 | 91.40 | 89.87 | 5.32 |
| Scale + Smooth | 96.34 | 96.10 | 89.90 | 87.45 | 6.44 |
| Noise + Smooth | 96.57 | 96.35 | 92.10 | 90.78 | 4.47 |
| Full (S + N + B) | 97.71 | 97.55 | 93.70 | 92.62 | 4.01 |

Full FSSR: +8.20% BRISC accuracy; 61% reduction in domain gap.
Table 15. Perturbation component ablation for DenseNet-121 ( λ = 0.05 ).

| Components | Kaggle Acc | Kaggle F1 | BRISC Acc | BRISC F1 | Gap (%) |
|---|---|---|---|---|---|
| None | 96.03 | 95.75 | 94.20 | 94.32 | 1.83 |
| Scale | 96.88 | 96.70 | 95.50 | 95.42 | 1.38 |
| Noise | 96.95 | 96.78 | 95.30 | 95.18 | 1.65 |
| Smooth | 96.72 | 96.55 | 94.60 | 94.15 | 2.12 |
| Scale + Noise | 97.10 | 96.94 | 95.80 | 95.65 | 1.30 |
| Scale + Smooth | 97.18 | 97.02 | 96.00 | 95.88 | 1.18 |
| Noise + Smooth | 97.33 | 97.18 | 96.30 | 96.15 | 1.03 |
| Full (S + N + B) | 97.64 | 97.47 | 96.70 | 96.87 | 0.94 |

Full FSSR: +2.50% BRISC accuracy; 49% reduction in domain gap.
Table 16. Summary: Full FSSR vs. baseline and best single component.

| Architecture | Baseline BRISC | Best Single BRISC | Full FSSR BRISC | Improvement vs. Base | Improvement vs. Single |
|---|---|---|---|---|---|
| ResNet-18 | 90.50 | 89.40 (N) | 90.50 | +0.00 | +1.10 |
| ResNet-34 | 85.50 | 90.80 (N) | 93.70 | +8.20 | +2.90 |
| DenseNet-121 | 94.20 | 95.50 (S) | 96.70 | +2.50 | +1.20 |
| Average | 90.07 | 91.90 | 93.63 | +3.57 | +1.73 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kakon, S.C.; Jamwal, H.D.S.; Singh, S. Improving Cross-Domain Generalization in Brain MRIs via Feature Space Stability Regularization. Mathematics 2026, 14, 1082. https://doi.org/10.3390/math14061082
