1. Introduction
Brain tumor classification using magnetic resonance imaging (MRI) is a fundamental task in neuro-oncology, supporting diagnosis, treatment planning, and longitudinal monitoring [1]. MRI provides rich soft-tissue contrast and multi-sequence information, making it the imaging modality of choice for brain tumor assessment [2]. In recent years, deep learning, particularly Convolutional Neural Networks (CNNs), has become the dominant approach for automated brain tumor classification, achieving state-of-the-art performance on several public datasets [3,4,5]. Compared with traditional machine learning methods based on handcrafted features, CNN-based models learn hierarchical representations directly from imaging data, which underpins their superior discriminative performance [6,7].
However, growing evidence suggests that strong in-dataset accuracy does not necessarily imply clinical reliability [8]. Deep learning models trained on a single dataset often suffer significant performance degradation when evaluated on unseen data acquired from different institutions, scanners, or imaging protocols [9,10]. This phenomenon, commonly referred to as domain shift, arises from variations in intensity distributions, acquisition parameters, preprocessing pipelines, and patient populations [11]. In brain MRI specifically, intensity non-standardization across scanners and acquisition protocols has been shown to substantially affect model predictions [12,13]. As a result, models optimized solely for predictive accuracy may rely on dataset-specific cues rather than learning robust, generalizable representations [14]. Several strategies have been proposed to mitigate domain shift in medical imaging, including data augmentation [15], Domain Adaptation [16], adversarial learning [17], normalization schemes [18], and architectural modifications [19].
While these approaches can improve model robustness, many require access to target-domain data during training or involve complex optimization procedures that limit their practical deployment [20]. Moreover, most existing methods focus on aligning input distributions or model parameters, without explicitly constraining the stability of internal feature representations under realistic perturbations [21,22]. Recent studies suggest that instability in the latent feature space is a key contributor to poor generalization under domain shift, particularly in medical imaging tasks where intensity variations are prevalent [23].
Motivated by these observations, this work focuses on improving cross-dataset generalization by enforcing stability at the feature representation level rather than at the pixel or parameter level. This manuscript introduces Feature Space Stability Regularization (FSSR), a lightweight, model-agnostic training framework that explicitly encourages consistent latent representations for original MRI scans and their intensity-perturbed counterparts. By targeting feature-level robustness under MRI-safe perturbations, FSSR aims to improve generalization to unseen clinical data without requiring architectural changes, additional annotations, or access to target-domain samples during training. Accordingly, this work makes the following contributions:
- 1. Feature space instability under realistic intensity perturbations is identified as a key factor limiting cross-dataset generalization in brain MRI classification, distinct from pixel-level or parameter-level variations.
- 2. Feature Space Stability Regularization (FSSR) is introduced as a lightweight and model-agnostic training framework that enforces consistency between normalized feature embeddings of original and MRI-safe intensity-perturbed inputs via an auxiliary loss.
- 3. External validation on the unseen BRISC-2025 dataset demonstrates that FSSR substantially improves generalization for deeper CNN architectures, achieving up to an 8.2% absolute accuracy improvement and a 12.5% macro-F1 improvement without retraining or fine-tuning, while revealing architecture-dependent robustness behavior.
- 4. Mechanistic feature space analysis shows that FSSR reduces mean feature deviation and feature variance, establishing a consistent empirical relationship between latent representation stability and cross-domain generalization performance.
3. Methodology
This section presents the proposed Feature Space Stability Regularization (FSSR) framework for robust brain tumor MRI classification under intensity-driven domain shift. We first describe the datasets and the problem formulation, followed by the preprocessing and intensity perturbation strategy. We then detail the network architectures, training objective, and optimization procedure, and conclude with the overall training framework and evaluation setup.
3.1. Datasets
Two publicly available brain MRI datasets with identical four-class taxonomies—glioma, meningioma, pituitary adenoma, and no tumor—were used for the model development and cross-domain evaluation.
Source domain (Kaggle). The source dataset was the Brain MRI Images for Brain Tumor Detection dataset obtained from Kaggle. It consists of T1-weighted contrast-enhanced axial brain MRI slices provided in JPEG format. The images exhibit heterogeneous intensity distributions reflecting multi-institutional acquisition and preprocessing variability. Data are organized into predefined Training and Testing directories [40].
The original training set was split into training and validation subsets using stratified random sampling (80% training; 20% validation) to preserve the class proportions. The original testing directory was retained as an in-domain held-out test set comprising 1311 images, with the following class distribution: glioma (), meningioma (), no tumor (), and pituitary adenoma ().
Target domain (BRISC-2025). The Brain MRI Image Set for Classification 2025 (BRISC-2025) dataset was used exclusively for external validation to assess zero-shot generalization. This dataset contains 1000 images with a balanced class distribution (250 per class), acquired using standardized 3T imaging protocols. Compared to the Kaggle dataset, BRISC-2025 exhibits distinct intensity normalization and preprocessing characteristics. No samples from BRISC-2025 were used during training, fine-tuning, or hyperparameter selection [41].
3.2. Problem Formulation
Let $\mathcal{D}_S = \{(x_i, y_i)\}_{i=1}^{N_S}$ denote the source domain dataset, where $x_i$ represents an MRI image and $y_i$ denotes the corresponding tumor class label, with $y_i \in \{1, 2, 3, 4\}$. The objective is to learn a classifier $f_\theta$ parameterized by $\theta$ that achieves strong predictive performance on $\mathcal{D}_S$ while remaining robust when deployed on an unseen target domain $\mathcal{D}_T$. The source and target domains differ primarily due to intensity-related distribution shifts arising from acquisition and preprocessing variability. Importantly, no target domain samples are available during training, reflecting a realistic clinical deployment scenario.
3.3. Image Preprocessing and Intensity Perturbation
Each MRI image undergoes a standardized preprocessing pipeline to ensure consistent input dimensions while minimizing unnecessary variability. Images are converted to grayscale to remove channel redundancy and resized to a fixed spatial resolution using area-based interpolation, which preserves anatomical structure while performing effective downsampling. No skull stripping, tumor segmentation, or handcrafted region-of-interest extraction is applied, allowing the network to learn discriminative features directly from the full brain image. This minimal preprocessing strategy improves reproducibility and avoids reliance on external segmentation tools.
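As an illustration, the grayscale conversion and area-based downsampling can be sketched in plain numpy. This is a minimal sketch, not the authors' pipeline: the target resolution (224 here) is an assumption, since the exact value is elided in the text, and the block-averaging below reproduces area interpolation only when the input side is an integer multiple of the target size.

```python
import numpy as np

def preprocess(img, size=224):
    """Grayscale conversion + area-style downsampling (block averaging).

    `size` is a hypothetical target resolution. Block averaging mimics
    cv2.resize(..., interpolation=cv2.INTER_AREA) for integer factors.
    """
    if img.ndim == 3:                      # RGB -> grayscale (luma weights)
        img = img @ np.array([0.299, 0.587, 0.114])
    h, w = img.shape
    fh, fw = h // size, w // size          # integer block factors
    img = img[: fh * size, : fw * size]    # crop to an exact multiple
    return img.reshape(size, fh, size, fw).mean(axis=(1, 3))

demo = np.arange(448 * 448, dtype=float).reshape(448, 448)
out = preprocess(demo, size=224)
print(out.shape)  # (224, 224)
```

Because each output pixel is the mean of its source block, the global mean intensity of the image is preserved, which is one reason area interpolation is preferred for downsampling over nearest-neighbor sampling.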
Let $x$ denote the preprocessed image. To emulate realistic domain shifts while preserving anatomical structure, an intensity-perturbed view $\tilde{x}$ is generated using a stochastic, non-geometric transformation operator $T(\cdot)$. The perturbation operator includes the following: (i) global intensity scaling with a factor sampled uniformly from $[0.9, 1.1]$; (ii) additive zero-mean Gaussian noise with standard deviation equal to 1% of the image intensity standard deviation; and (iii) spatial smoothing via an average pooling operation applied with 50% probability. Formally, $\tilde{x} = T(x)$, where $T$ is applied independently to each training sample.
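The three perturbation components can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the scaling range [0.9, 1.1] follows the factor quoted in the feature-stability protocol later in the paper, while the 2×2 pooling kernel and the nearest-neighbor upsampling (used to keep the spatial size fixed) are assumptions.

```python
import numpy as np

def perturb(x, rng):
    """MRI-safe intensity perturbation T(x): a sketch of the three
    components. Kernel size 2x2 is an assumption; the source elides it."""
    # (i) global intensity scaling with a factor from [0.9, 1.1]
    x = x * rng.uniform(0.9, 1.1)
    # (ii) additive zero-mean Gaussian noise, sigma = 1% of the image std
    x = x + rng.normal(0.0, 0.01 * x.std(), size=x.shape)
    # (iii) 2x2 average pooling (then nearest upsampling to restore the
    # spatial size), applied with 50% probability
    if rng.random() < 0.5:
        h, w = x.shape
        pooled = x[: h - h % 2, : w - w % 2]
        pooled = pooled.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        x = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)
    return x

rng = np.random.default_rng(0)
img = rng.random((224, 224))
out = perturb(img, rng)
print(out.shape)  # (224, 224)
```

Note that all three operations act only on intensities, never on geometry, so anatomical structure and label semantics are preserved.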
3.4. Network Architecture and Feature Space Regularization
Each preprocessed image is forwarded through a Convolutional Neural Network (CNN) backbone to obtain a latent feature representation. Three widely used architectures are evaluated: ResNet-18, ResNet-34, and DenseNet-121. All networks are trained from scratch with random initialization and are adapted to accept single-channel grayscale input by modifying the first convolutional layer. The resulting latent feature dimensionality is 512 for the ResNet backbones and 1024 for DenseNet-121.
The central hypothesis of this study is that instability in latent feature representations under realistic intensity variations is a primary contributor to degraded cross-dataset generalization. To explicitly enforce robustness at the representation level, the proposed Feature Space Stability Regularization (FSSR) introduces an auxiliary loss that penalizes discrepancies between the embeddings extracted from the original and intensity-perturbed inputs. Let $z = f_\theta(x)$ and $\tilde{z} = f_\theta(\tilde{x})$ denote the feature embeddings of the original and perturbed images, respectively. The stability regularization loss is defined as
$$\mathcal{L}_{\text{FSSR}} = \left\lVert \frac{z}{\lVert z \rVert_2} - \frac{\tilde{z}}{\lVert \tilde{z} \rVert_2} \right\rVert_2^2,$$
which encourages consistency between normalized feature representations while preserving the discriminative structure.
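A batch-averaged version of the stability loss can be sketched in a few lines of numpy. This is a minimal sketch under the assumption that the embeddings are L2-normalized before the squared-distance is taken, as the text's reference to "normalized feature representations" suggests; the small epsilon is an implementation detail added for numerical safety.

```python
import numpy as np

def fssr_loss(z, z_tilde, eps=1e-8):
    """Stability loss between L2-normalized embeddings of original and
    perturbed views, averaged over the batch. Rows are samples."""
    zn = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    zt = z_tilde / (np.linalg.norm(z_tilde, axis=1, keepdims=True) + eps)
    return np.mean(np.sum((zn - zt) ** 2, axis=1))

z = np.array([[1.0, 0.0], [0.0, 2.0]])
print(fssr_loss(z, z))  # identical embeddings -> 0.0
```

Because only the directions of the embeddings are compared, the loss is invariant to the overall feature magnitude, which leaves the classifier free to scale activations for discriminative purposes.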
For supervised classification, the latent features are passed to a classifier head consisting of a single fully connected layer followed by a softmax activation. The categorical cross-entropy loss is used to optimize the classification performance.
3.5. Total Training Objective
The final training objective jointly optimizes discriminative accuracy and the robustness of feature representations to intensity-driven perturbations. The overall loss function is given by
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{FSSR}},$$
where $\lambda$ controls the relative contribution of the stability regularization term. In all experiments, a fixed value of $\lambda$, chosen through preliminary hyperparameter tuning, is used. Setting $\lambda = 0$ reduces the formulation to the standard cross-entropy objective, enabling direct ablation and assessment of the proposed regularization.
3.6. Overall Training Framework
During each training iteration, paired original and intensity-perturbed images are forwarded through the same CNN backbone with shared parameters. The classification loss is computed using the original images, while the FSSR loss measures the discrepancy between the feature embeddings of the original and perturbed inputs. Gradients from both objectives are backpropagated jointly, encouraging the network to learn intensity-invariant representations without sacrificing the discriminative performance.
Figure 1 provides a schematic overview of the proposed FSSR training framework.
3.7. Backbone Architectures and Rationale
Three Convolutional Neural Network (CNN) architectures with varying depth and connectivity are evaluated to assess the architectural generality and robustness of the proposed Feature Space Stability Regularization (FSSR) framework.
ResNet-18 and ResNet-34. Residual networks employ identity-based skip connections to stabilize optimization and alleviate the vanishing-gradient problem in deep neural networks. A residual block is defined as
$$y = F(x) + x,$$
where $x$ and $y$ denote the input and output feature maps of the block, respectively, and $F(\cdot)$ represents a sequence of convolutional, normalization, and non-linear operations.
ResNet-18, comprising approximately 11.7 million parameters, serves as a lightweight baseline architecture with limited depth. ResNet-34 increases the network depth and representational capacity to approximately 21.8 million parameters while preserving the same residual learning paradigm. Evaluating both variants enables the examination of the proposed regularization across different model capacity regimes.
DenseNet-121. DenseNet architectures adopt dense connectivity, in which each layer receives as input the concatenation of the feature maps of all preceding layers within a dense block. The output of the ℓ-th layer is defined as
$$x_\ell = H_\ell\!\left([x_0, x_1, \ldots, x_{\ell-1}]\right),$$
where $[\cdot]$ denotes feature-map concatenation and $H_\ell(\cdot)$ represents a composite function consisting of batch normalization, convolution, and activation.
This connectivity pattern encourages feature reuse, improves gradient flow, and promotes parameter efficiency. DenseNet-121 contains approximately 8.0 million parameters and produces higher-dimensional latent representations compared to the ResNet backbones. Including DenseNet-121 allows for the evaluation of FSSR under a fundamentally different feature aggregation mechanism.
A summary of the backbone architectures and their key characteristics is provided in Table 2, and a summary of the models’ characteristics is provided in Table 3.
All models are trained end-to-end using identical optimization settings to ensure a fair comparison across architectures and training strategies. Optimization is performed using the AdamW optimizer with a fixed initial learning rate and a batch size of 32 for all experiments. Network parameters are initialized using the default PyTorch 2.1.2 initialization scheme, and the learning rate is kept constant throughout training to maintain consistent experimental conditions across all evaluated models.
Training continues until convergence, with an early stopping strategy applied to prevent overfitting and to ensure that model selection reflects generalization performance rather than training convergence. During training, model performance is continuously monitored on a held-out validation subset. Early stopping is triggered when the validation loss does not improve for five consecutive evaluation cycles. When this criterion is met, training is terminated and the model parameters corresponding to the best validation performance are restored for final evaluation. This strategy prevents unnecessary optimization once the model begins to overfit the training data and ensures that the selected model corresponds to the most generalizable configuration observed during training.
Using validation-based early stopping also stabilizes the optimization process and reduces the sensitivity to training noise. In practice, it avoids prolonged training after convergence while maintaining a consistent training behavior across different architectures. Since the ResNet-18, ResNet-34, and DenseNet-121 backbones differ substantially in relation to their representational capacity and optimization dynamics, early stopping provides a unified and architecture-agnostic mechanism for determining when training should terminate.
Mixed precision training (FP16) is employed to accelerate computation and reduce GPU memory usage while maintaining numerical stability. All experiments are conducted on a single NVIDIA RTX 3090 GPU with 24 GB of memory. Apart from the proposed Feature Space Stability Regularization (FSSR) objective, no architectural modifications, auxiliary networks, or additional supervision signals are introduced, ensuring that the observed performance improvements arise solely from the proposed regularization mechanism.
The regularization weight $\lambda$ is set to a moderate value that balances classification performance and feature space stability. Preliminary ablation experiments indicate that smaller values of $\lambda$ provide insufficient regularization and yield only marginal improvements, whereas excessively large values overly constrain the feature representations and hinder discriminative learning. The same $\lambda$ is therefore adopted for all experiments to maintain a stable trade-off between robustness and predictive accuracy while avoiding architecture-specific hyperparameter tuning.
3.8. Feature Space Stability Measurement Protocol
To quantify representation robustness and provide mechanistic insight into the effect of FSSR, feature space stability is assessed by measuring the deviation between embeddings extracted from original and intensity-perturbed inputs. Specifically, the mean feature deviation across the evaluation set is computed as
$$\bar{d} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert f_\theta(x_i) - f_\theta(\tilde{x}_i) \right\rVert_2,$$
where $f_\theta(x_i)$ and $f_\theta(\tilde{x}_i)$ denote the latent feature representations of the original and perturbed inputs, respectively, and $N$ is the number of samples. Lower values of $\bar{d}$ indicate increased stability of the learned representations under intensity perturbations, suggesting improved robustness to domain shift.
3.9. External Generalization Evaluation
The generalization performance is evaluated via zero-shot external testing on the unseen BRISC-2025 dataset, which was acquired using different scanner hardware and imaging protocols than the source training data. No retraining, fine-tuning, or Domain Adaptation is performed at any stage. The models trained on the source domain are directly applied to the external dataset without modification.
The classification performance is assessed using the overall accuracy and macro-averaged F1-score to account for potential class imbalance. In addition, the per-class precision and recall are reported to identify systematic misclassification patterns and class-specific generalization behavior.
4. Results
This section presents a comprehensive evaluation of the proposed Feature Space Stability Regularization (FSSR) framework for brain tumor MRI classification. The experimental results are reported across multiple dimensions: (i) source domain classification performance on the Kaggle Brain MRI test set (1311 images); (ii) zero-shot external generalization under domain shift using the BRISC-2025 dataset (1000 images); (iii) model calibration quality assessed using Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Brier score, and Negative Log-Likelihood (NLL); (iv) statistical significance analysis via bootstrap resampling with 95% confidence intervals; and (v) feature space stability analysis providing mechanistic insight into the effectiveness of FSSR.
All experiments are conducted using three backbone architectures—ResNet-18, ResNet-34, and DenseNet-121—trained from scratch without ImageNet pretraining. Unless otherwise stated, FSSR denotes the Feature Space Stability Regularization framework with the stability regularization term enabled. ECE denotes the Expected Calibration Error and NLL denotes the Negative Log-Likelihood.
4.1. Source Domain Classification Performance
Table 4 summarizes the classification and calibration performance on the Kaggle Brain MRI test set. Overall, Feature Space Stability Regularization (FSSR) yields consistent performance gains for deeper architectures while maintaining comparable performance for shallower networks.
Among all the evaluated models, ResNet-34 with FSSR achieves the highest classification accuracy of 97.71% and a macro-averaged F1-score of 97.55%, representing an absolute improvement of 1.98 percentage points over its baseline counterpart (95.73%). DenseNet-121 similarly benefits from the proposed regularization, with its accuracy improving from 96.03% to 97.64% (+1.61%) and the macro-F1 increasing from 95.75% to 97.47%. In addition to accuracy gains, both architectures exhibit substantial improvements in calibration metrics, including a reduced Expected Calibration Error (ECE), Brier score, and Negative Log-Likelihood (NLL).
In contrast, ResNet-18 demonstrates a marginal reduction in accuracy under FSSR (95.50% to 95.27%). This suggests that shallow architectures with limited representational capacity may not fully benefit from explicit feature space stability constraints. This architecture-dependent behavior is further examined through feature stability analysis in Section 4.7.
A detailed breakdown of the misclassification patterns is provided through the confusion matrices shown in Figure 2, Figure 3 and Figure 4. Across all architectures, the dominant source of error corresponds to confusion between the glioma and meningioma classes, a clinically relevant distinction due to differing treatment strategies. For ResNet-34, FSSR reduces glioma-to-meningioma misclassifications from 30 cases under baseline training to 11 cases, corresponding to a 63% reduction. Meningioma-to-glioma errors remain stable at eight cases for both methods.
Notably, all FSSR models achieve a perfect classification performance for the no tumor class (405/405 correct predictions), which is critical for minimizing false positive diagnoses in healthy patients.
4.2. Receiver Operating Characteristic Analysis
The receiver operating characteristic (ROC) analysis presented in Figure 5, Figure 6 and Figure 7 demonstrates excellent discriminative performance across all model configurations. ResNet-34 with Feature Space Stability Regularization (FSSR) achieves perfect or near-perfect per-class AUC values on the source domain, including glioma (0.998), meningioma (0.997), no_tumor (1.000), and pituitary (1.000), resulting in a macro-AUC of 0.998. DenseNet-121 FSSR exhibits comparable performance, achieving the same macro-AUC value. Comprehensive classification metrics for all the evaluated architectures on the primary Kaggle Brain MRI test set are summarized in Table 4.
4.3. External Generalization Under Domain Shift
The central hypothesis of this work is that Feature Space Stability Regularization improves cross-dataset generalization without requiring target domain data during training. To evaluate this, all the trained models were tested on the BRISC-2025 external validation dataset (N = 1000 images), which differs substantially from the source domain in terms of the scanner hardware, imaging protocols, and patient demographics. Importantly, no retraining or Domain Adaptation was performed—models trained exclusively on Kaggle data were applied directly in a zero-shot transfer setting.
As shown in Table 5, the most notable finding is the substantial improvement for ResNet-34: its accuracy increases from 85.50% (baseline) to 93.70% (FSSR), representing a gain of +8.20 percentage points (p < 0.001) and a 56% error reduction. The macro-F1 improvement is even larger, rising from 80.12% to 92.62% (+12.50 points), demonstrating that the FSSR benefits extend to class-balanced performance.
DenseNet-121 with FSSR achieves the best absolute external performance at 96.70% accuracy and 96.87% macro-F1. The domain gap decreases from 1.83% (baseline) to 0.94%—a 49% relative reduction—indicating that dense connectivity combined with feature space regularization yields inherently robust representations. ResNet-18 presents an exception: both the baseline and FSSR achieve identical 90.50% accuracy, with only marginal macro-F1 improvement (89.38% → 89.49%), suggesting that shallow networks may lack the representational capacity to exploit feature space constraints effectively.
These architecture-dependent patterns provide key insights into the interaction between network capacity and regularization. The ResNet-34 baseline exhibits the largest domain gap (10.23%) despite strong source performance (95.73%), indicating that deeper residual networks are prone to overfitting domain-specific features without explicit regularization. FSSR reduces this gap to 4.01%, a 61% relative improvement. In contrast, the DenseNet-121 baseline already generalizes effectively (1.83% gap), likely due to dense connectivity promoting feature reuse; nevertheless, FSSR still provides measurable gains.
4.4. Statistical Significance Analysis
To confirm that the observed improvements are statistically meaningful rather than due to sampling variability, we performed bootstrap hypothesis testing. For each model comparison, 1000 bootstrap samples were generated from test set predictions, and 95% confidence intervals (CI) were computed for accuracy differences between FSSR and baseline. The results were considered statistically significant when the CI excluded zero.
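The paired bootstrap described above can be sketched as follows. This is an illustrative sketch under the stated protocol (1000 resamples, 95% percentile intervals); the seed and the per-sample correctness encoding are assumptions.

```python
import numpy as np

def bootstrap_ci(correct_a, correct_b, n_boot=1000, seed=42):
    """Paired bootstrap over test-set predictions: resample indices with
    replacement, recompute the accuracy difference (A - B), and return
    the 2.5/97.5 percentiles of the resampled differences."""
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample with replacement
        diffs.append(correct_a[idx].mean() - correct_b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi

# toy example: model A correct on ~95% of items, model B on ~85%
rng = np.random.default_rng(0)
a = (rng.random(1000) < 0.95).astype(float)
b = (rng.random(1000) < 0.85).astype(float)
lo, hi = bootstrap_ci(a, b)
print(lo > 0)  # CI excludes zero -> the difference is significant
```

Resampling indices jointly for both models preserves the pairing between predictions on the same test images, which is what makes the interval a valid test of the accuracy difference.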
Table 6 summarizes the complete statistical analysis. On the source domain (Kaggle), FSSR yields significant improvements for ResNet-34 (Δ = +1.99%; 95% CI: [0.99%, 3.13%]) and DenseNet-121 (Δ = +1.59%; 95% CI: [0.46%, 2.75%]). ResNet-18 shows no significant difference (Δ = +0.23%; 95% CI: [−0.84%, 1.37%]).
Under domain shift (BRISC-2025), the statistical evidence strengthens considerably. ResNet-34 FSSR achieves the largest effect size observed in this study (+8.25%; 95% CI: [6.10%, 10.30%]), with the narrow confidence interval indicating high statistical confidence. DenseNet-121 FSSR also demonstrates significant improvement (+2.49%; 95% CI: [1.00%, 4.10%]), while ResNet-18 remains non-significant.
4.5. Model Calibration Under Domain Shift
Beyond discriminative accuracy, well-calibrated probability estimates are essential for clinical decision support systems where prediction confidence directly influences diagnostic workflows [42]. A perfectly calibrated model produces predicted probabilities that match empirical accuracy: when predicting 80% confidence, the model should be correct approximately 80% of the time. In medical imaging applications, miscalibrated predictions can lead to inappropriate clinical decisions, making calibration assessment a critical component of model evaluation [43].
To quantify calibration quality, we evaluate four widely adopted metrics: the Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Brier score, and Negative Log-Likelihood (NLL) [44,45].
The ECE measures the average discrepancy between predicted confidence and empirical accuracy across $M$ confidence bins and is defined as
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $B_m$ denotes the set of samples whose predicted confidence falls within bin $m$, $\mathrm{acc}(B_m)$ is the empirical accuracy of that bin, and $\mathrm{conf}(B_m)$ is the mean predicted probability [44].
The Maximum Calibration Error (MCE) captures the worst-case deviation between confidence and accuracy:
$$\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$$
The Brier score evaluates the mean squared difference between the predicted probabilities $\hat{p}_{i,k}$ and the one-hot true labels $y_{i,k}$ [46]:
$$\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left( \hat{p}_{i,k} - y_{i,k} \right)^2.$$
Finally, the Negative Log-Likelihood (NLL) measures the log probability assigned to the correct class:
$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{i, y_i}.$$
Lower values indicate better-calibrated and more reliable probability estimates. These metrics are widely used to evaluate predictive uncertainty in deep learning models, particularly in safety-critical medical applications [43,44,45].
On the source domain, all models exhibit acceptable calibration with low ECE values. However, under domain shift, baseline models show substantial calibration degradation. Most notably, the ResNet-34 baseline exhibits an ECE of approximately 0.117 on BRISC-2025, indicating that the predicted confidence systematically overestimates true accuracy by approximately 11.7 percentage points, a potentially critical issue in clinical settings where overconfident incorrect predictions may mislead clinicians.
As summarized in Table 7, the proposed FSSR substantially improves calibration under domain shift. For example, ResNet-34 with FSSR reduces the ECE by 66% relative to its baseline. Similarly, the Brier score improves by 60%. DenseNet-121 combined with FSSR achieves the best calibration overall, with an ECE 49% lower than its baseline. These results indicate that enforcing feature space stability not only improves cross-domain classification performance but also produces probability estimates that more faithfully reflect model uncertainty, which is critical for reliable deployment in medical decision support systems [42].
4.6. Comparative Baselines
To contextualize the effectiveness of the proposed Feature Space Stability Regularization (FSSR), we compare it against representative approaches designed to improve robustness and generalization under domain shift. Specifically, we consider two widely used strategies that operate at different stages of the learning process: feature distribution mixing and contrastive representation learning.
MixStyle. MixStyle is a Domain Generalization technique that improves robustness by mixing instance-level feature statistics during training [47]. Given a feature map $F$ extracted from an intermediate network layer, MixStyle perturbs the feature distribution by randomly mixing the channel-wise mean and standard deviation between two samples within a mini-batch.
Let $\mu(F)$ and $\sigma(F)$ denote the channel-wise mean and standard deviation of the feature map. MixStyle generates a new feature representation as
$$\tilde{F} = \sigma_{\mathrm{mix}} \, \frac{F - \mu(F)}{\sigma(F)} + \mu_{\mathrm{mix}},$$
where $\mu_{\mathrm{mix}} = \lambda_{\mathrm{mix}} \mu(F) + (1 - \lambda_{\mathrm{mix}})\,\mu(F')$ and $\sigma_{\mathrm{mix}} = \lambda_{\mathrm{mix}} \sigma(F) + (1 - \lambda_{\mathrm{mix}})\,\sigma(F')$ are the statistics mixed with those of a second sample $F'$, and $\lambda_{\mathrm{mix}} \sim \mathrm{Beta}(\alpha, \alpha)$ controls the mixing strength. This operation encourages the model to learn representations that are less sensitive to domain-specific feature statistics.
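The statistics-mixing operation can be sketched for a single pair of feature maps. This is a simplified sketch of the idea, not the reference implementation: the Beta concentration α = 0.1 follows the original MixStyle paper and is an assumption here, and the operation is shown per sample rather than vectorized over a mini-batch.

```python
import numpy as np

def mixstyle(f, f2, alpha=0.1, rng=None):
    """Mix channel-wise feature statistics of f with those of a second
    sample f2. f, f2: (C, H, W) feature maps."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # mixing coefficient
    mu, sig = f.mean((1, 2), keepdims=True), f.std((1, 2), keepdims=True) + 1e-6
    mu2, sig2 = f2.mean((1, 2), keepdims=True), f2.std((1, 2), keepdims=True) + 1e-6
    mu_mix = lam * mu + (1 - lam) * mu2        # mixed channel-wise mean
    sig_mix = lam * sig + (1 - lam) * sig2     # mixed channel-wise std
    return sig_mix * (f - mu) / sig + mu_mix   # re-style the normalized map

rng = np.random.default_rng(0)
f, f2 = rng.random((64, 8, 8)), rng.random((64, 8, 8))
out = mixstyle(f, f2, rng=rng)
print(out.shape)  # (64, 8, 8)
```

Note the contrast with FSSR: MixStyle perturbs the feature statistics themselves, whereas FSSR perturbs the input intensities and constrains the resulting features to stay put.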
SimCLR-based Representation Learning. SimCLR is a contrastive self-supervised learning framework that learns invariant feature representations by maximizing the agreement between augmented views of the same image [48]. Given two augmented views $x_i$ and $x_j$ of the same image, their corresponding feature embeddings $z_i$ and $z_j$ are encouraged to be similar in the representation space while remaining distinct from the other samples in the batch.
The contrastive loss used in SimCLR is defined as
$$\mathcal{L}_{i,j} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)},$$
where $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is a temperature scaling parameter, and $B$ is the mini-batch size.
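The NT-Xent loss can be sketched in numpy for a batch of paired views. This is an illustrative sketch, not the comparison baseline's code; the temperature τ = 0.5 and the layout (views stacked so that row i pairs with row i + B) are assumptions.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss. z1, z2: (B, D) embeddings of two views;
    matching rows are positive pairs."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = (np.arange(n) + n // 2) % n                  # index of each positive
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
loss_aligned = nt_xent(z1, z1)                   # perfectly matching views
loss_random = nt_xent(z1, rng.normal(size=(8, 32)))
print(loss_aligned < loss_random)  # aligned views give lower loss
```

Unlike the FSSR loss, which only pulls paired embeddings together, NT-Xent also pushes apart all other samples in the batch, which is why it typically requires larger batches and a pretraining stage.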
FSSR (Proposed Method). The proposed Feature Space Stability Regularization explicitly enforces consistency between the feature representations extracted from the original and intensity-perturbed MRI images. Given an input image $x$ and its perturbed version $\tilde{x} = T(x)$, the stability constraint is defined as
$$\mathcal{L}_{\text{FSSR}} = \left\lVert \frac{f_\theta(x)}{\lVert f_\theta(x) \rVert_2} - \frac{f_\theta(\tilde{x})}{\lVert f_\theta(\tilde{x}) \rVert_2} \right\rVert_2^2.$$
The overall training objective becomes
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{FSSR}}.$$
Unlike MixStyle or SimCLR, which modify the feature distributions or require contrastive pretraining, FSSR directly constrains representation stability during supervised training without introducing additional networks or training stages.
Experimental Setup for Fair Comparison. To ensure a rigorous comparison, all the evaluated methods were trained using identical experimental settings. Specifically, all models used the same random seed (seed = 42), identical training–validation splits, preprocessing pipelines, optimizer configurations (AdamW with a fixed learning rate and batch size 32), and early stopping criteria. This controlled setup ensures that any differences in performance arise from the regularization strategies rather than confounding factors such as initialization or data ordering.
The resulting performance across architectures is summarized in Table 8. The results report both the source domain (Kaggle) performance and zero-shot external domain (BRISC-2025) performance, allowing for the direct comparison of cross-domain robustness.
Analysis of Results. Architecture-dependent effectiveness. FSSR demonstrates the strongest improvements on DenseNet-121, achieving 97.64% source domain accuracy and 96.70% external domain accuracy, corresponding to a domain gap of only 0.94%. This substantially outperforms SimCLR (2.65% gap) and MixStyle (3.27% gap).
Performance on ResNet architectures. For ResNet-18, FSSR maintains a comparable source domain accuracy (95.50% baseline vs 95.27% FSSR) while achieving an identical external accuracy of 90.50% on BRISC-2025. The domain gap shows only a marginal reduction from 5.00% to 4.77%, consistent with the observation that shallow architectures derive limited benefit from feature space stability constraints.
For ResNet-34, the improvements are substantially more pronounced. FSSR increases the source accuracy from 95.73% to 97.71% and external accuracy from 85.50% to 93.70%, reducing the domain gap from 10.23% to 4.01%. These results indicate that deeper architectures benefit substantially from Feature Space Stability Regularization.
SimCLR limitations. Although SimCLR slightly improves the source domain accuracy for some architectures, it does not consistently reduce the domain shift. For example, on ResNet-18, the domain gap increases from 5.00% to 8.88%, suggesting that contrastive representation learning alone may not adequately address the intensity-driven distribution shifts common in medical imaging.
MixStyle trade-offs. MixStyle generally reduces the domain gap compared to baseline models, but often at the expense of reduced source domain performance. For instance, on ResNet-18 the source accuracy drops from 95.50% to 91.30%, indicating that distribution mixing may compromise discriminative performance while improving robustness.
Overall comparison. Across all architectures, FSSR consistently achieves the best balance between source domain accuracy and cross-domain generalization. The method reduces domain gaps while maintaining a strong classification performance, and does so without requiring architectural modifications or additional training stages. This simplicity makes FSSR a practical and effective approach for improving robustness in medical imaging applications subject to domain shift.
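The domain gap used throughout this comparison is simply the difference between source and zero-shot external accuracy. A minimal helper (hypothetical names; the FSSR figures are taken from the DenseNet-121 row of Table 8, the SimCLR/MixStyle gaps are those quoted above) illustrates the selection of the most robust method:

```python
def domain_gap(source_acc: float, external_acc: float) -> float:
    """Domain gap = source accuracy minus zero-shot external accuracy (percentage points)."""
    return source_acc - external_acc

# DenseNet-121 comparison: FSSR gap computed from Table 8 accuracies;
# SimCLR and MixStyle gaps are quoted directly in the text.
results = {
    "FSSR": domain_gap(97.64, 96.70),
    "SimCLR": 2.65,
    "MixStyle": 3.27,
}
best = min(results, key=results.get)  # method with the smallest domain gap
print(best, round(results["FSSR"], 2))
```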
4.7. Feature Space Stability Analysis
To validate the mechanistic hypothesis underlying FSSR—that regularizing feature space stability leads to more robust representations—we analyzed embedding variability under input perturbations. For each validation sample x, we computed the ℓ2 distance between the feature vectors of the original image and its intensity-perturbed counterpart:

Δ(x) = ‖f(x) − f(x̃)‖₂,

where x̃ represents an intensity-scaled version of x (scale factor sampled from [0.9, 1.1]) and f(·) denotes the learned 512-dimensional (ResNet) or 1024-dimensional (DenseNet) feature extractor.
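As a concrete illustration, this per-sample embedding deviation can be computed from extracted features. The NumPy sketch below mocks the trained feature extractor with a random linear projection (an assumption made only to keep the example self-contained); the intensity perturbation follows the scale range stated above:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(batch: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Stand-in for the learned extractor f(.): flatten each image + linear projection."""
    return batch.reshape(batch.shape[0], -1) @ proj

def feature_deviation(batch: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """L2 distance between embeddings of each image and its intensity-scaled copy."""
    scales = rng.uniform(0.9, 1.1, size=(batch.shape[0], 1, 1, 1))
    perturbed = batch * scales  # global intensity scaling, as described in the text
    return np.linalg.norm(
        extract_features(batch, proj) - extract_features(perturbed, proj), axis=1
    )

images = rng.random((8, 1, 32, 32))          # 8 toy grayscale validation samples
proj = rng.standard_normal((32 * 32, 512))   # mock 512-dim embedding (ResNet-like)
dev = feature_deviation(images, proj)
print(dev.shape)  # one deviation value per sample
```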
As shown in
Table 9, FSSR consistently reduces both the mean and variance of feature deviations across all architectures. ResNet-18 shows the mean deviation decreasing from 4.513 to 3.602 (−20.2%) with the standard deviation reduced from 5.182 to 2.467 (−52.4%). ResNet-34 exhibits similar improvement: the mean reduces from 4.506 to 3.499 (−22.3%) and the standard deviation from 5.136 to 2.313 (−55.0%). DenseNet-121 achieves the largest relative reduction: the mean reduces from 4.565 to 3.474 (−23.9%) and the standard deviation from 5.157 to 2.168 (−58.0%).
The histogram analysis in
Figure 8,
Figure 9 and
Figure 10 reveals qualitative differences in feature stability distributions. Baseline models exhibit long-tailed distributions with deviations extending beyond 40–50 units, indicating dramatic representation instability for certain inputs, likely the most vulnerable samples under domain shift. FSSR effectively eliminates these high-deviation outliers, reducing the 99th percentile deviation by approximately 60% across all the architectures.
This stabilization provides a plausible explanation for the improved generalization observed in deeper architectures. Representations that are invariant to intensity perturbations during training, which simulate inter-scanner variability, are inherently more robust when encountering novel acquisition protocols at test time. For ResNet-34 and DenseNet-121, the correlation between stability improvements and generalization gains is consistent and substantial. However, ResNet-18 presents a notable exception: despite comparable reductions in feature deviation (−20.2%) and variance (−52.4%), no meaningful generalization improvement is observed. This suggests that feature space stability is a necessary but not sufficient condition for cross-domain robustness, and that sufficient representational capacity is required for stability gains to translate into improved classification performance. Overall, these results support a strong empirical association between latent representation stability and cross-domain generalization in architectures with adequate depth, rather than a universal causal relationship.
4.8. Ablation Study: Regularization Weight
To investigate the sensitivity of FSSR to the regularization weight λ, we conducted a systematic ablation study across all three backbone architectures. The models were trained over a range of λ values, where λ = 0 corresponds to standard cross-entropy training without feature space regularization. All other training parameters remained identical to ensure a fair comparison.
Table 10,
Table 11 and
Table 12 present the complete results of the
λ sensitivity analysis. Several key observations emerge from this experiment:
Effect of λ on cross-domain generalization. For deeper architectures, introducing Feature Space Stability Regularization (λ > 0) substantially improves generalization to the external BRISC-2025 dataset compared to baseline training (λ = 0). For ResNet-34, BRISC accuracy increases from 85.50% (λ = 0) to 93.70% at the best setting, representing an 8.20 percentage point improvement and a 56% reduction in the domain gap. DenseNet-121 shows similar trends, with the domain gap decreasing from 1.83% to 0.94% at the optimal value. However, for ResNet-18, FSSR does not yield consistent generalization gains: most λ values result in marginally lower BRISC accuracy than the baseline (λ = 0), with only the best-performing setting matching the baseline accuracy of 90.50%. This architecture-dependent behavior suggests that shallow networks with limited representational capacity may be unable to simultaneously satisfy the classification and stability objectives, and that the benefits of FSSR are contingent on sufficient model depth.
Optimal λ range. The results indicate that moderate values of λ (0.02–0.05) consistently yield the best trade-off between source domain accuracy and cross-domain generalization. Very small values provide insufficient regularization, while excessively large values over-constrain the feature space and reduce discriminative capacity.
Architecture-dependent sensitivity. Deeper architectures exhibit greater sensitivity to λ and derive larger benefits from feature space regularization. ResNet-18 shows only modest improvement across all λ values (domain gap reduction from 5.00% to 4.77%), whereas ResNet-34 demonstrates a dramatic improvement (10.23% to 4.01%). This observation is consistent with the hypothesis that deeper networks possess the representational capacity to simultaneously optimize classification accuracy and feature stability constraints.
Based on these results, a value of λ within the optimal range was selected for all main experiments, as it provides a stable balance between classification performance and cross-domain robustness across architectures and avoids the need for architecture-specific hyperparameter tuning.
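The objective studied in this ablation—cross-entropy plus a λ-weighted feature-stability penalty—can be sketched as follows. The squared-distance form of the penalty and the value λ = 0.05 are illustrative assumptions (the results above only establish that moderate λ in 0.02–0.05 works best):

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy over a batch (labels are integer class indices)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def fssr_loss(logits, labels, feats_clean, feats_perturbed, lam=0.05):
    """Total loss = CE + lam * mean squared feature deviation (stability term)."""
    stability = float(((feats_clean - feats_perturbed) ** 2).sum(axis=1).mean())
    return cross_entropy(logits, labels) + lam * stability

rng = np.random.default_rng(1)
logits = rng.standard_normal((4, 4))  # 4 samples, 4 tumor classes
labels = np.array([0, 1, 2, 3])
f_clean = rng.standard_normal((4, 512))
loss0 = fssr_loss(logits, labels, f_clean, f_clean)        # identical features
loss1 = fssr_loss(logits, labels, f_clean, f_clean + 0.1)  # perturbed features
print(loss1 > loss0)  # the stability penalty only ever increases the loss
```

At λ = 0 the second term vanishes and the objective reduces to the standard cross-entropy baseline, matching the ablation's reference configuration.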
To investigate the contribution of each perturbation component to the overall effectiveness of FSSR, we conducted a systematic ablation study examining all possible combinations of the three perturbation types: intensity scaling (S), Gaussian noise injection (N), and spatial smoothing (B, for blur). The models were trained with the selected regularization weight λ
using each perturbation configuration, and evaluated on both the source domain (Kaggle) and external domain (BRISC-2025). The “None” configuration corresponds to standard cross-entropy training without any perturbations (equivalent to the baseline reported in
Table 4 and
Table 5), while “Full” denotes the complete FSSR pipeline combining all three components.
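The three perturbation components can be sketched in NumPy as follows. Only the intensity-scale range [0.9, 1.1] is stated in the text; the noise level and the box-blur stand-in for spatial smoothing are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def intensity_scale(img: np.ndarray) -> np.ndarray:
    """(S) Global intensity scaling with a factor drawn from [0.9, 1.1]."""
    return img * rng.uniform(0.9, 1.1)

def add_noise(img: np.ndarray, sigma: float = 0.02) -> np.ndarray:
    """(N) Additive Gaussian noise; sigma is an assumed value."""
    return img + rng.normal(0.0, sigma, size=img.shape)

def smooth(img: np.ndarray) -> np.ndarray:
    """(B) Spatial smoothing via a simple 3x3 box blur (stand-in for Gaussian blur)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + img.shape[0],
                          1 + dx : 1 + dx + img.shape[1]]
    return out / 9.0

def full_fssr_perturbation(img: np.ndarray) -> np.ndarray:
    """'Full' configuration: Scale + Noise + Smooth applied in sequence."""
    return smooth(add_noise(intensity_scale(img)))

img = rng.random((32, 32))
out = full_fssr_perturbation(img)
print(out.shape == img.shape)
```

Dropping any one call from `full_fssr_perturbation` yields the corresponding two-component configuration evaluated in the ablation.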
Key Findings. The ablation study reveals several important insights regarding the contribution of each perturbation component:
(1) Full FSSR consistently achieves the best cross-domain generalization. Across all three architectures, the complete perturbation pipeline (Scale + Noise + Smooth) yields the highest external accuracy and the lowest domain gap. For ResNet-34, Full FSSR achieves 93.70% BRISC accuracy with only a 4.01% domain gap, substantially outperforming the best single-component configuration (Noise: 90.80%; 5.38% gap) and the best two-component configuration (Noise+Smooth: 92.10%; 4.47% gap). For DenseNet-121, Full FSSR attains 96.70% external accuracy with the lowest domain gap of 0.94%, improving over the best partial combination (Noise + Smooth: 96.30%; 1.03% gap). For ResNet-18, the baseline already achieves 90.50% external accuracy, and Full FSSR is the only configuration that matches this performance while reducing the domain gap from 5.00% to 4.77%, whereas all partial configurations yield a lower BRISC accuracy than the baseline.
(2) Individual components provide partial but incomplete robustness. Among the single-component configurations, Gaussian noise injection provides the largest improvement in cross-domain generalization, particularly for ResNet-34 where it reduces the domain gap from 10.23% to 5.38%. However, noise alone does not match the comprehensive robustness achieved by the full pipeline. Intensity scaling improves the performance modestly across architectures, while spatial smoothing alone can paradoxically increase the domain gap for DenseNet-121 (2.12% vs 1.83% baseline), likely due to loss of fine-grained discriminative features.
(3) Synergistic effects emerge from component combinations. Two-component configurations consistently outperform individual components, and the full pipeline achieves a performance that exceeds all partial combinations. For ResNet-34, Scale+Noise achieves 91.40% BRISC accuracy and Noise+Smooth reaches 92.10%, yet Full FSSR attains 93.70%, demonstrating that the combination of all three perturbations provides complementary regularization effects that address different aspects of intensity-related domain shift. This progressive improvement from single to double to full configurations is consistent across all the architectures, supporting the interpretation that each perturbation component addresses a distinct source of intensity variability.
(4) Full FSSR provides architecture-agnostic robustness. Unlike partial configurations where optimal single or double combinations vary by architecture (e.g., Noise + Smooth for ResNet-34; Scale+Smooth for DenseNet-121), the complete FSSR pipeline consistently achieves the best performance across all evaluated backbones. This architecture-agnostic behavior eliminates the need for component selection and supports deployment in diverse clinical settings.
Table 16 summarizes the improvement achieved by Full FSSR over the baseline and the best single-component configuration.
These results demonstrate that the full FSSR pipeline, combining intensity scaling, Gaussian noise, and spatial smoothing, provides superior cross-domain generalization compared to any single component or partial combination across all architectures. The comprehensive perturbation strategy ensures robustness against diverse sources of intensity-related domain shift, including global intensity variations, local intensity fluctuations, and resolution differences. The consistent superiority of Full FSSR supports its adoption as the default configuration for clinical deployment where cross-scanner robustness is essential. For shallower architectures such as ResNet-18, the baseline already generalizes effectively, and FSSR primarily contributes through an improved domain gap rather than absolute accuracy gains.
4.9. Summary of Key Findings
The comprehensive experimental evaluation supports the following principal conclusions regarding the effectiveness of Feature Space Stability Regularization (FSSR) for brain tumor MRI classification:
(1) Source domain performance: FSSR improves classification performance for sufficiently deep architectures. ResNet-34 achieves the best overall results (97.71% accuracy; 97.55% macro-F1), with statistically significant gains for both ResNet-34 (+1.99%) and DenseNet-121 (+1.59%). The performance gains are primarily driven by reduced glioma–meningioma confusion.
(2) Cross-domain generalization: FSSR substantially enhances zero-shot transfer to the external BRISC-2025 dataset. ResNet-34 shows an 8.20% accuracy improvement (85.50% to 93.70%), corresponding to a 56% reduction in error rate. DenseNet-121 achieves the highest external accuracy (96.70%) with a minimal domain gap of 0.94%, without any target domain exposure during training.
(3) Calibration reliability: FSSR produces markedly better-calibrated predictions under domain shift. ResNet-34 reduces the Expected Calibration Error (ECE) from 0.1166 to 0.0400 (66% reduction), while DenseNet-121 achieves the best calibration overall (ECE = 0.0166).
(4) Architecture dependence: The benefits of FSSR scale with the model’s depth. Shallow networks (ResNet-18) show no significant improvement, whereas deeper architectures (ResNet-34 and DenseNet-121) consistently benefit, indicating that sufficient representational capacity is required to exploit feature space constraints.
(5) Mechanistic and clinical relevance: FSSR reduces feature embedding variance by 52–58% under intensity perturbations across all architectures. For ResNet-34 and DenseNet-121, these stability gains are accompanied by substantial generalization improvements, supporting a strong empirical association between the representation stability and cross-domain performance. However, the absence of generalization gains for ResNet-18 despite similar stability improvements indicates that this relationship is contingent on sufficient model capacity. The perfect no_tumor classification on the Kaggle test set (405/405) highlights the method’s potential for safe clinical deployment.
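The Expected Calibration Error quoted in finding (3) bins predictions by confidence and averages the |accuracy − confidence| gap, weighted by bin occupancy. A minimal sketch (the 15-bin setting is an assumed default, not stated in the text):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 15) -> float:
    """ECE: occupancy-weighted mean |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

# Toy case: the same 95%-confident predictions, first mostly right, then only half right.
ece_good = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 1])
ece_bad = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [0, 0, 1, 1])
print(ece_bad > ece_good)  # overconfident predictions produce a larger ECE
```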