1. Introduction
Brain diseases such as tumors, Alzheimer’s disease, multiple sclerosis, and stroke affect millions worldwide, leading to significant health and societal burdens [
1]. Magnetic resonance imaging (MRI) has become the central modality for studying these conditions due to its non-invasive nature, superior soft tissue contrast, and ability to capture diverse anatomical and physiological information across multiple sequences. While originally developed for clinical decision-making, the rapid expansion of publicly available MRI datasets has transformed neuroimaging into a data-driven domain where large-scale machine learning, particularly foundation models, now plays a pivotal role.
Foundation models, first established in natural language processing and computer vision [
2], are increasingly being explored for medical imaging [
3,
4,
5,
6,
7]. Their promise lies in learning generalizable representations from heterogeneous data and transferring them to a wide range of downstream tasks. However, the success of such models in brain MRI crucially depends on the availability of harmonized, large-scale datasets. Unlike other imaging domains, brain MRI suffers from high heterogeneity: multiple acquisition protocols, diverse sequence types (e.g., T1w, T2w, FLAIR, DWI), inconsistent annotations, fragmented repositories, and variable licensing terms. This fragmentation presents a unique challenge for developing general-purpose models.
A number of surveys and benchmarks have attempted to catalog medical imaging datasets, but they fall short in key ways when viewed from the perspective of brain MRI foundation models. For instance, MedSegBench [
8] aggregates 35 datasets across modalities yet includes only one brain MRI dataset and provides no analysis of voxel-level heterogeneity or preprocessing standards. Similarly, Dishner et al. [
9] cataloged 110 radiology datasets (49 brain MRI) but spanned too broad an anatomical scope to address brain-specific challenges such as sequence diversity and harmonization [
10]. Other focused reviews, e.g., on glioma datasets [
11,
12], provide valuable clinical and molecular context but rarely analyze imaging metadata (e.g., voxel resolution, intensity distributions, or missing modalities) that directly influence pretraining strategies. Even highly influential initiatives like the BraTS Challenge [
13,
14,
15,
16] have advanced reproducibility and benchmarking but rely on heavily preprocessed data, which reduces heterogeneity and thus limits real-world generalization. In short, prior surveys tend to be either too broad (spanning many anatomical domains) or too narrow (focusing on a single disease), and they often omit the image- and preprocessing-level variability most relevant for foundation model development.
This review addresses these gaps. We provide a structured and multi-level assessment of public brain MRI datasets with a specific focus on their suitability for foundation model training. Unlike prior works, we move beyond cataloging and explicitly quantify variability across dataset-level and image-level properties. We also evaluate the effects of preprocessing choices, which remain a largely underexplored source of covariate shift. Our analysis is designed to bridge the disconnect between dataset curation and model pretraining, highlighting practical considerations for building harmonized resources.
Our contributions are fourfold:
- (i)
Dataset-level review: We review 54 adult 3D structural brain MRI datasets covering 538,031 subjects in total. This includes detailed analysis of modality composition, disease coverage, dataset scale, and licensing diversity, revealing major imbalances between healthy and clinical populations that influence pretraining data design.
- (ii)
Image-level profiling: We perform a quantitative comparison of voxel spacing, image orientation, and intensity statistics across 14 representative datasets. This analysis exposes strong variation in geometric resolution and contrast distribution, which can affect how foundation models learn anatomical and pathological features.
- (iii)
Quantitative evaluation of preprocessing variability: We measure how bias field correction, intensity normalization, skull stripping, registration, and interpolation modify voxel-level statistics and geometry across datasets.
- (iv)
Feature-space analysis of residual covariate shift: Using a 3D DenseNet121, we quantify cross-dataset divergence that remains after full preprocessing, linking voxel-level variability to learned representations.
Together, these contributions provide the first structured review that unifies dataset-, image-, and preprocessing-level analyses, offering practical guidelines for building harmonized and generalizable brain MRI foundation models.
2. Review Methodology
2.1. Data Collection and Selection Process
We performed a structured search for publicly available brain MRI datasets between May and June 2025. Sources included Google, Google Dataset Search, PubMed, Scientific Data, and major neuroimaging repositories such as TCIA, OpenNeuro, NITRC, CONP Portal and Synapse. Search terms combined phrases such as “public brain MRI dataset,” “open access brain MRI,” “3D structural brain MRI for AI,” and “MRI segmentation dataset,” with variations replacing “dataset” by “database.” No date restrictions were applied. Each repository entry or publication was manually reviewed to determine eligibility, and the process was repeated iteratively until no new datasets were identified, achieving data saturation.
This review focused exclusively on datasets containing 3D structural MRI of the adult human brain. Datasets were included only if they satisfied all of the following criteria:
- (i)
volumetric 3D structural MRI scans were available (not 2D slices or statistical maps);
- (ii)
subjects were adults;
- (iii)
at least one structural modality (e.g., T1-weighted) was included, rather than only functional or diffusion modalities (e.g., fMRI, DTI, MRA);
- (iv)
acquisitions were 3D static volumes (not 4D dynamic or time-resolved scans);
- (v)
at least 20 unique 3D scans were provided.
For multimodal datasets that additionally included fMRI, DTI, PET, or clinical assessments, only the structural MRI scans were considered in this review.
2.1.1. Screening Outcome
Our search yielded more than one hundred candidate entries across repositories and publications. After removing duplicates and excluding pediatric-only cohorts, 2D or statistical map datasets, collections with fewer than 20 scans, and datasets without accessible images, a total of 54 datasets were retained. Together, these cover 538,031 subjects and form the basis of our review.
2.1.2. Standardization of Modalities and Cohort Labels
To enable consistent comparison across heterogeneous datasets, we standardized both imaging modalities and cohort labels. The detailed mapping rules are summarized in
Appendix A Table A1 (modalities) and
Table A2 (cohorts). These datasets span a broad range of neurological and psychiatric conditions alongside healthy controls and vary in imaging protocols, scanner characteristics, and subject demographics. To maintain readability, only representative datasets are shown here in
Table 1. A full version of this table with all 54 datasets is provided in
Appendix A,
Table A3.
2.1.3. Subset for Image-Level Analysis
Due to licensing restrictions and regional access limitations, only a portion of the identified datasets could be downloaded for direct inspection. We selected a group of datasets that, together, represent several major brain conditions seen in structural MRI, including brain tumors, multiple sclerosis, stroke, epilepsy, neurodegenerative diseases, and healthy controls. We also required clear and consistent NIfTI files and no obvious subject overlap, so that voxel-level measurements could be compared reliably. The selected datasets therefore offer a balanced and representative sample of clinical situations while keeping the analysis focused and tractable. To avoid redundancy, we excluded benchmark collections that merely aggregate scans from other public sources, retaining only the original datasets. The subset used for image-level profiling includes MSLesSeg [
17], MS-60 [
18], MSSEG-2 [
19], BraTS25-MET [
20], BraTS25-SSA [
21,
22], BraTS25-MEN [
23,
24,
25], ISLES22 [
26], EPISURG [
27], OASIS-1 [
28], OASIS-2 [
29], IXI [
30], UMF-PD [
31], NFBS [
32], and BrainMetShare [
33].
2.2. Metadata Extraction
To enable consistent cross-dataset analysis, we programmatically loaded each image file and extracted key metadata. For every scan, we recorded spatial attributes (image dimensions, voxel spacing, orientation codes, affine matrix) and non-image attributes (modality, subject ID, session ID when available). Images outside the inclusion scope, such as DTI sequences in IXI, were excluded at this stage. All extracted metadata were stored in standardized per-dataset CSV files following a uniform schema. This structured resource forms the foundation for subsequent dataset- and image-level analyses presented in this review and is designed to facilitate reproducibility and reuse by the wider community.
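Concretely, the spatial attributes recorded above can be derived from a NIfTI affine matrix alone. The sketch below (pure NumPy; the function names are ours for illustration, not from our pipeline) recovers voxel spacing as the column norms of the affine's 3×3 rotation block, and a three-letter orientation code from the dominant anatomical direction of each voxel axis:

```python
import numpy as np

def spacing_from_affine(affine):
    """Voxel spacing (mm) = column norms of the affine's 3x3 block."""
    return tuple(np.linalg.norm(affine[:3, :3], axis=0))

def orientation_code_from_affine(affine):
    """Three-letter axis code (e.g., 'RAS'): for each voxel axis, the
    anatomical direction it most strongly points toward."""
    labels = (("L", "R"), ("P", "A"), ("I", "S"))  # -/+ along world x, y, z
    code = ""
    for col in affine[:3, :3].T:           # one column per voxel axis
        world_axis = int(np.argmax(np.abs(col)))
        code += labels[world_axis][int(col[world_axis] > 0)]
    return code

# A canonical 1 mm RAS affine, as produced by many neuroimaging tools
affine = np.diag([1.0, 1.0, 1.0, 1.0])
print(spacing_from_affine(affine), orientation_code_from_affine(affine))
```

In practice, libraries such as nibabel expose equivalent utilities (e.g., `nib.aff2axcodes` and `img.header.get_zooms()`), which additionally handle oblique acquisitions.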
Computational Resources
Computations were performed on a workstation with an NVIDIA Quadro RTX 8000 (48 GB) running CUDA 12.5 and 128 GB RAM, using Python 3.10.12 for preprocessing and PyTorch 2.2.1 for feature extraction.
3. Dataset-Level Analysis
3.1. Disease Coverage
The disease distribution analysis shown in
Figure 1 reveals a pronounced imbalance across public brain MRI datasets. After separating combined cohort labels and removing the undefined “Multiple Diseases” category, Healthy subjects form the largest group, followed by Neurodegenerative disorders (approximately 8800 subjects) and Brain Tumors (around 8400 subjects). Medium-scale categories include Stroke (2300 subjects), Autism (2200 subjects), and Epilepsy (870 subjects). Smaller datasets correspond to Psychiatric Disorders (455 subjects), Multiple Sclerosis (319 subjects), and White Matter Hyperintensities (170 subjects).
This distribution highlights the structural bias of the open neuroimaging landscape. The abundance of healthy and neurodegenerative cohorts reflects the historical focus on population-based and aging studies, while chronic, diffuse, or subtle pathologies remain underrepresented. Despite the diversity of available datasets, the dominance of a few diagnostic categories implies that current public MRI data cannot fully capture the clinical heterogeneity of the brain. This skewed representation constrains comparative analysis across disease types and may perpetuate overrepresentation of high-resource conditions in future benchmarks.
For foundation models, the imbalance in disease coverage directly influences representational learning. Pretraining dominated by T1-weighted healthy and Alzheimer’s data encourages the model to learn structural regularities and global contrast variations, while subtle lesion characteristics typical of demyelinating or vascular diseases remain statistically rare. Such bias limits transferability to small-lesion or microstructural disorders. To mitigate this, pretraining datasets should deliberately balance disease composition, incorporate underrepresented conditions (e.g., MS, WMH, psychiatric disorders), and include healthy scans primarily as anatomical anchors. Transparent reporting of disease proportions is essential for understanding bias propagation during large-scale pretraining.
3.2. Dataset Scale
The analysis of dataset sizes (
Figure 2) exposes an extreme imbalance in the public brain MRI landscape. A single dataset, UK Biobank, accounts for more than 500,000 subjects, while nearly all other datasets range from a few dozen to a few thousand participants. Yet when examined alongside disease coverage, the relationship between scale and content becomes more revealing: the largest datasets are almost exclusively composed of healthy or aging populations, whereas smaller datasets concentrate on specific pathologies such as brain tumors, stroke, and multiple sclerosis. In other words, data abundance is inversely correlated with clinical complexity.
For foundation models, the insight from this scale–disease relationship is profound. Pretraining must not simply accumulate images—it must balance information density against population scale. Large healthy datasets can anchor the model’s low-level feature representation, but meaningful generalization arises only when smaller, heterogeneous clinical datasets are interleaved to inject structural variability and abnormal morphology. The optimal training corpus is therefore not the largest one, but the one that combines datasets across scales and disease domains in a way that maximizes representational complementarity.
When merging datasets, several considerations follow:
Sampling balance: Naive aggregation will cause population-scale datasets to dominate optimization; adaptive weighting or stratified sampling is necessary to preserve rare clinical features.
Harmonization: Resolution, voxel spacing, and intensity normalization must be aligned to prevent the model from interpreting acquisition differences as anatomical variations.
Domain alignment: Cross-dataset normalization in feature space (e.g., domain-adversarial training or latent alignment) can reduce the domain gap between healthy and disease cohorts.
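The sampling-balance point can be made concrete with inverse-frequency weights. The sketch below (dataset names hypothetical) assigns each scan a weight such that every dataset receives equal total probability mass, so a 1000-scan population study no longer drowns out a 10-scan clinical cohort:

```python
from collections import Counter

def balanced_weights(dataset_ids):
    """Per-scan sampling weights giving each dataset equal expected
    representation, regardless of its size."""
    counts = Counter(dataset_ids)
    n_datasets = len(counts)
    # each dataset gets total weight 1/n_datasets, split over its scans
    return [1.0 / (n_datasets * counts[d]) for d in dataset_ids]

# Toy corpus: one large "healthy" dataset dwarfing a small clinical one
ids = ["ukb"] * 1000 + ["ms"] * 10
w = balanced_weights(ids)
print(sum(w[:1000]), sum(w[1000:]))  # both datasets contribute ~0.5
```

Such per-scan weights can be passed directly to a weighted sampler (e.g., PyTorch's `WeightedRandomSampler`) during pretraining; intermediate schemes (e.g., weights proportional to a fractional power of dataset size) trade off between naive pooling and full equalization.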
The scale analysis reveals that the most informative foundation model will not come from the largest dataset, but from the strategic fusion of small, diverse datasets with large, stable ones. Quantity establishes the foundation; diversity defines intelligence. A model pretrained under this philosophy learns both the invariant anatomy of the healthy brain and the variable morphology of disease, achieving robustness not through volume but through representational balance.
3.3. Modality Composition
The modality co-occurrence analysis (
Figure 3) reveals distinct pairing patterns among structural MRI sequences across public datasets. The most frequent combination is between T1-weighted and FLAIR scans, followed by T1–T2 and T1–T1C pairs. These sequences commonly co-occur within multi-contrast structural datasets such as BraTS, ADNI, and MSSEG, where complementary contrasts are used to capture both anatomical boundaries and pathological hyperintensities. Moderate co-occurrence is also observed among FLAIR, T2, and T1C, indicating a tendency for lesion-focused studies to integrate multiple structural contrasts that highlight different tissue characteristics. In contrast, single-modality datasets remain prevalent, particularly among population studies (e.g., IXI, OASIS), which provide only T1-weighted scans.
This co-occurrence pattern demonstrates that public brain MRI datasets—though diverse—are structurally interlinked through a limited but consistent set of core modalities. The strong correlation between T1 and FLAIR availability reflects a shared acquisition strategy for anatomical delineation and lesion sensitivity, while the partial inclusion of T2 and T1C indicates dataset-specific clinical emphasis (e.g., edema or contrast enhancement). The heatmap also reveals that cross-dataset modality overlap is incomplete: no single dataset provides full structural coverage, and different combinations dominate different disease domains. This partial alignment introduces redundancy in some modalities but gaps in others when datasets are combined.
For foundation models trained on aggregated public datasets, these co-occurrence dynamics carry important consequences. The uneven intersection of modalities across datasets means that multi-contrast information is not uniformly available for all subjects. This heterogeneity can lead to modality imbalance during pretraining and complicate cross-dataset harmonization. To address this, foundation models must incorporate modality-aware mechanisms—such as learned modality embeddings or masked reconstruction objectives—that can leverage overlapping contrasts while remaining robust to missing ones. The observed co-occurrence structure also suggests that structural modalities share sufficient anatomical redundancy to enable joint representation learning: by training across datasets with partially overlapping contrasts (e.g., T1 + FLAIR from one source, T1 + T2 from another), the model can implicitly learn a unified structural feature space that generalizes across acquisition protocols. Consequently, modality co-occurrence is not merely a dataset property but a key enabler of scalable, harmonized pretraining across heterogeneous MRI corpora.
4. Image-Level Analysis
At the image level, heterogeneity in voxel geometry, orientation, and intensity introduces latent biases that can substantially affect representation learning. These properties define the physical scale, spatial consistency, and dynamic range of brain MRI data—factors that determine whether a foundation model learns anatomical invariants or dataset-specific artifacts. Our image-level analysis quantifies these factors across 14 public datasets and provides interpretative insights for model design and harmonization.
4.1. Voxel Spacing
Voxel spacing defines the physical size of each voxel along the x, y, and z axes in millimeters, determining how finely anatomical structures are represented in the image and directly influencing the learning behavior of foundation models. When voxel spacing varies across datasets, the same convolution or attention kernel covers different physical regions, leading to inconsistent representation of anatomical details, blurred or missing small lesions in thicker slices, and domain shifts when combining data. This makes voxel spacing not just a technical aspect of MRI acquisition but a key factor that shapes model generalization. It affects architectures differently: CNNs may learn biased features when scale changes, transformers can misalign patches or positional encodings, and SAM-style models often lose boundary accuracy when slices are uneven—making anisotropy a hidden source of error that limits transferability.
Figure 4 shows the 3D distribution of voxel spacings across 14 representative datasets. Most datasets cluster near isotropic spacing of approximately 1 mm along each axis, indicating uniform resolution. The three BraTS collections (BraTS-MET, BraTS-SSA, BraTS-MEN), OASIS-1/2, NFBS, and IXI fall into this group, providing consistent high-quality data for model pretraining. In contrast, multiple sclerosis datasets (MS-60, MSLesSeg, MSSEG-2) and BrainMetShare exhibit moderate anisotropy, combining fine in-plane resolution with markedly thicker slices along the z-axis. This reduces sensitivity to small or thin lesions that appear across only one or two slices. Stroke and surgical datasets, such as ISLES22 and EPISURG, show the widest variability, including cases with very thick slices and highly variable in-plane spacing. Such heterogeneity reflects differences in acquisition protocols across centers and scanners. Finally, mixed clinical datasets like UMF-PD and BrainMetShare include both near-isotropic and anisotropic scans, representing real-world diversity in clinical imaging practices. These observations lead to three key insights with direct implications for the development of foundation models: (i) many research datasets share near-isotropic resolution and are well-suited for standardized pretraining; (ii) clinical and disease-specific datasets tend to be anisotropic, introducing geometric inconsistencies that require explicit modeling; and (iii) spacing variability alone can cause measurable distribution shifts between datasets, even after resampling.
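A common remedy for spacing heterogeneity is resampling every volume to a shared target spacing at ingestion. A minimal sketch using `scipy.ndimage.zoom` (the 1 mm target and trilinear order are illustrative defaults, not the settings used in our analysis):

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_spacing(volume, spacing, target=(1.0, 1.0, 1.0), order=1):
    """Resample a 3D volume from `spacing` (mm/voxel) to `target` spacing
    using spline interpolation of the given order (1 = trilinear)."""
    factors = [s / t for s, t in zip(spacing, target)]
    return zoom(volume, factors, order=order)

# An anisotropic toy volume: 1 x 1 mm in-plane, 3 mm slices
vol = np.random.rand(64, 64, 20).astype(np.float32)
iso = resample_to_spacing(vol, spacing=(1.0, 1.0, 3.0))
print(vol.shape, "->", iso.shape)  # (64, 64, 20) -> (64, 64, 60)
```

Note that interpolation cannot recover detail lost to thick slices; upsampled volumes remain blurrier along the original through-plane axis, which is one reason spacing shifts persist even after resampling.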
To further characterize these differences, we grouped each image into three categories based on the degree of anisotropy. We computed, for each image, the ratio between the largest and smallest spacing values among the three axes. If all spacings were equal (ratio = 1.0), the image was labeled as isotropic. If the ratio was greater than 1.0 but less than 2.0, it was labeled mildly anisotropic. Ratios of 2.0 or higher were labeled highly anisotropic.
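The categorization rule just described maps directly to code (a small sketch; the 2.0 threshold follows the text):

```python
def anisotropy_category(spacing, mild_threshold=2.0):
    """Classify voxel spacing by the max/min ratio across the three axes."""
    ratio = max(spacing) / min(spacing)
    if ratio == 1.0:
        return "isotropic"
    if ratio < mild_threshold:
        return "mildly anisotropic"
    return "highly anisotropic"

print(anisotropy_category((1.0, 1.0, 1.0)))  # isotropic
print(anisotropy_category((0.9, 0.9, 1.2)))  # mildly anisotropic
print(anisotropy_category((1.0, 1.0, 5.0)))  # highly anisotropic
```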
As shown in
Table 2, most images fall into the isotropic or mildly anisotropic categories—7968 and 7152 images, respectively. However, over 1700 images are highly anisotropic, indicating substantial geometric distortion, especially in slice thickness. If left uncorrected, these differences can lead to biased model learning and performance degradation across datasets.
4.2. Orientation
The orientation of MRI volumes defines how the anatomical axes of the brain are mapped to the voxel coordinate system. Each MRI scan stores its orientation using a three-letter code (e.g., RAS, LAS, LPS), which specifies the direction of the x, y, and z axes relative to the patient’s anatomy. While orientation may appear as a technical metadata field, it has a direct and critical influence on the learning behavior of foundation models. When images are stored in inconsistent orientations across datasets, identical brain structures appear in different spatial locations or mirrored configurations. This leads to misalignment in anatomical correspondences, causing the model to learn orientation-specific patterns instead of generalizable anatomical features. Therefore, harmonizing orientation is essential for foundation models to learn consistent spatial representations that can generalize across diverse datasets.
Table 3 summarizes the orientation distribution across datasets. The most common orientation is RAS (6592 images), which is the standard convention in neuroimaging software such as FSL and FreeSurfer. However, a considerable number of datasets adopt alternative conventions, including LPS (5012 images) and LAS (3473 images). These three orientations together account for over 90% of all images analyzed. Notably, several datasets contain multiple orientations internally—for instance, BraTS-MET and EPISURG each include images in both RAS and LPS forms. Less frequent orientations such as RSA, PSR, or ASL are observed in smaller datasets (e.g., OASIS, NFBS, UMF-PD). The presence of such variability reflects the absence of a unified orientation policy among dataset providers, even within well-curated public repositories.
The observed orientation heterogeneity introduces a subtle but significant source of distributional shift that can impair model transferability. Models trained on mixed-orientation data without explicit normalization may implicitly encode orientation-specific spatial priors. For example, left–right inversions between RAS and LAS orientations can confuse the model’s learned feature alignment, leading to inconsistent activation patterns for homologous brain regions. Similarly, inconsistent superior–inferior axis definitions can distort 3D spatial context, reducing the model’s ability to capture global anatomical symmetry.
For foundation model pretraining, these inconsistencies compound across large-scale datasets. Since pretraining relies on learning generic spatial and structural representations, uncorrected orientation differences can fragment the learned latent space, causing the model to associate the same anatomy with distinct feature embeddings depending on orientation. This weakens the universality of learned representations and increases the burden on fine-tuning.
Hence, orientation harmonization is not merely a preprocessing detail but a foundational requirement for effective cross-dataset learning. Converting all volumes to a common convention (typically RAS) before model training ensures that spatial relationships are consistent across datasets. For large-scale pretraining pipelines, we recommend enforcing explicit orientation standardization as part of dataset ingestion. Such harmonization minimizes unnecessary domain shifts, allowing the foundation model to focus on learning biologically meaningful anatomy rather than orientation artifacts.
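For axis-aligned volumes, RAS standardization reduces to a permutation plus flips derived from the orientation code. A self-contained NumPy sketch follows (real pipelines typically work from the affine instead, e.g., nibabel's `as_closest_canonical`, which also handles oblique cases):

```python
import numpy as np

# Each letter: (anatomical axis it lies on, sign toward that letter)
AXIS = {"R": (0, +1), "L": (0, -1),
        "A": (1, +1), "P": (1, -1),
        "S": (2, +1), "I": (2, -1)}

def to_ras(volume, code):
    """Reorient an axis-aligned 3D volume whose array axes point toward
    the directions named by `code` (e.g., 'LPS') into RAS order."""
    perm, flips = [None, None, None], []
    for array_axis, letter in enumerate(code):
        anat_axis, sign = AXIS[letter]
        perm[anat_axis] = array_axis
        if sign < 0:
            flips.append(anat_axis)   # needs flipping after transpose
    out = np.transpose(volume, perm)
    for ax in flips:
        out = np.flip(out, axis=ax)
    return out

vol = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(to_ras(vol, "RAS"), vol)       # already RAS: unchanged
assert to_ras(vol, "LAS")[0, 0, 0] == vol[-1, 0, 0]  # left-right flip only
```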
4.3. Image Intensity Distribution
Image intensity represents the voxel-wise signal values within MRI scans and encapsulates the physical properties of tissues as captured by different imaging sequences. Intensity distributions are shaped by scanner hardware, acquisition protocols, and post-processing pipelines such as bias-field correction or intensity normalization. For foundation models, which depend on large-scale data aggregation from diverse sources, inconsistent intensity scaling or contrast profiles can substantially affect representation learning. A model trained on non-harmonized intensity profiles may implicitly overfit to dataset-specific brightness ranges, thereby reducing its ability to generalize across unseen domains.
Figure 5 illustrates the distribution of median voxel intensities across representative datasets. Datasets such as EPISURG, OASIS-1, OASIS-2, and IXI exhibit wide intensity variability, whereas others (e.g., the BraTS series, ISLES22, MSLesSeg, and BrainMetShare) show lower and more stable median values. This disparity likely arises from differences in scanner calibration, rescaling conventions (e.g., 0–255 versus z-scored), and preprocessing intensity normalization methods. The OASIS datasets, for example, show extensive dispersion with median intensities exceeding 300, reflecting a broad dynamic range and the absence of uniform scaling. In contrast, the BraTS and MS-related datasets exhibit tight clusters around zero, suggesting that bias correction and standardized normalization were consistently applied.
These differences have several implications for foundation model development. First, heterogeneous intensity distributions introduce latent biases that may lead a model to associate tissue contrast with dataset identity rather than underlying anatomy. This undermines the objective of learning scanner- and modality-invariant representations. Second, extreme intensity outliers—particularly in datasets with mixed acquisition conditions—can destabilize loss optimization during pretraining by distorting the input statistics used by normalization layers. Conversely, datasets with highly standardized intensity ranges, while beneficial for stable convergence, may limit the model’s exposure to real-world variability and thus reduce robustness during fine-tuning on unnormalized clinical data.
From a model design perspective, these findings highlight the importance of preprocessing-aware normalization strategies. Dynamic intensity scaling or adaptive histogram alignment could be implemented within the data loading pipeline to ensure consistent contrast across datasets. Alternatively, self-supervised objectives that promote intensity-invariant representations (e.g., histogram-matching augmentations or contrast consistency losses) may help the model decouple anatomical features from brightness variations. Ultimately, balancing intensity harmonization for stable training with sufficient distributional diversity for adaptability remains a key challenge for developing robust and generalizable MRI foundation models.
To quantitatively assess whether these intensity differences are statistically significant, we applied the Kruskal–Wallis H test to the per-image median values grouped by dataset. The result was highly significant, confirming that the observed inter-dataset variations are not due to random fluctuation. This non-parametric test evaluates whether at least one group differs in median from the others, without assuming a specific underlying distribution. The extremely low
p-value supports the visual findings in
Figure 5, indicating that intensity scaling differences across datasets are real, systematic, and substantial.
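The test itself is a one-liner with `scipy.stats.kruskal`. The sketch below mirrors the analysis on synthetic per-image medians (the values are illustrative, not our measured data) for three "datasets" with clearly different scaling conventions:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic per-image median intensities: z-scored vs. wide dynamic ranges
medians_by_dataset = {
    "brats": rng.normal(0.0, 1.0, size=80),
    "oasis": rng.normal(300.0, 60.0, size=80),
    "ixi":   rng.normal(150.0, 40.0, size=80),
}

h_stat, p_value = kruskal(*medians_by_dataset.values())
print(f"H = {h_stat:.1f}, p = {p_value:.3g}")
assert p_value < 0.001  # scaling differences are systematic, not noise
```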
5. Intra-Dataset Patient-Level Analysis
To understand how variability appears not only across datasets but also within a single dataset, we conducted patient-level analyses on three representative collections: MSLesSeg, BraTS-MET, and IXI. These datasets were chosen because they highlight different kinds of internal heterogeneity. MSLesSeg shows variation in the number of longitudinal timepoints per patient, BraTS-MET illustrates multi-center and multi-orientation effects even within a curated challenge dataset, and IXI demonstrates scanner-related differences in a healthy cohort. For each dataset, we extracted patient-level metadata to summarize characteristics that may affect model design, training stability, or evaluation reliability. A broader summary of patient-level heterogeneity across all datasets considered in this study—including variation in timepoints, sites, and scanner field strengths—is provided in
Table A4. The complete metadata in CSV format is available in the
Supplementary Materials.
In MSLesSeg, patients do not all have the same number of MRI timepoints. Most patients have only one timepoint, while others have two, three, or four. This means the dataset mixes mostly single-timepoint data with a smaller amount of longitudinal follow-up. Such differences matter because patients with several timepoints show how lesions change over time, while single-timepoint patients provide only a snapshot. Models that use temporal information need to account for this mixture to avoid biasing toward the much more common single-timepoint cases.
BraTS-MET, despite being a standardized competition dataset, exhibits its own form of intra-dataset variation due to its multi-center construction. The cohort aggregates scans from numerous institutions with differing acquisition practices, leading to variability in voxel geometry, intensity behavior, and image orientation. Orientation is one clear example of this internal diversity: while most patients follow the RAS convention, a substantial subset uses LPS, and a very small number use LAS. Such differences can affect both preprocessing and training, as inconsistent handling of axes may result in flipped anatomy, misaligned labels, or unintended orientation biases. Even in competition-grade datasets, this type of internal heterogeneity underscores the need for careful preprocessing and standardized orientation handling, especially when training large foundation models.
The IXI dataset illustrates yet another kind of internal variability. Although often treated as a uniform dataset, IXI contains images from different hospitals and MRI scanners. As shown in
Figure 6, the scanners differ in how many scans they contribute, the voxel spacing they use, their intensity distributions, and how much of the brain each scan covers. As a result, a model trained solely on IXI is still exposed to multiple acquisition styles rather than a single, consistent imaging domain.
Variation inside a single dataset is not unusual; similar effects appear in natural-image datasets, but MRI variation is more difficult for models to handle because voxel geometry and intensity scale are tied directly to the physical acquisition process. If such differences are not addressed, a model may learn scanner-specific or site-specific cues instead of anatomical patterns, leading to less robust and less generalizable representations. This risk is especially strong when one scanner or orientation dominates the dataset, allowing its acquisition style to overshadow minority subgroups.
In practice, medical foundation models benefit from simple but important preprocessing steps such as resampling to a common spacing, intensity normalization, orientation standardization, and balanced sampling across subgroups (e.g., scanners, orientations, or timepoint counts). These steps help the model focus on anatomy rather than acquisition identity and help reduce internal domain shifts. Addressing intra-dataset variability is necessary even before combining datasets, as meaningful differences already exist within each dataset on its own.
6. Evaluation of Preprocessing Effects on Image Harmonization
To systematically evaluate the impact of preprocessing on data harmonization, we randomly sampled images from the curated datasets and applied a standardized pipeline comprising bias-field correction, intensity normalization, skull stripping, and spatial registration. The resulting images were analyzed through voxel-wise statistical comparisons and qualitative visual inspection to assess improvements in inter-dataset consistency and anatomical fidelity.
6.1. Intensity Normalization
Intensity normalization is the process of adjusting MRI voxel values to a common scale so that images from different scanners or subjects become comparable. The most common techniques include z-score normalization, histogram matching, and WhiteStripe normalization. Z-score normalization rescales each image to have zero mean and unit variance, reducing intensity range differences; it is best used as a simple, general method when datasets are diverse or lack a consistent reference. Histogram matching aligns the intensity distribution of each image to that of a reference scan or template, making it ideal for multi-site datasets with large scanner or protocol variability. WhiteStripe normalization uses the intensity range of normal-appearing white matter to anchor scaling, which is most effective for brain studies where maintaining tissue contrast is important.
As summarized in
Table 4, the original voxel intensities span a wide range, reflecting strong contrast between bright enhancement regions and darker tissues. After applying z-score normalization, the intensity distribution becomes centered around zero with reduced variance, resulting in a more uniform and balanced appearance across tissues. However, this transformation also alters the visual contrast, as shown in
Figure 7: some brain regions appear brighter, while fine structural details become less pronounced. This effect occurs because z-score normalization rescales voxel values relative to the global mean and standard deviation, thereby compressing the overall dynamic range and reducing intensity extremes.
When building foundation models, intensity normalization should be applied consistently across all datasets to prevent artificial domain shifts. The chosen method must preserve relative tissue contrast while harmonizing global intensity ranges. It is also beneficial to expose the model to multiple normalization styles during pretraining, helping it learn invariance to contrast variations. Finally, combining preprocessing-based normalization with learnable normalization layers (e.g., instance or adaptive layer normalization) allows the model to adapt dynamically to unseen data while maintaining stable, harmonized feature representations.
6.2. Bias Field Correction
Bias field correction adjusts MRI images to remove gradual brightness variations caused by uneven magnetic fields or coil sensitivity. These variations make some regions look brighter or darker even when the tissue is the same, so correction helps make the intensity more uniform across the brain. Popular methods include N4ITK (N4 bias field correction), N3 (nonparametric nonuniform intensity normalization), and SPM’s unified segmentation approach. In this review, we applied N4ITK bias correction with modality-specific tuning using the SimpleITK implementation: adjustments included enhanced smoothing for FLAIR, brain masking for T1C images, and balanced settings for T1 and T2.
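The sketch below is not N4ITK itself, which fits a B-spline field; it illustrates the shared underlying model, in which a smooth multiplicative field is estimated (here by heavy Gaussian smoothing in log space) and divided out:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def remove_smooth_bias(img, sigma=15.0, eps=1e-6):
    """Conceptual stand-in for bias correction: in log space the
    multiplicative field becomes additive, so a heavily smoothed copy
    of the image approximates the field, which is then subtracted."""
    log_img = np.log(np.clip(img, eps, None))
    log_field = gaussian_filter(log_img, sigma)
    corrected = np.exp(log_img - log_field)
    # restore the original mean intensity level
    return corrected * img.mean() / corrected.mean()

def coefficient_of_variation(img):
    """std/mean -- the within-tissue variability metric reported above."""
    return img.std() / img.mean()
```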
Representative examples are shown in
Figure 8. The raw image (top-left) displays uneven brightness—the left side of the brain appears darker due to scanner-related field inhomogeneity. After correction (second panel, top row), the preprocessed image shows more uniform brightness across tissue regions, while the estimated bias field map (third panel, top row) captures the smooth multiplicative field responsible for this nonuniformity. The intensity histograms reveal that voxel intensities have shifted and become more compact, indicating reduced variation between bright and dark areas. The horizontal and vertical profiles show that peaks corresponding to white matter and gray matter are now closer in amplitude, confirming improved intensity consistency. The intensity correlation plot (r = 0.823) shows that the correction maintains overall intensity relationships but rescales them toward a more uniform distribution. Quantitatively, as shown in
Table 5, the coefficient of variation decreases (0.207 → 0.163), meaning intensity variability within tissue is reduced, while the signal-to-noise ratio (SNR) remains similar (6.87 → 6.50), suggesting correction did not distort contrast or amplify noise. The difference map highlights smooth intensity shifts, with no sharp artifacts.
While bias correction helps standardize input intensities for foundation model training, its effects vary with modality, anatomy, and pathology. Overcorrection may reduce lesion contrast or introduce distortions, while undercorrection can leave scanner-specific artifacts. Hence, visual and quantitative validation is essential, particularly when aggregating multi-source data.
6.3. Skull Stripping
The primary goal of skull stripping is to remove non-brain tissue, such as the skull, scalp, and dura mater, from the image. This is a critical step as these tissues have high-intensity signals that can interfere with intensity normalization and confuse segmentation algorithms. Common tools include FSL’s Brain Extraction Tool (BET) [
34], AFNI’s 3dSkullStrip [
35], and more recently, deep learning-based methods like HD-BET [
36], which often provide more accurate results. While most datasets in our analysis are provided pre-stripped (e.g., BraTS, ISLES22), the specific algorithm used often varies or is not documented, leading to subtle differences in the final brain mask.
Figure 9 illustrates the effect of skull stripping on a PD image from the IXI dataset, where non-brain tissues such as the scalp and skull are successfully removed, leaving only the intracranial structures for further analysis.
From a foundation model standpoint, skull stripping can influence both pretraining and downstream transfer. When training models across multiple datasets, consistent skull stripping helps reduce non-biological variability and ensures that the model focuses on relevant brain structures. However, inconsistency across datasets—where some scans are stripped and others are not—can lead to feature-space fragmentation, causing the model to learn dataset-specific biases rather than generalizable brain representations. Therefore, strict harmonization of preprocessing pipelines, including identical skull stripping tools, thresholds, and quality-control procedures, is essential.
Moreover, the choice to strip or retain the skull should align with the model’s target scope. For models designed to capture brain-centric features—such as lesion segmentation, cortical parcellation, or morphometric analysis—skull stripping is generally beneficial, as it directs attention to intracranial tissues. Conversely, for models intended to generalize across multi-modal or multi-organ contexts (e.g., MRI–CT alignment, PET fusion, or structural-to-functional transfer), removing the skull can limit cross-modality correspondence and reduce anatomical completeness. A practical strategy for large-scale foundation model pretraining is to include both stripped and unstripped variants of each scan and use metadata tags or preprocessing embeddings to inform the model about their origin. This dual representation encourages robustness to preprocessing differences and enables the model to learn invariance to skull presence—an increasingly important capability for generalizable medical foundation models.
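The dual-representation strategy above can be sketched minimally as follows, assuming a precomputed binary brain mask (e.g., produced by HD-BET) and an illustrative, non-standard record layout for the metadata tag:

```python
import numpy as np

def apply_brain_mask(img, mask):
    """Zero out non-brain voxels using a precomputed binary mask;
    producing the mask itself is left to a dedicated tool."""
    return np.where(mask > 0, img, 0)

def dual_variants(img, mask):
    """Keep stripped and unstripped versions of the same scan, tagged
    so a model can condition on (or learn invariance to) skull
    presence. The record layout here is illustrative only."""
    return [
        {"image": img, "skull_stripped": False},
        {"image": apply_brain_mask(img, mask), "skull_stripped": True},
    ]
```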
6.4. Spatial Registration to MNI152
Spatial registration aims to align MRI volumes into a common anatomical space, reducing spatial variability across datasets. Using a modality-aware ANTs pipeline with rigid–affine–SyN transformations, we aligned representative scans to the MNI152 template. This process standardizes brain geometry but also exposes how registration can reshape anatomical statistics in subtle, dataset-specific ways.
Figure 10 shows the registration effect for a T2-weighted image from BraTS-MEN. The aligned scan closely matches the MNI template, and quantitative metrics confirm high structural similarity (mutual information = 0.974, structural similarity = 0.641). However, resampling expanded the image volume by 26.9%, and the local correlation metric indicates that voxel intensity relationships were partially altered. Overlay maps and checkerboard comparisons highlight that most deviations occur near lesion borders and ventricles—regions where pathology or intensity nonuniformity interacts poorly with the template deformation.
These findings reveal an essential trade-off. Registration improves spatial consistency across datasets, supporting template-based feature extraction and patch sampling. Yet, excessive geometric forcing can distort pathological anatomy and attenuate lesion contrast, especially in heterogeneous clinical data. For foundation model pretraining, this suggests that full MNI normalization may be beneficial only for structural harmonization, while native-space training augmented with local spatial perturbations could better preserve disease-specific variability and improve cross-domain generalization.
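For reference, a mutual information score of the kind reported for the registration result can be computed from a joint intensity histogram. This is one common formulation, not necessarily the exact metric the ANTs pipeline reports:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information between two equally shaped
    images; higher values indicate stronger statistical dependence."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

An image compared with itself scores far higher than the same image compared with a voxel-shuffled copy, which is the behavior a registration metric exploits.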
6.5. Interpolation of Thin-Slice Volumes
Several clinical datasets, such as MS-60, contain scans with limited z-axis coverage or thick slices, producing anisotropic volumes that hinder 3D convolutional learning. To mitigate this, we applied an automated interpolation procedure that increases through-plane resolution while maintaining anatomical scale. This step is not simply geometric resampling—it directly determines how well small, low-contrast lesions are represented in 3D feature space.
Figure 11 illustrates a FLAIR image from the MS-60 dataset before and after interpolation. The original scan (13 slices) shows severe discontinuities and collapsed tissue boundaries, whereas the interpolated version (64 slices) restores smoother cortical contours and continuous sulcal structures without distorting global shape. Quantitatively, the effective slice thickness decreased by approximately 4.8×, enabling isotropic patch extraction for pretraining and consistent input dimensions across datasets.
From a foundation model perspective, interpolation functions as a structural equalizer: it harmonizes volumetric resolution across sources, improving patch uniformity and kernel receptive fields. However, it also generates synthetic voxels that may obscure very small hyperintensities or produce interpolation artifacts along lesion edges. Thus, interpolation should be applied selectively—preferably on high-anisotropy datasets or in conjunction with uncertainty-aware augmentations—to balance geometric consistency and lesion fidelity.
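The through-plane upsampling described above can be sketched with linear interpolation along the z-axis; the actual pipeline may use higher-order splines, but the geometry is the same:

```python
import numpy as np

def interpolate_slices(vol, target_slices):
    """Linearly interpolate the through-plane (z) axis of a (Z, H, W)
    volume to exactly `target_slices`, leaving in-plane resolution
    untouched."""
    z_new = np.linspace(0, vol.shape[0] - 1, target_slices)
    lo = np.floor(z_new).astype(int)              # lower neighbor slice
    hi = np.minimum(lo + 1, vol.shape[0] - 1)     # upper neighbor slice
    w = (z_new - lo)[:, None, None]               # interpolation weight
    return (1 - w) * vol[lo] + w * vol[hi]
```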
7. Residual Covariate Shift After Preprocessing
Despite standardized preprocessing, inherent heterogeneity in MRI data from diverse sources introduces residual covariate shift, which can impede the generalizability of deep learning models. This shift appears as subtle variations in noise patterns, intensity scaling, and remaining artifacts that preprocessing cannot fully remove. To examine this effect, we analyzed T1-weighted MRI scans of healthy subjects from two public datasets—NFBS (125 images) and a subset of IXI (54 images)—after applying a uniform pipeline consisting of skull stripping, N4 bias field correction, MNI152 registration, and intensity normalization. A single central axial slice was extracted from each volume to reduce computational cost while preserving key contrast differences. To characterize the remaining variability in feature space, we used DenseNet121 pretrained on ImageNet as a fixed feature extractor. This model offers a neutral and widely adopted representation that does not depend on either dataset, allowing observed differences to reflect genuine dataset variation rather than model training effects. From this network, 1024-dimensional feature vectors were obtained from the penultimate layer without fine-tuning. Quantitative assessment of the resulting feature differences between datasets is summarized in
Table 6.
Despite a high cosine similarity, indicative of similar vector directions, a substantial Euclidean distance and average Wasserstein distance highlight significant shifts in the magnitude and distribution of features. Statistical analysis further confirmed this divergence: 83.89% of all features exhibited statistically significant differences after Bonferroni correction. These findings demonstrate that standard preprocessing is insufficient for complete MRI data harmonization. The persistent residual covariate shift in the learned feature space critically impairs model robustness and transferability across unseen domains. Therefore, developing and implementing explicit domain adaptation strategies—such as disentangled representation learning, meta-learning for domain generalization, and robust uncertainty estimation—is paramount for building truly generalizable and clinically reliable models. This is particularly crucial for the advancement of foundation models in high-stakes medical imaging applications.
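The summary metrics used in this comparison can be sketched as follows, assuming two feature matrices of shape (n_samples, n_dims); the exact definitions in the analysis may differ:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def feature_shift_metrics(feats_a, feats_b):
    """Compare two datasets' feature matrices: cosine similarity and
    Euclidean distance between the mean feature vectors, plus the mean
    per-dimension 1D Wasserstein distance between the distributions."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cos = float(mu_a @ mu_b / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b)))
    euc = float(np.linalg.norm(mu_a - mu_b))
    w = float(np.mean([wasserstein_distance(feats_a[:, j], feats_b[:, j])
                       for j in range(feats_a.shape[1])]))
    return {"cosine": cos, "euclidean": euc, "wasserstein": w}
```

Note how a pure magnitude shift leaves cosine similarity near 1 while the Euclidean and Wasserstein terms grow, which is exactly the pattern of divergence described above.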
8. Practical Considerations for Using Public Brain MRI Datasets
When combining open-source datasets from many public repositories—many of which contain raw scans with artifacts and large differences in acquisition—machine learning models often struggle to learn useful patterns. Without consistent preprocessing, the noise and inconsistencies make training less effective. In addition, because hospitals and imaging centers use different storage formats and protocols, patient data can be split across systems, duplicated, or incomplete. These issues create misleading patterns and hide important clinical signals, even for advanced models.
To reduce these problems, researchers should standardize all scans with steps such as resampling to isotropic voxel spacing, intensity normalization, and consistent orientation. For datasets collected from multiple sites or using different protocols, it is important to use consistent sequence names, keep scanner and site information as metadata, and apply corrections like z-score normalization or bias-field correction when needed. Missing modalities should be handled with flexible model designs (e.g., modality-aware embeddings or placeholder channels). Balanced sampling or stratified batches can help prevent models from overfitting to specific subgroups in diverse datasets. Together, these choices support the creation of strong and reliable pretraining datasets for brain MRI foundation models.
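The placeholder-channel idea for missing modalities can be sketched minimally as follows; the canonical channel order here is hypothetical, chosen only for illustration:

```python
import numpy as np

# illustrative canonical channel order; a real pipeline would fix its own
MODALITIES = ["T1w", "T2w", "FLAIR", "DWI"]

def stack_with_placeholders(scans, shape):
    """Build a fixed-order multi-channel input from whatever modalities
    are available; missing ones get a zero placeholder channel plus a
    presence flag the model can read."""
    channels, present = [], []
    for m in MODALITIES:
        if m in scans:
            channels.append(scans[m])
            present.append(1.0)
        else:
            channels.append(np.zeros(shape, dtype=np.float32))
            present.append(0.0)
    return np.stack(channels), np.asarray(present, dtype=np.float32)
```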
Beyond preprocessing, strict standardization and quality checks—including automated error detection—are essential. Expert clinical validation is also critical. Statistical improvements alone are not enough: a model that performs well on one dataset may fail—or even cause harm—when applied in a different clinical setting.
9. Limitations
This review has several limitations that reflect the scope we intentionally defined for this study. First, we focused exclusively on adult structural MRI datasets. This allowed us to compare voxel geometry, intensity behavior, and preprocessing effects in a consistent way, but it means that diffusion MRI, fMRI, quantitative MRI, and pediatric datasets were not included. These modalities and populations involve different acquisition characteristics and would require separate, modality-specific analyses.
Second, the landscape of publicly available brain MRI datasets is shaped by what institutions are able and willing to release. Most datasets come from Western or other highly developed regions, which creates a geographic imbalance that reflects the current availability of open data rather than a choice made in this review. Public datasets also carry selection bias: they are often collected in research-oriented or well-resourced clinical environments, and therefore tend to include higher-quality scans and focus on certain conditions such as Alzheimer’s disease, brain tumors, and healthy adults. Other clinical populations—such as psychiatric disorders, multiple sclerosis, vascular diseases, and routine lower-quality scans—are less represented or absent. As a result, the dataset-level and image-level patterns observed here may not fully generalize to the broader clinical landscape.
Finally, this study did not evaluate annotation quality or perform model benchmarking. These analyses are important for understanding how dataset variability affects downstream performance but fall outside the scope of the present work. Future studies could extend this review by examining annotation consistency, including more diverse datasets, and assessing how different pretraining corpora influence model behavior.
10. Conclusions & Discussion
In this study, we conducted a structured, multi-level analysis of 54 publicly accessible adult structural brain MRI datasets to characterize the variability most relevant to foundation model development. At the dataset level, our review highlights substantial imbalance in both scale and diagnostic coverage: large healthy and aging cohorts dominate the landscape, whereas clinically complex populations such as stroke, multiple sclerosis, and psychiatric disorders remain comparatively underrepresented. This skewed distribution implies that naïvely aggregated pretraining corpora may disproportionately reflect healthy anatomical priors while providing limited exposure to subtle or heterogeneous pathology.
Image-level profiling further revealed considerable heterogeneity in voxel spacing, orientation conventions, and intensity distributions across datasets. Although many research collections share near-isotropic resolution, clinical datasets frequently exhibit strong anisotropy and mixed orientation formats. Intensity statistics likewise vary systematically between datasets, as confirmed by the Kruskal–Wallis analysis. These findings indicate that foundational representation learning is shaped not only by biological variation but also by acquisition- and site-dependent factors that can introduce measurable covariate shifts.
Our quantitative evaluation of preprocessing pipelines demonstrates that standard steps such as bias-field correction, intensity normalization, skull stripping, registration, and interpolation improve within-dataset consistency but do not fully harmonize cross-dataset distributions. The feature-space assessment using an ImageNet-pretrained DenseNet121 supports this observation: even after full preprocessing, non-trivial divergence persists between datasets, suggesting that harmonization must extend beyond conventional preprocessing. This underscores the need for preprocessing-aware architectures, modality-robust sampling, and domain adaptation strategies capable of handling real-world variability.
Together, our analyses provide an integrated view of dataset-, image-, and preprocessing-level variability and outline practical considerations for developing harmonized, robust, and generalizable brain MRI foundation models.