Dataset-Aware Preprocessing for Hippocampal Segmentation: Insights from Ablation and Transfer Learning

Khan, Faizaan Fazal; Kim, Jun-Hyung; Kim, Ji-In; Kwon, Goo-Rak

doi:10.3390/math13203309

Department of Information and Communication Engineering, Chosun University, Dong-gu, Gwangju 61452, Republic of Korea

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(20), 3309; https://doi.org/10.3390/math13203309

Submission received: 22 August 2025 / Revised: 27 September 2025 / Accepted: 13 October 2025 / Published: 16 October 2025

(This article belongs to the Special Issue The Application of Deep Neural Networks in Image Processing)

Abstract

Accurate hippocampal segmentation in 3D MRI is essential for neurodegenerative disease research and diagnosis. Preprocessing pipelines can strongly influence segmentation accuracy, yet their impact across datasets and in transfer learning scenarios remains underexplored. This study systematically compares a No Preprocessing (NP) pipeline and a Full Preprocessing (FP) pipeline for hippocampal segmentation on the EADC-ADNI HarP clinical dataset and the multi-site MSD dataset using a 3D U-Net with residual connections and dropout regularization. Evaluations employed standard overlap metrics, Hausdorff Distance (HD), and Wilcoxon signed-rank tests, complemented by qualitative analysis. Results show that NP consistently outperformed FP in Dice, Jaccard, and F1 metrics on HarP (e.g., Dice 0.8876 vs. 0.8753, p < 0.05), while FP achieved superior HD, indicating better boundary precision. Similar trends emerged in transfer learning from MSD to HarP, with NP improving overlap measures and FP maintaining lower HD. To test whether the findings generalize across architectures, the experiments on the HarP dataset were repeated with a 3D V-Net backbone, which reproduced the same trend. Comparative analysis with recent studies confirmed the competitiveness of the proposed approach despite lower input resolution and reduced model complexity. These findings highlight that preprocessing choice should be tailored to dataset characteristics and the target evaluation metric. The results provide practical guidance for selecting segmentation workflows in clinical and multi-center neuroimaging applications.

Keywords:

3D U-Net; hippocampal segmentation; magnetic resonance imaging (MRI); preprocessing pipelines; transfer learning; medical image analysis; V-Net

MSC:

68T07; 68U10; 68T05

1. Introduction

Accurate segmentation of the hippocampus in three-dimensional MRI scans is critical for early diagnosis and monitoring of Alzheimer’s disease. Automated methods based on three-dimensional (3D) U-Net architectures often achieve high performance in well-preprocessed research datasets. For example, DeepHipp, a 3D dense-block network with attention mechanisms, reported a Dice score of approximately 0.836 on curated multi-center data [1]. Similarly, our previous work using a resource-efficient 3D U-Net on the HarP clinical dataset achieved a Dice score of 0.88 [2].

Although many 3D-like U-Net variants have been proposed in broader areas such as volumetric or video data processing, we focus here on U-Net architectures that are directly applied to hippocampal segmentation. Using this established baseline ensures fair comparison with recent works and allows us to isolate the effect of preprocessing pipelines without introducing additional variables from more complex architectures.

The success of these models often depends on complex preprocessing pipelines. These pipelines typically include skull stripping, intensity normalization, contrast-limited adaptive histogram equalization (CLAHE), and wavelet-based denoising. While these steps aim to enhance anatomical details and reduce image noise, their individual contributions to model performance have not been systematically evaluated. To our knowledge, no ablation studies have isolated the impact of full preprocessing (FP) versus no preprocessing (NP), especially across different datasets [3].

Another challenge arises from domain shifts between datasets. Images from different scanners, sites, or protocols exhibit substantial variability, making models trained on one set less effective on others. Although transfer learning has been proposed to address this issue, the interaction between preprocessing and model adaptation remains unclear. A CapsNet-3D model recently achieved Dice scores up to 0.92 on Alzheimer’s classification tasks, showing promise for domain robustness, but it did not analyze the role of preprocessing in that context [4].

In addition to these examples, a number of recent studies have advanced hippocampal segmentation through variations of U-Net and related architectures. Yang et al. [5] proposed an improved 3D U-Net with attention mechanisms, reporting Dice scores above 0.86 on multi-center MRI data. Widodo et al. [6] applied transfer learning with volumetric U-Net models, demonstrating that pretrained weights improved performance when adapting across datasets. Qiu et al. [7] introduced a multitask edge-aware framework that enhanced boundary delineation in hippocampal segmentation. Prajapati and Kwon [8] developed SIP-UNet, which processed sequential inputs in parallel to improve brain tissue segmentation more broadly, while Sghirripa et al. [9] evaluated deep learning and subfield-specific methods for hippocampal segmentation across multiple protocols. Collectively, these works highlight progress in network design and training strategies. However, relatively few studies have systematically assessed how preprocessing pipelines—such as normalization, histogram equalization, or wavelet filtering—affect segmentation outcomes across datasets.

Our study addresses these gaps by performing a rigorous ablation analysis of preprocessing pipelines and their effect on both in-domain and cross-domain performance. We compare FP (CLAHE and 3D wavelet filtering) with NP (raw MRI volumes with normalization only) across two datasets: HarP (clinical) and MSD (multi-center). We also evaluate how preprocessing affects transfer learning performance by fine-tuning MSD-trained models on HarP under both pipelines.

We measure outcomes using standard metrics, including the Dice coefficient, Jaccard index, Hausdorff distance, and over- and under-segmentation ratios (OSR and USR). We validate results with Wilcoxon signed-rank tests and provide qualitative overlays of segmentation outputs to illustrate visual differences.

To our knowledge, this work offers the first controlled evaluation of how preprocessing affects both segmentation performance and cross-domain adaptability. Our findings are intended to guide practical preprocessing choices for deep learning-based hippocampal segmentation in real-world clinical and research settings.

2. Materials and Methods

2.1. Datasets

We used two MRI datasets for hippocampus segmentation. The EADC-ADNI Harmonized Hippocampal Protocol (HarP) dataset [10,11] includes clinical T1-weighted MRIs. To focus on the region of interest and standardize model inputs, all images and labels were resized to a shape of (64, 64, 96). Voxel intensities range from 169.0 to 6409.0. Labels are binary: 0 for background, 1 for hippocampus. Figure 1 shows an image slice (slice 67) and its label, highlighting the hippocampus location. Figure 2 shows the intensity histogram.

The Medical Segmentation Decathlon (MSD) dataset [12] is multi-center and contains 260 volumes with varying original shapes. To standardize input, we padded all images and labels to a unified shape of (43, 59, 47), the maximum extent across volumes. Labels were binarized by merging the anterior (1) and posterior (2) hippocampus classes (Figure 3) into a single class (1 for hippocampus).

Summary statistics across MSD volumes show:

1. Mean intensity: ~29,412
2. Standard deviation: ~13,406
3. Minimum intensity: 0
4. Maximum intensity: ~1.7 million
5. Label distribution: ~94.8% background and ~5.2% hippocampus post-binarization

Aligning image dimensions and label formats makes both datasets compatible. This setup supports both within-domain analysis and cross-domain transfer learning under consistent preprocessing conditions.
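The shape alignment and label merging described for MSD can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names `pad_to_shape` and `binarize_labels` are hypothetical, trailing zero padding is an assumption (the text does not state where the padding is placed), and the toy arrays hold random values rather than MSD data.

```python
import numpy as np

TARGET_SHAPE = (43, 59, 47)  # unified MSD shape stated in the text


def pad_to_shape(volume, target_shape=TARGET_SHAPE):
    """Zero-pad a 3D array at the end of each axis up to target_shape.

    Assumption: trailing padding; the article does not specify the placement.
    """
    if any(s > t for s, t in zip(volume.shape, target_shape)):
        raise ValueError("volume is larger than the target shape")
    pad_width = [(0, t - s) for s, t in zip(volume.shape, target_shape)]
    return np.pad(volume, pad_width, mode="constant", constant_values=0)


def binarize_labels(label):
    """Merge anterior (1) and posterior (2) classes into one class (1)."""
    return (label > 0).astype(np.uint8)


# Toy volume and label map (random values, not real MSD data).
rng = np.random.default_rng(42)
image = rng.random((35, 51, 38)).astype(np.float32)
label = rng.integers(0, 3, size=(35, 51, 38))

padded_image = pad_to_shape(image)
padded_label = binarize_labels(pad_to_shape(label))
print(padded_image.shape, np.unique(padded_label))
```

Padding before binarizing keeps the padded region labeled as background (0), which matches the label convention described above.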

Detailed instructions for downloading and organizing the HarP dataset are provided in the Supplementary Materials.

2.2. Preprocessing Pipelines

We used two preprocessing pipelines to prepare the 3D MR images for segmentation: a Full Preprocessing (FP) pipeline and a No Preprocessing (NP) pipeline. Both were designed to ensure consistent input size and label formatting across datasets. The detailed steps for each pipeline are summarized in Table 1.

The preprocessing techniques are broken down into the following major steps:

1. CLAHE (Contrast-Limited Adaptive Histogram Equalization) [13]

a. Scale intensities to [0, 1].

I_{\mathrm{normalized}}(x, y, z) = \frac{I(x, y, z) - \min(I)}{\max(I) - \min(I)}

(1)

b. Mask empty background.

I_{\mathrm{masked}}(x, y, z) = I_{\mathrm{normalized}}(x, y, z) \times \mathrm{mask}(x, y, z)

(2)

c. Apply 3D CLAHE with clip limit 0.03 to enhance contrast.

I_{\mathrm{CLAHE}}(x, y, z) = \mathrm{CLAHE}(I_{\mathrm{masked}}(x, y, z), \mathrm{clip\_limit})

(3)

2. SCE-3DWT (Selective Coefficient-Enhanced 3D Wavelet Transform)

a. Zero out voxels below the threshold T = 0.1.

I(x, y, z) = \begin{cases} I(x, y, z), & \text{if } I(x, y, z) \geq T \\ 0, & \text{if } I(x, y, z) < T \end{cases}

(4)

b. Perform 3-level wavelet decomposition (Coiflet5).

\mathrm{coeffs} = \{C_{\mathrm{approx}}, C_{\mathrm{detail}}^{1}, C_{\mathrm{detail}}^{2}, \dots, C_{\mathrm{detail}}^{\mathrm{level}}\}

(5)

c. Multiply the detail coefficients by an enhancement factor of 0.8 and the approximation coefficients by an enhancement factor of 0.4.

C_{\mathrm{detail}}^{i} = C_{\mathrm{detail}}^{i} \times \mathrm{enhance\_factor}

(6)

C_{\mathrm{approx}} = C_{\mathrm{approx}} \times \mathrm{enhance\_factor}

(7)

d. Reconstruct the volume and restore threshold.

I_{\mathrm{reconstructed}} = \mathrm{DWT}^{-1}(\mathrm{coeffs})

(8)

3. Padding/Cropping

a. HarP images are resized to (64, 64, 96); MSD images are padded to (43, 59, 47).
b. This ensures all volumes fit the 3D U-Net input requirement.

4. Normalization & Binarization

a. Images are converted to float32 and normalized.
b. Labels are binarized; for MSD, the anterior and posterior hippocampus classes are combined into one class.
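The wavelet step (step 2) can be sketched with PyWavelets, whose `pywt.wavedecn` and `pywt.waverecn` functions perform multilevel n-dimensional decomposition and reconstruction. This is a hedged illustration, not the authors' implementation: the function name `sce_3dwt` is hypothetical, the reading of "restore threshold" as re-zeroing the voxels removed in step a is an assumption, and so is cropping the reconstruction back to the input shape. The CLAHE step is not shown here.

```python
import warnings

import numpy as np
import pywt


def sce_3dwt(volume, threshold=0.1, level=3, wavelet="coif5",
             detail_factor=0.8, approx_factor=0.4):
    """Threshold, decompose, scale coefficients, and reconstruct a 3D volume."""
    vol = np.asarray(volume, dtype=np.float64)

    # Step a: zero out voxels below the threshold.
    vol = np.where(vol >= threshold, vol, 0.0)

    # Step b: multilevel 3D decomposition. coif5 is a long filter, so
    # PyWavelets warns about boundary effects on small volumes at level 3.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        coeffs = pywt.wavedecn(vol, wavelet=wavelet, level=level)

    # Step c: scale the approximation array and every detail sub-band.
    coeffs[0] = coeffs[0] * approx_factor
    for detail in coeffs[1:]:
        for key in detail:
            detail[key] = detail[key] * detail_factor

    # Step d: inverse transform. The output can be one voxel longer along
    # odd-sized axes, so crop it back to the input shape (assumption).
    recon = pywt.waverecn(coeffs, wavelet=wavelet)
    recon = recon[tuple(slice(0, s) for s in vol.shape)]

    # Assumed meaning of "restore threshold": keep removed voxels at zero.
    recon[vol == 0] = 0.0
    return recon


# Toy volume in [0, 1] (random values, not MRI data).
rng = np.random.default_rng(0)
toy = rng.random((16, 20, 24))
filtered = sce_3dwt(toy)
print(filtered.shape)
```

Because both factors are below 1, the sketch attenuates all coefficients, with the approximation band reduced more strongly than the detail bands; this shifts the relative weight toward edge-like detail.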

For validation and visualization, we processed three sample volumes from each dataset. The FP pipeline increased slice SNR from approximately 1.89 to 2.04 on HarP (Figure 4) and from 2.27 to 2.60 on MSD (Figure 5). Histograms indicate sharper peaks, and edge comparisons reveal that SCE-3DWT preserves structural details more effectively than CLAHE.
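The text does not state how slice SNR was computed. A common simple estimate is the ratio of the mean to the standard deviation of the slice intensities; the sketch below uses that definition as an assumption, and the function name `slice_snr` is hypothetical.

```python
import numpy as np


def slice_snr(slice_2d):
    """Mean divided by standard deviation of one 2D slice (assumed definition)."""
    values = np.asarray(slice_2d, dtype=np.float64)
    std = float(values.std())
    if std == 0.0:
        return float("inf")  # a constant slice has no measurable noise
    return float(values.mean()) / std


# Toy slice: mean 2.0 and standard deviation 1.0.
toy_slice = np.array([[1.0, 3.0], [1.0, 3.0]])
print(slice_snr(toy_slice))
```

Other definitions (for example, signal mean over background standard deviation) would give different absolute values, so SNR figures are only comparable when computed the same way before and after preprocessing.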

By restricting NP to normalization and resizing only, and matching dataset labels, we isolate the effects of contrast enhancement and wavelet-based filtering. This setup enables a clear evaluation of preprocessing benefits in downstream segmentation and transfer learning tasks.
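For contrast with FP, the intensity handling in NP reduces to a single normalization step. The sketch below assumes min-max scaling as in Equation (1); the text only says NP volumes are "normalized", so the exact method is an assumption, and `min_max_normalize` is a hypothetical name.

```python
import numpy as np


def min_max_normalize(volume):
    """Convert to float32 and scale intensities to [0, 1] as in Equation (1)."""
    vol = np.asarray(volume, dtype=np.float32)
    lo, hi = vol.min(), vol.max()
    if hi == lo:
        return np.zeros_like(vol)  # avoid division by zero on constant input
    return (vol - lo) / (hi - lo)


# Toy values spanning the HarP intensity range quoted in Section 2.1.
toy = np.array([169.0, 3289.0, 6409.0])
normalized = min_max_normalize(toy)
print(normalized)
```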

2.3. Three-Dimensional U-Net Model Architecture and Training

We employed a custom 3D U-Net architecture for binary segmentation of the hippocampus (Figure 6). The model follows the standard encoder–decoder design, enhanced with residual connections, instance normalization, and dropout for stability and regularization.

The encoder consists of five convolutional blocks. Each block applies two 3D convolutions followed by Instance Normalization, ReLU activation, and Dropout (abbreviated as IRD). Residual connections within each block improve gradient flow and reduce the risk of vanishing gradients. After each block, a 2 × 2 × 2 max-pooling layer halves the spatial resolution while doubling the number of channels, increasing representational capacity (progression: 32 → 64 → 128 → 256 → 512).
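One encoder block of this kind might look like the following PyTorch sketch. It is illustrative only: the class name `ResidualIRDBlock`, the 3 × 3 × 3 kernel size, the dropout rate, the placement of IRD after each convolution, and the 1 × 1 × 1 projection on the shortcut are all assumptions, since the text does not specify them.

```python
import torch
from torch import nn


class ResidualIRDBlock(nn.Module):
    """Two 3D convolutions, each followed by InstanceNorm, ReLU, and Dropout,
    wrapped in a residual connection."""

    def __init__(self, in_channels, out_channels, dropout=0.1):
        super().__init__()

        def conv_ird(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.InstanceNorm3d(cout),
                nn.ReLU(),
                nn.Dropout3d(dropout),
            )

        self.body = nn.Sequential(
            conv_ird(in_channels, out_channels),
            conv_ird(out_channels, out_channels),
        )
        # Project the input when channel counts differ so the sum is valid.
        if in_channels == out_channels:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)


# Small toy input: (batch, channels, depth, height, width).
block = ResidualIRDBlock(1, 32)
pool = nn.MaxPool3d(kernel_size=2)
features = block(torch.randn(1, 1, 16, 16, 24))
pooled = pool(features)
print(features.shape, pooled.shape)
```

The padding of 1 keeps the spatial size unchanged inside the block, so only the pooling layer changes resolution.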

At the bottleneck layer, features are processed at 512 channels to capture high-level spatial context before reconstruction in the decoder.

The decoder mirrors the encoder using 2 × 2 × 2 transposed convolutions for upsampling. Skip connections link encoder and decoder blocks at the same resolution, allowing high-resolution spatial information to be preserved during reconstruction. Each decoder block applies two convolutional layers with IRD, progressively reducing the number of channels (512 → 256 → 128 → 64 → 32). A final 1 × 1 × 1 convolution maps the output to a single channel, producing the binary hippocampal segmentation mask.
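To make the resolution and channel bookkeeping concrete, the sketch below traces a HarP-sized input through the encoder. It assumes four pooling steps, with the fifth block acting as the bottleneck without further pooling; the text is not explicit on this point, and `trace_encoder` is a hypothetical helper.

```python
def trace_encoder(input_shape=(64, 64, 96), channels=(32, 64, 128, 256, 512)):
    """Return (channels, spatial_shape) for each encoder stage."""
    shape = tuple(input_shape)
    stages = []
    for index, ch in enumerate(channels):
        stages.append((ch, shape))
        # Assumption: 2 x 2 x 2 pooling follows every block except the last.
        if index < len(channels) - 1:
            shape = tuple(s // 2 for s in shape)
    return stages


for ch, shape in trace_encoder():
    print(ch, shape)
# 32 (64, 64, 96)
# 64 (32, 32, 48)
# 128 (16, 16, 24)
# 256 (8, 8, 12)
# 512 (4, 4, 6)
```

Under this assumption the decoder's four transposed convolutions double each dimension back from (4, 4, 6) to (64, 64, 96), so encoder and decoder features match at every skip connection.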

This design balances accuracy with efficiency: residual connections and normalization enhance convergence stability, skip connections retain fine anatomical details, and dropout provides regularization. Overall, the architecture has 22.6 million trainable parameters, occupies ~86 MB in memory, and processes a single MRI volume in approximately 72.5 ms, making it suitable for clinical workflows.

2.3.1. Training and Setup of 3D U-Net Model

All experiments were implemented in PyTorch 1.12.1 (the +cu113 build, compiled against CUDA 11.3) using a batch size of 2. We optimized the model with Adam (learning rate = 1 × 10⁻³, weight decay = 1 × 10⁻⁴) and applied a ReduceLROnPlateau scheduler (factor 0.1, patience = 2). Early stopping was triggered after 7 epochs without validation improvement, with a maximum of 250 epochs. The random seed was fixed at 42 for reproducibility. Training and transfer learning experiments were executed on two NVIDIA Quadro P4000 GPUs (8 GB each).
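The interaction between the plateau scheduler and early stopping can be illustrated with a small pure-Python replay of validation losses. This is a simplified re-implementation for illustration, not PyTorch's `ReduceLROnPlateau`: it treats any strict decrease as an improvement, monitors validation loss (the monitored quantity for the U-Net is an assumption), and the function name `run_schedule` is hypothetical.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.1, lr_patience=2, stop_patience=7):
    """Replay validation losses and return (stop_epoch, final_lr)."""
    best = float("inf")
    bad_for_lr = 0    # epochs without improvement since the last LR change
    bad_for_stop = 0  # epochs without improvement since the best epoch
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_for_lr = 0
            bad_for_stop = 0
        else:
            bad_for_lr += 1
            bad_for_stop += 1
        # Reduce the learning rate once the plateau exceeds lr_patience epochs.
        if bad_for_lr > lr_patience:
            lr *= factor
            bad_for_lr = 0
        # Stop once stop_patience epochs pass without improvement.
        if bad_for_stop >= stop_patience:
            return epoch, lr
    return len(val_losses), lr


# Toy curve: improvement for two epochs, then a plateau.
losses = [1.0, 0.9] + [0.95] * 10
stop_epoch, final_lr = run_schedule(losses)
print(stop_epoch, final_lr)
```

With these settings the learning rate can be reduced twice during a plateau before early stopping ends training, which gives the optimizer a chance to recover at a lower step size.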

Table 2 summarizes the architecture and training details.

Data splitting followed consistent ratios:

HarP dataset (135 volumes): 80% training (108), 10% validation (13), 10% testing (14).
MSD dataset (260 volumes): 70% training (182), 15% validation (39), 15% testing (39).
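A deterministic split that reproduces these counts can be sketched with the standard library. The rounding rule (floor for training and validation, remainder to test) is an assumption chosen because it matches the reported numbers; the article does not state how indices were shuffled, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n, train_pct, val_pct, seed=42):
    """Shuffle range(n) with a fixed seed and split by integer percentages."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


harp_split = split_indices(135, train_pct=80, val_pct=10)
msd_split = split_indices(260, train_pct=70, val_pct=15)
print([len(part) for part in harp_split])
print([len(part) for part in msd_split])
```

Using a private `random.Random(seed)` instance keeps the split independent of any other random calls in the training script.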

We first trained models from scratch on MSD under both preprocessing pipelines—MSD FP and MSD NP. After convergence, these pretrained models were fine-tuned on HarP using the same respective pipelines (FP → FP and NP → NP). Transfer learning reused the full model weights, with the same optimizer, learning rate, scheduler, and early stopping settings.

2.3.2. Implementation Details

All experiments were conducted in Python 3.9.13 using PyTorch for model development and training. Model training and inference were performed on a workstation equipped with two NVIDIA Quadro P4000 GPUs (8 GB each) and 64 GB RAM, with CUDA enabled for acceleration.

Data processing and deep learning workflows were managed using Visual Studio Code v1.104.1 and Jupyter Notebooks v2025.8.0. Data was split into training, validation, and test sets in a deterministic manner, ensuring the same split for every experiment. Batch size was fixed at 2 for all loaders.

2.4. V-Net Model Architecture and Training

To examine whether the effect of preprocessing depends on network architecture, we implemented a 3D V-Net backbone as an alternative to the 3D U-Net described in Section 2.3. V-Net is a fully convolutional encoder–decoder architecture designed for volumetric segmentation. Unlike the U-Net used here, which integrates residual connections in its convolutional blocks, our V-Net implementation employs standard convolutional blocks with deconvolution-based up-sampling to preserve spatial context across scales.

2.4.1. Architecture

The model takes a single-channel input volume of size 64 × 64 × 96. The encoder path consists of three convolutional stages with feature map sizes 16, 32, and 64, each followed by 2 × 2 × 2 max-pooling for down-sampling. A bottleneck stage with 128 feature maps captures high-level semantic information. The decoder path mirrors the encoder with three transposed 3D convolution layers for up-sampling, followed by convolutional blocks with 64, 32, and 16 filters. To restore spatial detail, skip connections are used to concatenate encoder and decoder features at matching resolutions. Unlike residual connections inside convolutional blocks, which directly add inputs to outputs, these skip connections simply pass multi-scale feature maps between network levels. A final 1 × 1 × 1 convolution produces a single-channel output map, followed by a sigmoid activation to generate voxel-wise hippocampus probabilities. Each convolutional block contains two consecutive 3 × 3 × 3 convolutions, each followed by instance normalization and ReLU activation. The complete network has approximately 1.4 million trainable parameters.

2.4.2. Training and Setup of V-Net Model

The training configuration for the V-Net backbone was kept consistent with the U-Net experiments to enable a direct comparison. We trained with the Adam optimizer using a learning rate of 1 × 10⁻³ and weight decay of 1 × 10⁻⁴. Binary cross-entropy with logits (BCE) was used as the primary loss function; we also implemented a hybrid Dice + BCE loss, though BCE provided stable convergence. Training was performed for up to 250 epochs with a batch size of 2, using early stopping if validation Dice did not improve for 7 epochs. Mixed precision training (PyTorch autocast and gradient scaling) was enabled to improve efficiency. Only the best-performing model, defined by the highest validation Dice score, was saved for evaluation. The complete architectural and training setup for the 3D V-Net backbone is summarized in Table 3.
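The two loss formulations mentioned above can be written out in NumPy to show what they compute. This is a sketch for illustration rather than the training code (which would use PyTorch tensors): the equal weighting of the Dice and BCE terms and the smoothing constant `eps` are assumptions, and the function names are hypothetical.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the stable form max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    return float(np.mean(np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))))


def soft_dice_loss(logits, targets, eps=1e-6):
    """One minus the soft Dice overlap between sigmoid(logits) and targets."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    probs = 0.5 * (1.0 + np.tanh(0.5 * x))  # overflow-safe sigmoid
    overlap = 2.0 * np.sum(probs * y) + eps
    return float(1.0 - overlap / (np.sum(probs) + np.sum(y) + eps))


def hybrid_loss(logits, targets):
    """Dice + BCE with equal weights (the weighting is an assumption)."""
    return bce_with_logits(logits, targets) + soft_dice_loss(logits, targets)


# Toy example: confident, correct logits give a loss close to zero.
targets = np.array([0.0, 1.0, 1.0, 0.0])
good_logits = np.array([-20.0, 20.0, 20.0, -20.0])
print(bce_with_logits(good_logits, targets), hybrid_loss(good_logits, targets))
```

Working from logits avoids taking the logarithm of a saturated sigmoid, which is the usual reason for preferring a with-logits formulation.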

2.5. Evaluation Metrics

We evaluated segmentation performance using a comprehensive set of metrics. Table 4 summarizes each metric and its meaning.
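For readers who want a concrete reference point, the sketch below computes Dice, Jaccard, and a symmetric Hausdorff distance on binary masks. It is a hedged illustration: the Hausdorff distance here is taken over all foreground voxel coordinates in voxel units using SciPy's `directed_hausdorff`, which may differ from the surface-based or millimetre-scaled variant used in the study, and OSR and USR are omitted because their exact definitions are given only in Table 4. The function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def dice_jaccard(pred, truth):
    """Return (Dice, Jaccard) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    inter = int(np.logical_and(pred, truth).sum())
    total = int(pred.sum()) + int(truth.sum())
    union = total - inter
    dice = 2.0 * inter / total if total else 1.0
    jaccard = inter / union if union else 1.0
    return dice, jaccard


def hausdorff_distance(pred, truth):
    """Symmetric Hausdorff distance between foreground voxel sets."""
    p = np.argwhere(pred)
    t = np.argwhere(truth)
    if len(p) == 0 or len(t) == 0:
        return float("inf")  # undefined when one mask is empty
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])


# Toy masks: two 4 x 4 x 4 cubes offset by one voxel along the first axis.
truth = np.zeros((8, 8, 8), dtype=np.uint8)
pred = np.zeros((8, 8, 8), dtype=np.uint8)
truth[2:6, 2:6, 2:6] = 1
pred[3:7, 2:6, 2:6] = 1

dice, jaccard = dice_jaccard(pred, truth)
hd = hausdorff_distance(pred, truth)
print(dice, jaccard, hd)
```

The toy case shows why overlap and boundary metrics can disagree: a one-voxel shift lowers Dice noticeably while the Hausdorff distance stays at a single voxel.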

2.6. Study Workflow

To provide a clear overview of the methodology, we summarize the entire study pipeline in Figure 7. The workflow begins with two datasets—HarP (clinical, resized to 64 × 64 × 96) and MSD (multi-center, padded to 43 × 59 × 47). Each dataset is processed under two alternative preprocessing pipelines: a Full Preprocessing (FP) pipeline including CLAHE and 3D wavelet filtering, and a No Preprocessing (NP) pipeline that uses only normalization.

Both pipelines were used to train two segmentation backbones: a custom 3D U-Net model with residual and skip connections (baseline) and a 3D V-Net model (alternate backbone check). Evaluation was performed under two scenarios: in-domain testing (training and testing within the same dataset) and transfer learning (pretraining on MSD and fine-tuning on HarP).

Results were assessed using overlap-based metrics (Dice, Jaccard, F1, Accuracy), boundary-based metrics (Hausdorff Distance, OSR, USR), statistical testing (Wilcoxon signed-rank test), and qualitative overlays. The workflow concludes with key findings, showing that NP generally outperforms FP on overlap metrics and also achieved lower Hausdorff distances, while FP retained a marginal advantage in under-segmentation.

3. Results

We present a comprehensive analysis of segmentation performance for both full preprocessing (FP) and no preprocessing (NP) pipelines on the HarP and MSD datasets. All results are reported for training, validation, and test sets, and are supported by statistical comparisons. We further analyze the effectiveness of transfer learning and benchmark our results against existing studies. Detailed qualitative visualizations and supplementary plots are provided where appropriate.

3.1. HarP Dataset: FP vs. NP

We compared the segmentation performance of full preprocessing (FP) and no preprocessing (NP) pipelines on the HarP dataset. Table 5 presents the train, validation, and test set results for both pipelines.

NP achieved the highest test Dice (0.8876) and Jaccard (0.7981), with lower or comparable over- and under-segmentation ratios versus FP. Both approaches yielded high accuracy and low Hausdorff distance.

A Wilcoxon signed-rank test was performed to compare the NP and FP pipelines across all test samples. Results are summarized in Table 6.

Statistical analysis shows NP significantly outperforms FP for most overlap and region-based metrics (Dice, Jaccard, OSR, USR, accuracy, F1, and loss; p < 0.05). There was no significant difference in Hausdorff distance (p = 0.224), indicating boundary quality is comparable for both methods.
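A paired comparison of this kind can be run with `scipy.stats.wilcoxon`. The scores below are hypothetical values invented for illustration (14 entries to mirror the HarP test set size); they are not the per-volume results from the study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-volume Dice scores for 14 test volumes (illustrative only,
# not the results reported in the article).
dice_np = np.linspace(0.86, 0.91, 14)
dice_fp = dice_np - (0.005 + 0.001 * np.arange(14))

# Paired, two-sided test on the per-volume differences.
statistic, p_value = wilcoxon(dice_np, dice_fp)
significant = bool(p_value < 0.05)
print(f"W = {statistic:.1f}, p = {p_value:.4g}, significant at 0.05: {significant}")
```

The test is paired because both pipelines are evaluated on the same volumes, so each difference removes the per-subject difficulty that would otherwise inflate the variance.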

In summary, no preprocessing (NP) provides a measurable advantage over full preprocessing (FP) for hippocampus segmentation on HarP, with statistically significant improvements in most key metrics.

3.2. MSD Dataset: FP vs. NP

Table 7 summarizes the training, validation, and test results for both the no preprocessing (NP) and full preprocessing (FP) pipelines on the MSD dataset.

Both pipelines performed similarly on the test set. NP achieved a Dice of 0.8642, while FP reached 0.8601. Accuracy, Jaccard, F1, IoU, OSR, and USR scores were nearly identical across both approaches. The Hausdorff distance was also comparable.

To assess statistical significance, a Wilcoxon signed-rank test was performed on the test set results for all key metrics (Table 8).

There was no significant difference between FP and NP in Dice, Jaccard, or F1 scores. NP achieved marginally higher accuracy and slightly better OSR and loss, while FP showed a small but statistically significant advantage in Hausdorff distance. However, these differences were minor.

It is noted that in some cases the test loss was lower than the training loss. This behavior is expected because dropout was applied during training but disabled during testing, which reduces noise and lowers loss. In addition, the test set was smaller and more homogeneous than the training set, which can also lead to lower loss values. These effects do not indicate bias or data leakage, and the statistical tests across all metrics confirm the robustness of the reported results.

In summary, on the large, multi-center MSD dataset, no preprocessing (NP) achieves results on par with or better than full preprocessing (FP) for nearly all segmentation metrics. Preprocessing did not lead to meaningful performance gains in this more heterogeneous data setting.

3.3. Transfer Learning: FP vs. NP

We evaluated the effect of preprocessing on transfer learning by taking models trained on the MSD dataset (source) and fine-tuning them on the HarP dataset (target) under both the FP and NP pipelines. The key metrics for test performance are summarized in Table 9.

Statistical analysis using the Wilcoxon signed-rank test (Table 10 below) revealed that the NP pipeline performed significantly better than FP in most metrics: Dice, accuracy, F1, Jaccard, OSR, USR, and loss (all p < 0.01). FP had a slightly lower Hausdorff distance, indicating marginally improved boundary accuracy, but the volumetric and region-based metrics consistently favored NP.

These results indicate that no preprocessing (NP) provides a measurable advantage for transfer learning from MSD to HarP, yielding significantly higher overlap, region accuracy, and lower loss compared to full preprocessing (FP). Only the Hausdorff distance was marginally better for FP, suggesting a minor benefit in boundary alignment, but this did not offset the overall gains observed with NP.

Although transfer learning is often associated with pretraining on very large datasets, such resources are not currently available for hippocampal segmentation. In practice, transfer between smaller datasets remains common in medical imaging because scanner protocols, populations, and acquisition settings still introduce significant domain shifts. Our MSD-to-HarP experiments reflect this setting, showing that even small-to-small transfer can yield measurable performance gains and provide insight into how preprocessing choices affect cross-domain adaptation.

Impact of Preprocessing in Transfer Learning

The findings suggest that extensive preprocessing is not required for optimal cross-domain adaptation in this setting. Models trained and fine-tuned with no preprocessing not only simplify the deployment pipeline but also deliver superior segmentation outcomes in key volumetric and regional metrics after transfer. This supports the use of streamlined dataset-agnostic workflows for multi-site or transfer learning applications in clinical MRI segmentation.

3.4. Qualitative and Visual Analysis

To supplement the quantitative findings, we performed qualitative analysis on representative test slices from both datasets. For the HarP dataset, slice 32 was selected; for the MSD dataset, slice 23 was used. Each comparison displays the ground truth label, the prediction from the no preprocessing (NP) model, and the prediction from the full preprocessing (FP) model.

In Figure 8, both NP and FP models localize the hippocampus accurately. The NP prediction shows slightly better coverage of the true hippocampal region and fewer false positives, consistent with its higher Dice and Jaccard scores. The FP prediction is slightly more conservative, occasionally missing peripheral voxels.

On the MSD dataset, Figure 9 shows that both pipelines capture the main hippocampal structure, but the NP model’s output aligns more closely with the ground truth boundaries. The FP result tends to slightly under-segment the region, which matches the observed lower Dice and Jaccard indices for FP on this dataset.

After transfer learning, the NP → HarP model prediction in Figure 10 below covers the ground truth with high fidelity, showing less under-segmentation than the FP → HarP model. These observations are consistent with the quantitative results, where transfer learning using no preprocessing led to higher overlap and accuracy.

Across both in-domain and transfer learning scenarios, the NP pipeline yields predictions that generally better match the true hippocampal contours, reducing over- and under-segmentation errors (see Figure 8, Figure 9 and Figure 10). However, the FP pipeline demonstrates a consistent advantage in boundary alignment compared with the NP pipeline, as reflected by the significantly lower Hausdorff distance in both test and transfer learning results. This suggests that while NP achieves higher volumetric overlap and region-based scores, FP may help preserve finer edge details and improve boundary accuracy, especially in challenging or ambiguous cases. These visual trends reinforce the quantitative findings and highlight that the optimal preprocessing strategy may depend on whether overlap or boundary precision is prioritized in a given application.

3.5. Comparative Analysis

To evaluate the performance of our proposed approach, we compared its best-performing configuration, the No Preprocessing (NP) pipeline on the HarP test set, with three recent open-access journal studies and our own earlier work on HarP. Table 11 summarizes the datasets, experimental setups, and key evaluation metrics. Since these studies used different datasets and experimental conditions, the reported values are not directly comparable; the table is provided to give context and illustrate that our approach achieves competitive performance despite lower resolution inputs and reduced model complexity.

3.6. Alternate Backbone (V-Net)

To examine whether the relative performance of FP and NP pipelines depends on the segmentation architecture, we repeated experiments using a 3D V-Net model (Section 2.4). Results are summarized in Table 12.

On the HarP dataset, V-Net reproduced the same trends observed with the U-Net backbone. The NP pipeline achieved higher Dice, Jaccard, F1, accuracy, and lower Hausdorff distance, demonstrating both superior volumetric overlap and more precise boundary localization. For example, on the HarP test set NP reached a Dice of 0.8833 versus 0.8683 for FP, a Jaccard of 0.7914 versus 0.7676, and a Hausdorff distance of 4.34 versus 4.67. FP showed a marginal advantage in under-segmentation ratio (USR), while NP reduced over-segmentation (OSR).

These results confirm that the superiority of NP is not restricted to the U-Net backbone but generalizes to the V-Net architecture, further strengthening the robustness of our conclusions.

4. Discussion

This study evaluated the effect of preprocessing pipelines on hippocampal segmentation performance using the HarP and MSD datasets. Two pipelines were compared: No Preprocessing (NP) and Full Preprocessing (FP), and a transfer learning setting from MSD to HarP was investigated. Performance was assessed using standard segmentation metrics, statistical significance testing, and qualitative visual analysis.

The results show that NP consistently achieved higher Dice, Jaccard, and F1 scores on the HarP dataset. For example, NP reached a Dice of 0.8876 on the HarP test set, outperforming FP with a Dice of 0.8753. Similar trends were seen for Jaccard and F1. Statistical testing confirmed these differences as significant for most overlap metrics. In contrast, FP achieved better Hausdorff Distance (HD) scores, indicating sharper and more accurate boundary localization, especially in challenging edge cases.

In the MSD dataset, the differences between NP and FP were smaller, and some metrics showed no significant change. This may reflect the more standardized acquisition and preprocessing of MSD images compared to HarP. Nevertheless, the HD advantage of FP persisted.

Transfer learning experiments from MSD to HarP revealed that NP-based fine-tuning achieved higher overlap metrics compared to FP-based fine-tuning. Dice for NP transfer reached 0.8843, while FP transfer reached 0.8736. All tested metrics showed significant differences, with NP outperforming FP in most cases except HD, where FP was superior. These results suggest that minimal preprocessing preserves domain-relevant features that are beneficial when adapting to a new dataset.

While these findings are promising, it should be noted that the domain shift between HarP and MSD may not fully capture the range of variability encountered in routine clinical practice, such as vendor-specific differences, motion artifacts, or older scanner protocols. In addition, the larger size of the MSD dataset compared with HarP introduces an imbalance that may influence transfer learning performance. Thus, the superiority of the NP pipeline should be interpreted in the context of the datasets tested, and further validation on broader and lower-quality datasets is needed to confirm robustness.

Qualitative results supported the quantitative findings. NP predictions generally aligned more closely with manual labels in terms of coverage, while FP predictions tended to capture boundaries with higher sharpness, reducing boundary errors. In some cases with irregular hippocampal shapes, FP produced better boundary adherence, explaining its lower HD values.

When compared to recent studies such as Yang et al. (2024) [5], Widodo et al. (2024) [6], and Qiu et al. (2021) [7], our NP HarP configuration achieved higher Dice and Jaccard scores, despite using lower input resolution and a smaller model. Compared to our previous work on HarP, the proposed approach achieved higher overlap scores and lower HD, reflecting the benefits of architectural refinement and experimental design that included both in-domain and cross-domain evaluation.

These findings indicate that preprocessing strategies should be selected based on the target metric and deployment scenario. NP is advantageous for maximizing overlap-based accuracy, particularly in transfer learning scenarios. FP is preferable for tasks where precise boundary localization is critical.

This study contributes to the ongoing advancements in medical image segmentation, where specialized architectures are continually developed for targeted applications. For example, Prajapati and Kwon [8] proposed the SIP-UNet architecture, which achieved high accuracy in brain tissue segmentation. Similarly, numerous works have focused on Alzheimer’s disease and other neurological disorders by targeting the hippocampal region of the brain [9,14,15,16]. These efforts demonstrate the broad applicability of segmentation techniques across various brain regions and medical conditions.

Another consideration is that the architecture itself may contribute to the observed trends. The custom 3D U-Net employed here integrates residual connections and instance normalization, which are known to improve robustness to intensity variation and stabilize optimization. To address this concern, we repeated the experiments with a 3D V-Net backbone, which does not include residual connections inside convolutional blocks but relies on encoder–decoder skip concatenations. The results with V-Net reproduced the same trends as U-Net: NP achieved higher Dice, Jaccard, F1, accuracy, and lower Hausdorff distances, while FP retained a marginal advantage in under-segmentation. These consistent outcomes across two widely used backbones confirm that the observed NP advantage is not an artifact of a single architecture and strengthens the generalizability of our findings.

Limitations of this study include the use of only two datasets and binary hippocampal segmentation. The results may not generalize to multi-class hippocampal subfield segmentation or other brain structures. Additionally, no domain adaptation methods were applied, and the alternate V-Net backbone was evaluated only on HarP using single test-set values, without repeated runs or statistical testing. Future work will explore multi-dataset training, harmonization techniques, and lightweight architectures to support deployment in resource-constrained clinical environments.

5. Conclusions

This work presented a systematic comparison of No Preprocessing and Full Preprocessing pipelines for 3D hippocampal segmentation using the HarP and MSD datasets. The evaluation included in-domain training, cross-domain transfer learning, statistical testing, and qualitative analysis.

No Preprocessing consistently outperformed Full Preprocessing in Dice, Jaccard, and F1 metrics, especially on the HarP dataset and in transfer learning from MSD to HarP. Full Preprocessing delivered lower Hausdorff Distance on the HarP test set (3.0794 vs. 3.1337, a difference that was not statistically significant) and in transfer learning (3.1919 vs. 3.3142), indicating better boundary precision in those settings. The choice between NP and FP should therefore depend on whether overlap accuracy or boundary precision is prioritized.

Compared to recent open-access studies and our own previous work, the proposed approach achieved competitive or superior results while maintaining lower input resolution and model complexity. This demonstrates the effectiveness of targeted preprocessing evaluation and transfer learning strategies for hippocampal segmentation.

The study contributes practical guidance for selecting preprocessing pipelines in clinical and multi-center neuroimaging workflows. Importantly, the consistency of NP’s advantage across both U-Net and V-Net backbones suggests that the findings are robust to network architecture and not restricted to a single model design.

6. Availability of Data and Code

The Medical Segmentation Decathlon (MSD) dataset used in this study is publicly available at http://medicaldecathlon.com/ (accessed on 24 July 2025).

The HarP dataset can be accessed by qualified researchers upon request through the Alzheimer’s Disease Neuroimaging Initiative (ADNI) portal (https://ida.loni.usc.edu/login.jsp?project=ADNI) (accessed on 3 August 2025) and the Harmonized Hippocampal Protocol resource (http://www.hippocampal-protocol.net/SOPs/index.php) (accessed on 3 August 2025). In accordance with ADNI/LONI policy, redistribution or resharing of the dataset or its derivatives is not permitted. The processed versions used in this study therefore cannot be shared directly, but they can be reproduced by downloading the original HarP data from ADNI/LONI and applying the resizing and preprocessing steps described in Section 2.2.

The code used in this study is available at https://github.com/FaizaanFazal/Ablation-full-code-3DUnet-hippocampus (accessed on 3 August 2025). Please note that any data uploaded to the repository is provided for reference only and may not be reused; permission to use the data must be obtained from ADNI, as described above.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math13203309/s1. The supplementary document provides detailed step-by-step instructions for accessing and downloading the HarP dataset from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) website. It includes information on dataset registration, search queries, and file structure, along with guidelines for organizing image and label directories to reproduce the experiments described in this study. Additional notes on preprocessing setup, normalization parameters, and file naming conventions are also provided to facilitate full replication of the training and evaluation workflow.

Author Contributions

F.F.K. generated the idea and conducted the experiments. J.-H.K. and J.-I.K. contributed to the basic idea and reviewed the experimental process and its confirmation. G.-R.K. reviewed the idea and performed the final verification of the results. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Research Foundation of Korea (NRF) funded by the Korean Government Ministry of Science and ICT (MSIT) under Grant NRF-2021R1I1A3050703. This research was also supported by the BrainKorea21Four Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant 4299990114316; by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-0027033); and by the Global-Learning & Academic research institution for Master’s·PhD students, and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No. RS-2023-00285353). This work was supported in part by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) funded by the National Institutes of Health under Grant U01 AG024904, and in part by the Department of Defense ADNI under Award W81XWH-12-2-0012. Additionally, this research was supported by the 2025 Gwangju Regional Innovation System & Education (RISE) Project.

Institutional Review Board Statement

This study holds IRB acceptance number 2-1041055-AB-N-01-2021-34 from Chosun University.

Data Availability Statement

The datasets used in this study were acquired from the ADNI homepage, which makes them freely available to all researchers and scientists for experiments on Alzheimer’s disease. They can be downloaded from the ADNI website: https://ida.loni.usc.edu/login.jsp (accessed on 3 August 2025).

Acknowledgments

ADNI (Alzheimer’s Disease Neuroimaging Initiative) researchers contributed to the design and implementation of ADNI and provided data, but they were not directly involved in the analysis or writing of this report. A comprehensive list of ADNI investigators can be accessed at https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf (accessed on 3 August 2025). ADNI receives support from various sources including the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, as well as several generous donors such as AbbVie, Alzheimer’s Association, Alzheimer’s Drug Discovery Foundation, Araclon Biotech, BioClinica Inc., Biogen, Bristol-Myers Squibb Company, CereSpir Inc., Cogstate, Eisai Inc., Elan Pharmaceuticals Inc., Eli Lilly and Company, EuroImmun, F. Hoffmann-La Roche Ltd. and its affiliate Genentech Inc., Fujirebio, and GE Healthcare. ADNI’s clinical centers in Canada are funded by The Canadian Institutes of Health Research. The Foundation for the National Institutes of Health facilitates contributions from the private sector. The grantee of ADNI is the Northern California Institute for Research and Education, and the study coordinator is the Alzheimer’s Therapeutic Research Institute at the University of Southern California. The ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. Correspondence should be addressed to GR-K, grkwon@chosun.ac.kr.

Conflicts of Interest

The authors declare that the data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The patients/participants provided their written informed consent to participate in this study. The funder and the investigators within ADNI contributed to the data collection but did not participate in the analysis, the interpretation of data, the writing of this article, or the decision to submit it for publication.

Abbreviations

The following abbreviations are used in this manuscript:

ADNI	Alzheimer’s Disease Neuroimaging Initiative
CLAHE	Contrast Limited Adaptive Histogram Equalization
FP	Full Preprocessing
NP	No Preprocessing
HD	Hausdorff Distance
HarP	Harmonized Protocol for Hippocampal Segmentation
IoU	Intersection over Union
MRI	Magnetic Resonance Imaging
MSD	Medical Segmentation Decathlon
OSR	Over-Segmentation Ratio
SCE	Statistical Contrast Enhancement
T1w	T1-weighted MRI
USR	Under-Segmentation Ratio

References

1. Wang, H.; Lei, C.; Zhao, D.; Gao, L.; Gao, J. DeepHipp: Accurate segmentation of hippocampus using 3D dense-block based on attention mechanism. BMC Med. Imaging 2023, 23, 158.
2. Khan, F.F.; Kim, J.-H.; Park, C.-S.; Kim, J.-I.; Kwon, G.-R. A Resource-Efficient 3D U-Net for Hippocampus Segmentation Using CLAHE and SCE-3DWT Techniques. IEEE Access 2025, 13, 99923–99938.
3. Carmo, D.; Silva, B.; Yasuda, C.; Rittner, L.; Lotufo, R. Extended 2D Consensus Hippocampus Segmentation. arXiv 2020.
4. Rasheed, J.; Shaikh, M.U.; Jafri, M.; Khan, A.U.; Sandhu, M.; Shin, H. Leveraging CapsNet for enhanced classification of 3D MRI images for Alzheimer’s diagnosis. Biomed. Signal Process. Control 2025, 103, 107384.
5. Yang, Q.; Wang, C.; Pan, K.; Xia, B.; Xie, R.; Shi, J. An improved 3D-UNet-based brain hippocampus segmentation model based on MR images. BMC Med. Imaging 2024, 24, 166.
6. Widodo, R.S.S.; Purnama, I.K.E.; Rachmadi, R.F. Volumetric Hippocampus Segmentation Using 3D U-Net Based On Transfer Learning. In Proceedings of the 2024 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), Xi’an, China, 14–16 June 2024; pp. 1–6.
7. Qiu, Q.; Yang, Z.; Wu, S.; Qian, D.; Wei, J.; Gong, G.; Wang, L.; Yin, Y. Automatic segmentation of hippocampus in hippocampal sparing whole brain radiotherapy: A multitask edge-aware learning. Med. Phys. 2021, 48, 1771–1780.
8. Prajapati, R.; Kwon, G.-R. SIP-UNet: Sequential Inputs Parallel UNet Architecture for Segmentation of Brain Tissues from Magnetic Resonance Images. Mathematics 2022, 10, 2755.
9. Sghirripa, S.; Bhalerao, G.; Griffanti, L.; Gillis, G.; Mackay, C.; Voets, N.; Wong, S.; Jenkinson, M.; For the Alzheimer’s Disease Neuroimaging Initiative. Evaluating Traditional, Deep Learning and Subfield Methods for Automatically Segmenting the Hippocampus From MRI. Hum. Brain Mapp. 2025, 46, e70200.
10. Frisoni, G.B.; Jack, C.R.; Bocchetta, M.; Bauer, C.; Frederiksen, K.S.; Liu, Y.; Preboske, G.; Swihart, T.; Blair, M.; Cavedo, E.; et al. The EADC-ADNI Harmonized Protocol for manual hippocampal segmentation on magnetic resonance: Evidence of validity. Alzheimers Dement. J. Alzheimers Assoc. 2015, 11, 111–125.
11. Boccardi, M.; Bocchetta, M.; Morency, F.C.; Collins, D.L.; Nishikawa, M.; Ganzola, R.; Grothe, M.J.; Wolf, D.; Redolfi, A.; Pievani, M.; et al. Training labels for hippocampal segmentation based on the EADC-ADNI harmonized hippocampal protocol. Alzheimers Dement. 2015, 11, 175–183.
12. Antonelli, M.; Reinke, A.; Bakas, S.; Farahani, K.; Kopp-Schneider, A.; Landman, B.A.; Litjens, G.; Menze, B.; Ronneberger, O.; Summers, R.M.; et al. The Medical Segmentation Decathlon. Nat. Commun. 2022, 13, 4128.
13. Pizer, S.M.; Johnston, R.E.; Ericksen, J.P.; Yankaskas, B.C.; Muller, K.E. Contrast-limited adaptive histogram equalization: Speed and effectiveness. In Proceedings of the First Conference on Visualization in Biomedical Computing, Atlanta, GA, USA, 22–25 May 1990; pp. 337–345.
14. Pang, S.; Lu, Z.; Jiang, J.; Zhao, L.; Lin, L.; Li, X.; Lian, T.; Huang, M.; Yang, W.; Feng, Q. Hippocampus Segmentation Based on Iterative Local Linear Mapping With Representative and Local Structure-Preserved Feature Embedding. IEEE Trans. Med. Imaging 2019, 38, 2271–2280.
15. Morra, J.H.; Tu, Z.; Apostolova, L.G.; Green, A.E.; Toga, A.W.; Thompson, P.M. Comparison of AdaBoost and Support Vector Machines for Detecting Alzheimer’s Disease Through Automated Hippocampal Segmentation. IEEE Trans. Med. Imaging 2010, 29, 30–43.
16. Barmpoutis, A.; Vemuri, B.C.; Shepherd, T.M.; Forder, J.R. Tensor Splines for Interpolation and Approximation of DT-MRI With Applications to Segmentation of Isolated Rat Hippocampi. IEEE Trans. Med. Imaging 2007, 26, 1537–1546.

Figure 1. Image slice and corresponding label representation.

Figure 2. Intensity distribution Histogram.

Figure 3. MSD dataset classes before binarization.

Figure 4. Detailed analytical image slice of HarP showing the histogram comparison and edge difference between the original and preprocessed images. (Note for bar chart: blue (original vs. CLAHE), red (original vs. SCE-3DWT), green (CLAHE vs. SCE-3DWT).)

Figure 5. Detailed analytical image slice of MSD showing the histogram comparison and edge difference between the original and preprocessed images.

Figure 6. Detailed visualization of the 3D U-Net model with connections, layers, and channel counts.

Figure 7. Overall workflow of the study. Two preprocessing pipelines (FP and NP) were evaluated on HarP and MSD datasets using two segmentation backbones: 3D U-Net (baseline) and 3D V-Net (alternate backbone check). Performance was assessed under in-domain and transfer learning scenarios using multiple quantitative metrics and statistical tests.

Figure 8. Visual comparison of segmentation results on the HarP dataset (slice 32): (a) Ground truth segmentation in red, (b) Prediction from the NP pipeline in green, (c) Prediction from the FP pipeline in blue.

Figure 9. Visual comparison of segmentation results on the MSD dataset (slice 23): (a) Ground truth segmentation in red, (b) Prediction from the NP pipeline in green, (c) Prediction from the FP pipeline in blue.

Figure 10. Transfer learning performance on HarP (slice 32) after fine-tuning from MSD: (a) Ground truth in red, (b) Prediction from the FP → HarP transfer model in green, (c) Prediction from the NP → HarP transfer model in blue.

Table 1. Preprocessing pipelines and steps used in this study.

Step	FP (Full Preprocessing)	NP (No Preprocessing)
CLAHE	Yes	No
Wavelet Enhancement	Yes (SCE-3DWT)	No
Padding/Cropping	To (64, 64, 96) HarP or (43, 59, 47) MSD	To (64, 64, 96) HarP or (43, 59, 47) MSD
Intensity Norm.	Scaled to [0, 1]	Scaled to float32 & [0, 1]
Label Binarization	0 = background, 1 = hippocampus	0 = background, 1 = hippocampus
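
The two steps shared by both pipelines in Table 1 (padding/cropping to a fixed shape and scaling intensities to [0, 1]) can be illustrated with a minimal sketch. It assumes centered padding and cropping and min-max scaling; the table does not state either choice, and the function names `pad_crop_plan` and `min_max_scale` are hypothetical, not taken from the authors' repository.

```python
def pad_crop_plan(shape, target):
    """Per-axis plan to bring `shape` to `target`.

    Assumption: padding and cropping are centered. Each entry is either
    ("pad", voxels_before, voxels_after) or ("crop", start, stop).
    """
    plan = []
    for size, want in zip(shape, target):
        diff = want - size
        if diff >= 0:
            before = diff // 2
            plan.append(("pad", before, diff - before))
        else:
            start = (-diff) // 2
            plan.append(("crop", start, start + want))
    return plan


def min_max_scale(values):
    """Scale a flat list of intensities to [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant volume: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical input volume of shape (60, 70, 96) brought to the HarP target.
print(pad_crop_plan((60, 70, 96), (64, 64, 96)))
print(min_max_scale([10, 20, 30]))
```

In a real pipeline the same plan would typically be applied with an array library; the list-based version here only makes the arithmetic explicit.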

Table 2. Summary of Model Architecture and Training details.

Attribute	Value
Input channels	1
Output channels	1 (binary segmentation)
Base channels	32
Dropout rate	0.2
Instance normalization	Used after each convolution block
Batch size	2
Optimizer	Adam
Initial learning rate	1 × 10⁻³
Scheduler	ReduceLROnPlateau (factor 0.1, patience 2)
Early stopping patience	7 epochs
Max epochs (training)	250
Random seed	42
Trainable parameters	~22.6 million
Inference time per volume	~72.5 ms
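
To make the interaction between the scheduler and early-stopping settings in Table 2 concrete, the following is a simplified, framework-free simulation. It assumes that validation loss is the monitored quantity, that any strict decrease counts as an improvement, and that training stops after 7 consecutive epochs without improvement; the actual PyTorch `ReduceLROnPlateau` also applies an improvement threshold that is omitted here. The function name `simulate_schedule` and the loss values are hypothetical.

```python
def simulate_schedule(val_losses, lr=1e-3, factor=0.1, lr_patience=2,
                      stop_patience=7, max_epochs=250):
    """Replay a list of validation losses and track learning rate and stopping.

    The learning rate is multiplied by `factor` once the number of epochs
    without improvement exceeds `lr_patience`; training stops once
    `stop_patience` consecutive epochs show no improvement.
    """
    best = float("inf")
    bad_lr = 0      # epochs without improvement since the last LR change
    bad_stop = 0    # consecutive epochs without improvement
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            bad_lr = 0
            bad_stop = 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr > lr_patience:
            lr *= factor
            bad_lr = 0
        history.append((epoch, lr))
        if bad_stop >= stop_patience:
            break
    return best, history


# Hypothetical loss curve: two improving epochs followed by a plateau.
best, history = simulate_schedule([1.0, 0.9] + [0.95] * 10)
print(best, len(history), history[-1])
```

With these settings the learning rate can drop twice (to roughly 1e-5) before early stopping ends a stalled run, which is why the two patience values are usually read together.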

Table 3. Three-Dimensional V-Net model architecture and training configuration.

Attribute	Value
Input volume size	1 × 64 × 64 × 96 (channels × D × H × W)
Encoder filters	16 → 32 → 64
Bottleneck filters	128
Decoder filters	64 → 32 → 16
Skip connections	Encoder–decoder concatenation (not residual inside blocks)
Conv block structure	Two 3 × 3 × 3 convolutions + InstanceNorm + ReLU
Output layer	1 × 1 × 1 convolution + sigmoid
Total parameters	~1.4 million
Optimizer	Adam (lr = 1 × 10⁻³, weight decay =1 × 10⁻⁴)
Loss function	Binary cross-entropy (BCE) with logits; Dice + BCE also tested
Batch size	2
Training epochs	Max 250, with early stopping (patience = 7)
Mixed precision	Enabled (PyTorch autocast + GradScaler)
Model selection	Best validation Dice score checkpoint
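
As a rough plausibility check on the parameter count in Table 3, the sketch below tallies convolution weights and biases for an encoder-decoder with the listed filter widths. It assumes upsampling by 2 × 2 × 2 transposed convolutions, instance normalization without learnable parameters, and skip concatenation that doubles the decoder input channels; none of these details are stated in the table, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 3D (transposed) convolution with a cubic kernel."""
    return c_in * c_out * kernel ** 3 + c_out


def double_conv(c_in, c_out):
    """Two 3 x 3 x 3 convolutions, as in the 'Conv block structure' row."""
    return conv_params(c_in, c_out, 3) + conv_params(c_out, c_out, 3)


def vnet_param_estimate(in_ch=1, enc=(16, 32, 64), bottleneck=128, out_ch=1):
    total = 0
    prev = in_ch
    for width in enc:                      # encoder: 16 -> 32 -> 64
        total += double_conv(prev, width)
        prev = width
    total += double_conv(prev, bottleneck)  # bottleneck: 128
    prev = bottleneck
    for width in reversed(enc):            # decoder: 64 -> 32 -> 16
        total += conv_params(prev, width, 2)    # assumed transposed-conv upsampling
        total += double_conv(2 * width, width)  # concatenated skip doubles the input
        prev = width
    total += conv_params(prev, out_ch, 1)  # 1 x 1 x 1 output convolution
    return total


print(vnet_param_estimate())
```

Under these assumptions the estimate comes out close to the ~1.4 million parameters reported, which suggests the listed widths and block structure account for essentially all of the model's weights.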

Table 4. Summary of evaluation metrics used.

Metric	Description
Loss	Binary Cross-Entropy (BCE)
Accuracy (acc)	Voxel-wise ratio of correctly predicted labels to total voxels.
Dice coefficient	A measure of volumetric overlap between prediction and ground truth. Higher is better.
Jaccard index (IoU)	Intersection over union of predicted and true segment volumes.
F1 score	Equivalent to Dice for binary segmentation—it balances recall and precision.
OSR (Over-Segmentation Ratio)	Fraction of predicted hippocampus voxels not overlapping ground truth. It quantifies surplus false positive regions.
USR (Under-Segmentation Ratio)	Fraction of ground truth hippocampus voxels not captured by prediction. It measures missing true positives.
Hausdorff distance (HD)	The longest surface distance between prediction and ground truth masks. It highlights boundary alignment errors.
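
The overlap and boundary metrics in Table 4 can be sketched on sets of foreground voxel coordinates. Two assumptions are made here and are not confirmed by the text: OSR and USR are normalized by the sum of predicted and ground-truth volumes (a choice under which Dice + OSR + USR = 1, which is consistent with the values reported in Table 5, e.g. 0.8876 + 0.0594 + 0.0531 ≈ 1), and the Hausdorff distance is computed over all foreground voxels in voxel units rather than over extracted surfaces. Function names are hypothetical, and both masks are assumed to be non-empty.

```python
import math


def overlap_metrics(pred, truth):
    """pred, truth: non-empty sets of (z, y, x) foreground voxel coordinates."""
    tp = len(pred & truth)   # voxels in both masks
    fp = len(pred - truth)   # predicted but not in ground truth
    fn = len(truth - pred)   # ground truth missed by the prediction
    denom = len(pred) + len(truth)
    return {
        "dice": 2 * tp / denom,
        "jaccard": tp / (tp + fp + fn),
        "osr": fp / denom,   # assumed normalization
        "usr": fn / denom,   # assumed normalization
    }


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between two voxel sets (brute force)."""
    def directed(a, b):
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(pred, truth), directed(truth, pred))


# Toy example: a 2 x 2 x 2 cube and the same cube shifted by one voxel in x.
truth = {(z, y, x) for z in range(2) for y in range(2) for x in range(2)}
pred = {(z, y, x + 1) for (z, y, x) in truth}
print(overlap_metrics(pred, truth))
print(hausdorff(pred, truth))
```

The brute-force distance search is quadratic in the number of voxels, so it is suitable only for small illustrations; practical evaluations generally rely on distance transforms or library routines.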

Table 5. Comparison of NP and FP pipelines on the HarP dataset.

Pipeline	NP Pipeline			FP Pipeline
Metric	Train	Val	Test	Train	Val	Test
Loss	0.0112	0.0128	0.0085	0.0138	0.0143	0.0103
Acc	0.9965	0.9958	0.9970	0.9962	0.9955	0.9967
Dice	0.8743	0.8454	0.8876	0.8651	0.8374	0.8753
Jaccard	0.7787	0.7411	0.7981	0.7648	0.7292	0.7784
F1	0.9205	0.9055	0.9281	0.9153	0.9009	0.9206
OSR	0.0639	0.0796	0.0594	0.0705	0.0854	0.0665
USR	0.0619	0.0750	0.0531	0.0645	0.0772	0.0583
HD	3.3991	3.6605	3.1337	3.2654	3.8091	3.0794

Footer Note: Bold value is the best value in a row.

Table 6. Statistical analysis showing significance of NP and FP pipelines on the HarP dataset.

Metric	NP (Mean ± Std)	FP (Mean ± Std)	Wilcoxon p-Value	Significant
Dice	0.8803 ± 0.0672	0.8713 ± 0.0623	9.07 × 10⁻¹⁸	Yes
Acc	0.9967 ± 0.0019	0.9964 ± 0.0018	5.95 × 10⁻¹⁸	Yes
F1	0.9253 ± 0.0311	0.9195 ± 0.0295	1.24 × 10⁻¹⁸	Yes
Jaccard	0.7911 ± 0.0824	0.7761 ± 0.0774	1.79 × 10⁻¹⁸	Yes
OSR	0.0656 ± 0.0392	0.0709 ± 0.0411	1.41 × 10⁻⁷	Yes
USR	0.0542 ± 0.0340	0.0580 ± 0.0307	5.51 × 10⁻⁹	Yes
HD	3.0510 ± 0.9788	3.1373 ± 1.0010	2.24 × 10⁻¹	No
Loss	0.0097 ± 0.0069	0.0112 ± 0.0056	2.04 × 10⁻¹⁹	Yes
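
The paired comparison behind the Wilcoxon p-values in the statistical tables can be illustrated with a small standard-library sketch. It uses the normal approximation without tie or continuity corrections, so it will not reproduce exact p-values for small samples; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used. The function name and the per-case scores below are hypothetical and are not data from this study.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, w_minus, p


# Hypothetical per-case scores for two pipelines evaluated on the same cases.
np_scores = [10, 12, 14, 16, 18, 20, 22, 24]
fp_scores = [9, 10, 11, 12, 13, 14, 15, 16]
print(wilcoxon_signed_rank(np_scores, fp_scores))
```

Because the test ranks paired differences rather than comparing means, a significant result can occasionally point in a different direction than the mean ± std columns suggest, so both should be read together.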

Table 7. Comparison of NP and FP pipelines on MSD Dataset.

Pipeline	NP Pipeline			FP Pipeline
Metric	NP Train	NP Val	NP Test	FP Train	FP Val	FP Test
Loss	0.0102	0.0105	0.0093	0.0102	0.0111	0.0097
Acc	0.9966	0.9959	0.9964	0.9966	0.9958	0.9963
Dice	0.8736	0.8467	0.8642	0.8726	0.8401	0.8601
Jaccard	0.7769	0.7350	0.7616	0.7753	0.7252	0.7553
F1	0.9196	0.9035	0.9139	0.9190	0.8997	0.9115
IoU	0.7769	0.7350	0.7616	0.7753	0.7252	0.7553
OSR	0.0609	0.0876	0.0750	0.0578	0.0870	0.0757
USR	0.0657	0.0659	0.0610	0.0697	0.0731	0.0644
HD	3.1669	3.0939	3.1479	3.2071	3.2136	3.2590

Footer Note: Bold value is the best value in a row.

Table 8. Statistical analysis showing significance of NP and FP pipelines on MSD Dataset.

Metric	NP (Mean ± Std)	FP (Mean ± Std)	Wilcoxon p-Value	Significant
Dice	0.8642 ± 0.0376	0.8601 ± 0.0393	0.163	No
Acc	0.9964 ± 0.0027	0.9963 ± 0.0027	0.012	Yes
F1	0.9139 ± 0.0170	0.9115 ± 0.0176	0.498	No
Jaccard	0.7616 ± 0.0373	0.7553 ± 0.0390	0.169	No
OSR	0.0750 ± 0.0161	0.0757 ± 0.0173	0.009	Yes
USR	0.0610 ± 0.0364	0.0644 ± 0.0378	0.533	No
HD	3.1479 ± 1.6077	3.2590 ± 1.7626	1.77 × 10⁻²⁹	Yes (FP)
Loss	0.0093 ± 0.0157	0.0097 ± 0.0151	2.82 × 10⁻²³	Yes (NP)

Table 9. Comparison of NP and FP pipelines on transfer learning.

Metric	MSD → HarP NP (Test)	MSD → HarP FP (Test)
Loss	0.0075	0.0083
Acc	0.9969	0.9966
Dice	0.8843	0.8736
Jaccard	0.7928	0.7757
F1	0.9260	0.9196
OSR	0.0638	0.0705
USR	0.0519	0.0560
HD	3.3142	3.1919

Footer Note: Bold value is the best value in a row.

Table 10. Statistical analysis showing significance of NP and FP pipelines on Transfer learning.

Metric	NP (Mean ± Std)	FP (Mean ± Std)	Wilcoxon p-Value	Significant
Dice	0.8764 ± 0.0668	0.8705 ± 0.0627	4.99 × 10⁻¹³	Yes
Acc	0.9966 ± 0.0019	0.9964 ± 0.0018	1.05 × 10⁻¹¹	Yes
F1	0.9229 ± 0.0308	0.9191 ± 0.0297	1.54 × 10⁻¹³	Yes
Jaccard	0.7848 ± 0.0817	0.7750 ± 0.0778	2.82 × 10⁻¹³	Yes
OSR	0.0703 ± 0.0393	0.0733 ± 0.0410	7.67 × 10⁻³	Yes
USR	0.0534 ± 0.0335	0.0563 ± 0.0309	4.89 × 10⁻⁵	Yes
HD	3.2483 ± 1.0311	3.0612 ± 0.8604	5.13 × 10⁻³	Yes (FP)
Loss	0.0089 ± 0.0071	0.0092 ± 0.0061	2.09 × 10⁻¹⁴	Yes (NP)

Table 11. Comparison of the proposed method (NP pipeline on HarP dataset) with recent studies in hippocampal segmentation. Results are reported as in the original works.

Aspect	Proposed Study (NP HarP)	Yang et al., 2024 [5]	Widodo et al., 2024 [6]	Qiu et al., 2021 [7]	Previous Study [2]
Datasets	EADC-ADNI HarP	Hangzhou Cancer Hospital	MSD + EADC-ADNI HarP	247 T1-weighted MRI scans	EADC-ADNI HarP
Dataset size	135 (train: 108)	200 (train: 145)	260 (MSD) + 135 (HarP)	247 (train: 80%)	135 (train: 108)
Input image size	64 × 64 × 96	128 × 128 × 128	64 × 64 × 64	Not specified	64 × 64 × 96
Transfer learning	No	No	Yes (2D U-Net and 3D U-Net)	Not specified	No
Dice coefficient	0.8876	0.8674 ± 0.0257	0.87 (1B experiment)	0.8483 ± 0.0036	0.8838
Jaccard index	0.7981	0.7668 ± 0.0392	≥ 0.78 (1B experiment)	0.75–0.80	0.7920
F1 score	0.9281	Not specified	Not specified	Not specified	0.9258
OSR	0.0594	0.0621 ± 0.0274	Not specified	Not specified	0.0594
USR	0.0531	0.1241 ± 0.0445	Not specified	Not specified	0.0569
HD	3.1337 voxels	3.9032 ± 1.3248 voxels	Not specified	7.5706 ± 1.2330 mm ≅ 3.671 voxels	3.2659 voxels

Footer Note: Bold value is the best value in a row.

Table 12. Test set comparison of FP vs. NP on HarP dataset using V-Net. As V-Net was included as an alternate backbone check, only single test set values are reported (no repeated runs or statistical testing).

Metric	NP (Test)	FP (Test)	NP vs. FP
Dice	0.8833	0.8683	NP ↑
F1	0.9254	0.9163	NP ↑
Recall	0.8948	0.8944	NP ↑
Jaccard	0.7914	0.7676	NP ↑
Hausdorff	4.3425	4.6690	NP ↓ (lower is better)
USR	0.0523	0.0515	FP ↓ (lower is better)
OSR	0.0645	0.0803	NP ↓ (lower is better)
Accuracy	0.9972	0.9968	NP ↑
Loss	0.0254	0.0291	NP ↓ (lower is better)

Footer Note: Bold value is the best value in a row. In the last column, NP or FP indicates which pipeline is better, and the arrow indicates whether a higher (↑) or lower (↓) value is better.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Khan, F.F.; Kim, J.-H.; Kim, J.-I.; Kwon, G.-R. Dataset-Aware Preprocessing for Hippocampal Segmentation: Insights from Ablation and Transfer Learning. Mathematics 2025, 13, 3309. https://doi.org/10.3390/math13203309
