Article

Enhanced U-Net for Spleen Segmentation in CT Scans: Integrating Multi-Slice Context and Grad-CAM Interpretability

1 Department of Computer Science, Brac University, Kha 224 Pragati Sarani, Merul Badda, Dhaka 1212, Bangladesh
2 Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
3 Economics & Decision Science, University of South Dakota, Vermillion, SD 57069, USA
4 Department of Computer Science and Engineering, Port City International University, Chittagong 4000, Bangladesh
5 Department of Computer Science and Engineering, Northern University, Dhaka 1210, Bangladesh
6 AI and Big Data Department, Woosong University, Daejeon 34606, Republic of Korea
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(4), 56; https://doi.org/10.3390/biomedinformatics5040056
Submission received: 23 August 2025 / Revised: 20 September 2025 / Accepted: 28 September 2025 / Published: 8 October 2025

Abstract

Accurate spleen segmentation in abdominal CT scans remains a critical challenge in medical image analysis due to variable morphology, low tissue contrast, and proximity to similar anatomical structures. This paper presents an enhanced U-Net architecture that addresses these challenges through multi-slice contextual integration and interpretable deep learning. Our approach incorporates three-channel inputs from adjacent CT slices, implements a hybrid loss function combining Dice and binary cross-entropy terms, and integrates Grad-CAM visualization for enhanced model interpretability. Comprehensive evaluation on the Medical Decathlon dataset demonstrates superior performance, with a Dice similarity coefficient of 0.923 ± 0.04, outperforming standard 2D approaches by 3.2%. The model exhibits robust performance across varying slice thicknesses, contrast phases, and pathological conditions. Grad-CAM analysis reveals focused attention on spleen–tissue interfaces and internal vascular structures, providing clinical insight into model decision-making. The system demonstrates practical applicability for automated splenic volumetry, trauma assessment, and surgical planning, with processing times suitable for clinical workflow integration.

1. Introduction

Accurate segmentation of abdominal organs in computed tomography (CT) scans forms the cornerstone of modern medical image analysis, enabling precise quantitative assessment for surgical planning, disease monitoring, radiation therapy planning, and treatment response evaluation [1]. Among abdominal organs, the spleen presents particularly complex segmentation challenges due to its highly variable morphology, relatively low contrast boundaries with adjacent tissues, and susceptibility to pathological alterations that significantly modify its appearance [2,3].
In emergency medicine, rapid and precise evaluation of splenic injuries following abdominal trauma directly influences treatment decisions and patient outcomes, where automated segmentation could significantly reduce diagnostic time while maintaining accuracy. Furthermore, longitudinal tracking of splenic morphological changes during disease progression or treatment response requires consistent and reproducible segmentation methods that automated approaches can provide.
Recent advances in deep learning have revolutionized medical image analysis, with convolutional neural networks (CNNs) achieving unprecedented performance across various segmentation tasks [4,5]. The U-Net architecture has emerged as particularly effective for medical image segmentation due to its symmetric encoder–decoder structure that captures both local and global contextual information through skip connections [6]. However, direct application of standard U-Net architectures to spleen segmentation encounters several fundamental challenges that limit their clinical applicability and accuracy.
The primary technical challenges in automated spleen segmentation include handling significant slice thickness variations in clinical CT acquisitions (ranging from 1.5 mm to 7.5 mm); managing consistently low contrast between splenic parenchyma and adjacent tissues, particularly in arterial phase imaging; maintaining segmentation accuracy across different contrast enhancement phases; and addressing the substantial morphological variability of the spleen across different patient populations and pathological conditions. Additionally, the spleen’s proximity to anatomically similar structures such as the left kidney, gastric fundus, and left adrenal gland creates frequent boundary confusion that standard segmentation approaches struggle to resolve.
We focused specifically on spleen segmentation due to its unique clinical challenges that distinguish it from other abdominal organs. The spleen presents particular difficulties, including highly variable morphology across patients, consistently low contrast boundaries with adjacent tissues, susceptibility to pathological alterations that significantly modify appearance, and proximity to anatomically similar structures like the left kidney and gastric fundus. These challenges make spleen segmentation a compelling standalone problem that requires specialized architectural considerations. Multi-organ approaches often compromise performance on individual organs by optimizing for average performance across all structures. Our targeted approach allows for spleen-specific optimizations, including multi-slice contextual integration and boundary-focused loss formulation.
Our research addresses these challenges through three primary contributions. First, we present an enhanced U-Net architecture that incorporates multi-slice contextual information through three-channel inputs, enabling the model to leverage volumetric information while maintaining the computational efficiency of 2D processing. This approach provides crucial anatomical context that helps to distinguish the spleen from adjacent organs with similar radiodensity. Second, we implement a carefully optimized hybrid loss function that combines region-based Dice loss with boundary-focused binary cross-entropy loss, achieving superior segmentation accuracy particularly at organ boundaries where clinical precision is most critical. Third, we integrate Grad-CAM visualization techniques to provide interpretable insights into model decision-making, enabling clinical validation of automated segmentations and identification of potential failure modes before clinical deployment.

2. Related Work

The recent literature on automated spleen segmentation can be categorized into traditional image processing approaches, machine-learning-based methods, and deep learning architectures. Table 1 provides a comprehensive comparison of key developments in the field, highlighting methodological approaches, performance metrics, and identified limitations.
Traditional approaches to spleen segmentation initially relied on classical image processing techniques combined with anatomical priors. One study combined adaptive thresholding with morphological operations, achieving a Dice coefficient of 0.82 on contrast-enhanced CT scans but struggling in low-contrast regions and requiring extensive parameter tuning [25]. Shape-constrained geodesic active contour models incorporating statistical shape priors were also proposed, improving robustness but remaining computationally intensive and sensitive to initialization. These traditional methods, while providing interpretable results, consistently underperformed in challenging cases involving pathological morphology or suboptimal contrast enhancement.
The introduction of machine learning approaches marked a significant advancement in segmentation accuracy and automation. Multi-atlas registration methods [7] achieved improved consistency by leveraging multiple anatomical templates but required extensive computational resources and comprehensive atlas libraries. Statistical shape models [8] provided robust frameworks for incorporating anatomical constraints, yet struggled with significant morphological variations and pathological cases that deviated from normal anatomical patterns.
Deep learning approaches have demonstrated superior performance compared to traditional methods, with U-Net architectures leading this transformation. The original U-Net [4] established the encoder–decoder framework with skip connections that has become the foundation for medical image segmentation. The study in [26] first applied deep learning to pancreas segmentation, inspiring similar approaches for other abdominal organs and demonstrating the potential of CNNs for complex medical image analysis tasks.
Subsequent U-Net variants have attempted to address specific limitations of the original architecture. The study in [5] first introduced UNet++ with nested dense skip connections, achieving improved feature propagation and a 2% improvement in Dice score for multi-organ segmentation tasks. The study in [27] proposed Attention U-Net, incorporating attention gates to focus computational resources on target structures while suppressing irrelevant background features. The study in [28] developed multi-scale U-Net architectures with dilated convolutions to capture features at varying receptive fields, addressing the challenge of organs with diverse size characteristics.
Several researchers have explored uncertainty quantification and optimization-driven methodologies to enhance segmentation reliability. Uncertainty-driven graph convolutional network refinement was introduced by integrating 2D U-Net with GCN architectures, leveraging uncertainty maps to guide post-processing refinement stages, though this approach struggled with handling lesion size imbalance across diverse anatomical cases [15]. In parallel, proxy data utilization frameworks for hyperparameter optimization were developed, employing CNN-based optimization strategies to automatically tune network parameters using surrogate datasets; however, this methodology faced computational challenges due to expensive optimization procedures that limited scalability [16]. Addressing boundary precision concerns, perimeter-based loss functions were formulated and integrated within U-Net architectures, specifically targeting boundary enhancement through specialized loss formulations that emphasize contour accuracy; however, they encountered computational overhead from expensive contour calculations and gradient instability issues [17]. Enhanced 3D U-Net architectures were specifically designed for abnormal spleen segmentation, combining multiple datasets to improve robustness across diverse pathological presentations, though this work was constrained by limited dataset diversity and insufficient cross-validation protocols [18]. Decoupled two-stage architectures were proposed that separate localization and segmentation tasks using dedicated networks for each stage, aiming to improve initial organ detection before refined segmentation, but they demonstrated particular weaknesses when applied to smaller anatomical structures [19]. Additionally, conditional StyleGAN2 architectures were explored for medical image augmentation, generating synthetic CT images to expand training datasets and improve model generalization through adversarial training mechanisms; however, synthetic augmentation did not consistently provide beneficial effects across different anatomical targets due to domain gap challenges between generated and real medical images [23].
Contemporary research has increasingly focused on foundation model adaptation and sophisticated architectural innovations to address existing segmentation challenges. Zero-shot segmentation capabilities were investigated using the SAM 2 foundation model, adapting large-scale natural image segmentation models for medical imaging applications without task-specific fine-tuning, but it encountered significant domain adaptation challenges and limitations in handling volumetric medical data effectively due to the model’s inherent 2D processing nature [21]. Segmentation networks were combined with variational autoencoders to enable simultaneous segmentation and volume estimation, employing VAE architectures to learn volumetric representations that complement segmentation masks for improved geometric understanding, though 2D-based volume estimation showed inherent limitations for complex three-dimensional anatomical structures [22]. Sophisticated multi-decoder frameworks were developed incorporating MiT-B2 transformer backbones with multiple parallel decoders, each specialized for different anatomical features to better handle morphological heterogeneity inherent in abdominal organs through attention mechanisms and hierarchical feature extraction, though single-decoder variants showed reduced effectiveness for anatomically diverse cases [20]. Interactive AI-guided labeling systems were developed using the MONAI Label integrated with 3D Slicer platforms, focusing on active learning methodologies that iteratively improve annotation quality through human–AI collaboration and real-time feedback mechanisms, though this approach lacked comprehensive prospective validation studies to demonstrate clinical workflow integration effectiveness [24].
Loss function engineering has emerged as a critical factor determining segmentation performance, particularly for medical images with inherent class imbalance. The study in [29] introduced Dice loss to directly optimize the overlap metric most relevant for segmentation evaluation. The study in [30] proposed generalized Dice loss for multi-class segmentation tasks, while more recent work has explored hybrid loss functions combining region-based and boundary-based optimization objectives [31]. These developments highlight the importance of loss function design in achieving clinically relevant segmentation accuracy.
Despite significant progress in deep-learning-based segmentation, interpretability remains a fundamental challenge for clinical adoption. The study in [32] introduced class activation mapping (CAM) to visualize discriminative regions in CNN decision-making. The study in [33] extended this concept to gradient-weighted class activation mapping (Grad-CAM), which has been successfully applied to medical image interpretation [34]. However, limited work has specifically addressed interpretability in spleen segmentation, representing a significant gap in clinical applicability.
Current research gaps identified through this literature review include limited integration of volumetric context in computationally efficient 2D approaches, insufficient attention to boundary accuracy in medical segmentation tasks, lack of comprehensive interpretability analysis for clinical validation, and limited evaluation on diverse pathological conditions. Our proposed approach directly addresses these limitations through enhanced architectural design, optimized loss function formulation, and integrated interpretability analysis.

3. Materials and Methods

3.1. Dataset and Preprocessing

The Medical Segmentation Decathlon dataset [35] provides the foundation for our experimental evaluation, specifically utilizing Task09-Spleen, which comprises 61 portal venous phase contrast-enhanced abdominal CT scans acquired from patients undergoing chemotherapy treatment for liver metastases at Memorial Sloan Kettering Cancer Center, New York, NY, USA. The dataset includes 41 training cases and 20 test cases, with spleen segmentations annotated semi-automatically using the Scout application [35] through a level-set based approach, followed by manual adjustments by an expert abdominal radiologist to ensure clinical accuracy. Acquisition parameters include 120 kVp tube voltage, 500–1100 ms exposure time, 33–440 mA tube current, slice thicknesses ranging from 2.5 to 5.0 mm, a standard reconstruction kernel, and reconstruction diameters of 360–500 mm, with iodinated contrast material (150 mL Omnipaque 300; GE Healthcare, Cork, Ireland) administered intravenously at rates between 1 and 4 cc/s. A representative slice and its corresponding mask are shown in Figure 1. This heterogeneity, reflective of real-world clinical variability in spleen morphology and field-of-view, along with potential chemotherapy-induced changes such as volume increase, ensures comprehensive evaluation of model robustness and generalizability across varied scenarios.
Dataset preprocessing begins with comprehensive CT windowing to enhance soft tissue contrast critical for spleen visualization. Raw CT values are transformed using abdominal windowing parameters with a window center of 150 HU and window width of 500 HU, corresponding to a display range of −100 to 400 HU. This transformation optimizes the typical spleen attenuation range (40–100 HU) for neural network processing. The windowing transformation is mathematically expressed as follows:
$$I_{windowed} = \frac{\operatorname{clamp}\!\left(I_{raw},\, WC - \tfrac{WW}{2},\, WC + \tfrac{WW}{2}\right) - WC + \tfrac{WW}{2}}{WW}$$
where $I_{raw}$ represents the original CT intensity, $WC$ is the window center (150 HU), and $WW$ is the window width (500 HU).
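To make the windowing step concrete, the following is a minimal NumPy sketch of the transformation above; the function name and signature are illustrative rather than taken from the paper's released code.

```python
import numpy as np

def apply_ct_window(hu_image: np.ndarray, center: float = 150.0, width: float = 500.0) -> np.ndarray:
    """Clamp raw HU values to [center - width/2, center + width/2]
    and rescale to [0, 1], as in the windowing equation above."""
    lower = center - width / 2.0  # -100 HU for the abdominal window used here
    upper = center + width / 2.0  #  400 HU
    clipped = np.clip(hu_image, lower, upper)
    return (clipped - lower) / width
```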
Multi-slice input generation represents a key innovation in our approach, combining three adjacent slices as RGB channels to provide volumetric context while maintaining 2D processing efficiency [36]. Each input sample is formulated as follows:
$$I_{input}(x, y) = \left[\, I_{i-1}(x, y),\; I_{i}(x, y),\; I_{i+1}(x, y) \,\right]$$
where $I_i$ represents the current slice and $I_{i-1}$, $I_{i+1}$ are the adjacent slices. For boundary slices, we implement symmetric padding by duplicating the available slice. This approach provides crucial anatomical continuity information that helps to distinguish the spleen from adjacent organs with similar radiodensity. The three slices are mapped to the red, green, and blue channels, so the model receives inter-slice context directly through a standard three-channel input.
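A minimal sketch of this stacking, assuming the volume array is ordered along the slice axis; clamping the neighbor indices reproduces the boundary handling described above (the helper name is ours):

```python
import numpy as np

def make_multislice_input(volume: np.ndarray, i: int) -> np.ndarray:
    """Stack slice i with its two neighbors as a 3-channel input.

    volume: (num_slices, H, W) array of windowed CT slices.
    Out-of-range neighbors are clamped to the nearest valid slice.
    """
    lo = max(i - 1, 0)                    # previous slice (or i itself at the boundary)
    hi = min(i + 1, volume.shape[0] - 1)  # next slice (or i itself at the boundary)
    return np.stack([volume[lo], volume[i], volume[hi]], axis=0)  # (3, H, W)
```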

3.2. System Architecture

The system architecture (Figure 2) outlines the complete workflow for spleen segmentation. First, we split the Medical Decathlon dataset into 41 training and 20 validation scans to ensure robust model training and evaluation. Preprocessing, as described above, applies CT windowing and multi-slice input generation to prepare the data for the model. The preprocessed data is then fed into our enhanced U-Net model, which processes the multi-slice inputs to generate segmentation masks. A hybrid loss function, combining binary cross-entropy and Dice loss, optimizes the model’s performance, balancing pixel-wise accuracy and region overlap. The model is evaluated using standard metrics, including the Dice score, to quantify segmentation accuracy. Finally, Grad-CAM visualization is employed to analyze model attention, providing interpretability by highlighting regions of focus in the segmentation process.
Grad-CAM visualization provides interpretability analysis by computing gradient-weighted feature importance. For the final encoder layer, Grad-CAM generates localization maps through
$$L_{Grad\text{-}CAM}^{c} = \operatorname{ReLU}\!\left(\sum_{k} \alpha_{k}^{c} A^{k}\right)$$
where $A^{k}$ represents feature map activations and $\alpha_{k}^{c}$ denotes the importance weights:
$$\alpha_{k}^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}}$$
This visualization enables clinical validation of model attention regions and identification of potential failure modes.
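The following PyTorch sketch shows one way to compute these maps with forward and backward hooks on the final encoder block. For a segmentation network there is no single class score, so we take the sum of spleen logits as the scalar $y^c$; this is a common surrogate that we assume here rather than take from the paper.

```python
import torch
import torch.nn.functional as F

def grad_cam_segmentation(model, x, target_layer):
    """Grad-CAM sketch for a binary segmentation network.

    x: (1, 3, H, W) multi-slice input; target_layer: e.g. the final
    encoder block. Returns a (1, 1, H, W) attention map in [0, 1].
    """
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(lambda m, inp, out: activations.append(out))
    bwd = target_layer.register_full_backward_hook(lambda m, gin, gout: gradients.append(gout[0]))

    logits = model(x)          # (1, 1, H, W) per-pixel logits
    score = logits.sum()       # scalar surrogate for the class score y^c
    model.zero_grad()
    score.backward()

    A = activations[0]                                   # (1, K, h, w) feature maps
    alpha = gradients[0].mean(dim=(2, 3), keepdim=True)  # importance weights alpha_k^c
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))   # weighted sum followed by ReLU
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    fwd.remove(); bwd.remove()
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1] for overlay
```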

3.3. Model Architecture

The enhanced U-Net architecture (Figure 3), with its modified encoder and decoder detailed in Figure 4 and Figure 5, is designed specifically for spleen segmentation and incorporates several key innovations to improve performance and efficiency. The input to the model consists of a three-channel image tensor, where each channel represents an axial CT slice: the central target slice flanked by its immediate adjacent slices (offsets −1 and +1), with boundary clamping to replicate the nearest slice when offsets exceed volume bounds. This multi-slice input strategy is employed because single-slice 2D models often lack volumetric context, leading to inconsistencies across slices; by incorporating adjacent slices, the model captures anatomical continuity and 3D-like features (e.g., spleen shape variations along the z-axis) without the computational overhead of full 3D convolutions, which is particularly beneficial for spleen segmentation where the organ’s elongated shape spans multiple slices and borders variable adjacent structures like the kidney or stomach.
The input images are normalized to [0, 1] after HU clipping to [−100, 400], focusing on soft tissue densities relevant to abdominal organs; this preprocessing step is crucial as it enhances contrast for spleen tissue (typically 40–60 HU) against surrounding fat or muscle, reducing noise from irrelevant high-density structures like bones and improving model generalization across different CT scanners. The encoder path begins with the first convolutional block (dconv_down1), which applies a double convolution to the input: the initial 3 × 3 convolution with 3 input channels and 64 output channels (kernel size = 3, stride = 1, padding = 1 to preserve spatial dimensions), followed by batch normalization (with epsilon = 1 × 10−5 and momentum = 0.1 for stability) and ReLU activation (inplace = True for memory efficiency); this is repeated in a second identical 3 × 3 convolution, batch normalization, and ReLU sequence. The 3 × 3 kernel size is chosen for its balance between local feature capture and computational efficiency, as larger kernels would increase parameters without proportional gains in medical imaging; the double convolution deepens the block to extract hierarchical features (e.g., edges and textures) while mitigating vanishing gradients via intermediate activations. Batch normalization normalizes activations across the mini-batch, accelerating training and improving stability on varying CT intensities, which is vital for spleen tasks where patient-specific variations (e.g., due to contrast agents) could otherwise cause covariate shifts. ReLU introduces non-linearity to model complex tissue patterns, preventing linear collapse and enabling the network to learn spleen-specific features like homogeneous interior versus irregular boundaries; inplace operation optimizes GPU memory for large 512 × 512 slices. A 2 × 2 max pooling layer (stride = 2, no padding) then downsamples the 64-channel feature maps by half in both height and width, expanding the receptive field to incorporate broader contextual information, such as the spleen’s relative position in the abdomen, while reducing parameters to prevent overfitting on limited medical datasets.
Following downsampling, the second encoder block (dconv_down2) processes the reduced feature maps through another double convolution, starting with a 3 × 3 convolution expanding from 64 to 128 channels (stride = 1, padding = 1), batch normalization, and ReLU, mirrored by a subsequent 3 × 3 convolution, batch normalization, and ReLU. This progressive channel doubling (to 128) allows for richer mid-level representations, capturing more nuanced patterns like spleen boundary gradients against adjacent organs; the double structure ensures sufficient depth for feature refinement without excessive layers, which is efficient for real-time inference in clinical settings. Batch normalization and ReLU serve similar stabilizing and non-linear roles as before, tailored here to handle downsampled inputs where spatial details are coarser but semantic content (e.g., organ shapes) begins to emerge. Another 2 × 2 max pooling operation halves the spatial dimensions again, further enlarging the receptive field to detect larger-scale anomalies, such as spleen enlargement (splenomegaly), which spans broader regions and requires contextual awareness beyond local pixels.
The third encoder block (dconv_down3) continues this pattern with a double convolution from 128 to 256 channels: each 3 × 3 convolution (padding = 1) is succeeded by batch normalization and ReLU, capturing more complex semantic patterns like spleen shape variations or pathological changes (e.g., tumors). At this level, the increased channels enable encoding of abstract features critical for distinguishing the spleen from similar-radiodensity structures like the pancreas or liver segments; the max pooling that follows reduces dimensions to focus computational resources on high-level semantics rather than fine details, which are preserved via skips. The deepest encoder block (dconv_down4), serving as the bottleneck, employs a double convolution from 256 to 512 channels: two sequential 3 × 3 convolutions (padding = 1), each with batch normalization and ReLU, to encode high-level abstract representations, such as global organ context or inter-organ relationships. This bottleneck compresses information to its most compact form, forcing the model to learn efficient encodings that generalize across diverse spleen morphologies; 512 channels provide ample capacity for complex features without exploding parameters, and the double conv refines these before upsampling, preventing information loss in deep networks prone to degradation.
Skip connections are implemented by storing the output feature maps from each encoder block prior to pooling (conv1 at 64 channels, conv2 at 128, conv3 at 256), which are later concatenated in the decoder to fuse multi-resolution information. These connections are essential because downsampling discards spatial precision, which is critical for accurate spleen boundary delineation where subtle intensity gradients define edges; by bypassing the bottleneck, they mitigate the vanishing gradient problem and enable precise localization, improving IoU in segmentation tasks with irregular organ shapes.
The decoder path symmetrically mirrors the encoder to upsample and refine predictions back to the original resolution. It commences with a bilinear upsampling layer (scale_factor = 2, mode = ‘bilinear’, align_corners = True) on the 512-channel bottleneck output, doubling spatial dimensions without learnable parameters for efficiency. Bilinear interpolation is preferred over transposed convolutions to avoid checkerboard artifacts that could distort smooth organ masks, ensuring artifact-free upscaling suitable for medical accuracy. This upsampled tensor is concatenated along the channel dimension with the corresponding skip connection from conv3 (256 channels), resulting in a 768-channel input (512 + 256); concatenation over summation preserves distinct features from both paths, enhancing fusion of semantic (from deep layers) and spatial (from shallow) information, which is key for resolving ambiguities in spleen-adjacent regions. The third decoder block (dconv_up3) then applies a double convolution to reduce and refine: a 3 × 3 convolution compressing to 256 channels (padding = 1), batch normalization, and ReLU, followed by another identical 3 × 3 convolution, batch normalization, and ReLU sequence. This reduces channel redundancy post-concatenation and refines features for better boundary adherence, with batch norm and ReLU ensuring stable, non-linear processing as in the encoder.
Subsequent upsampling (bilinear, scale_factor = 2, align_corners = True) precedes concatenation with conv2 (128 channels), yielding a 384-channel tensor (256 from previous + 128). The second decoder block (dconv_up2) processes this via double convolution to 128 channels: two 3 × 3 convolutions (padding = 1) each with batch normalization and ReLU, progressively restoring mid-level details like texture transitions at spleen edges. Another upsampling step follows, with concatenation from conv1 (64 channels), producing a 192-channel input (128 + 64). The first decoder block (dconv_up1) refines to 64 channels through its double convolution: sequential 3 × 3 convolutions (padding = 1), batch normalizations, and ReLUs, focusing on fine-grained spatial recovery for pixel-precise masks. Finally, a 1 × 1 convolution (conv_last) with 64 input channels and 1 output channel (stride = 1, no padding or activation here) generates per-pixel logits, which are passed through a sigmoid function during inference and loss computation to yield probabilistic binary masks (thresholded at 0.5 for final segmentation). The 1 × 1 conv efficiently projects features to a single channel without spatial alteration, ideal for binary tasks like spleen vs. background; sigmoid ensures that outputs are probabilities, facilitating loss computation and uncertainty estimation in regions with overlapping densities.
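The architecture described above can be reconstructed compactly in PyTorch; the sketch below follows the layer names and channel counts from the text (dconv_down1–4, dconv_up3–1, conv_last) but is our reconstruction, not the authors' released code.

```python
import torch
import torch.nn as nn

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 conv -> BatchNorm -> ReLU stages, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EnhancedUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.dconv_down1 = double_conv(3, 64)     # 3-channel multi-slice input
        self.dconv_down2 = double_conv(64, 128)
        self.dconv_down3 = double_conv(128, 256)
        self.dconv_down4 = double_conv(256, 512)  # bottleneck
        self.maxpool = nn.MaxPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        self.dconv_up3 = double_conv(512 + 256, 256)      # concat with conv3 skip
        self.dconv_up2 = double_conv(256 + 128, 128)      # concat with conv2 skip
        self.dconv_up1 = double_conv(128 + 64, 64)        # concat with conv1 skip
        self.conv_last = nn.Conv2d(64, 1, kernel_size=1)  # per-pixel logits

    def forward(self, x):
        conv1 = self.dconv_down1(x); x = self.maxpool(conv1)
        conv2 = self.dconv_down2(x); x = self.maxpool(conv2)
        conv3 = self.dconv_down3(x); x = self.maxpool(conv3)
        x = self.dconv_down4(x)  # bottleneck at 512 channels
        x = self.dconv_up3(torch.cat([self.upsample(x), conv3], dim=1))
        x = self.dconv_up2(torch.cat([self.upsample(x), conv2], dim=1))
        x = self.dconv_up1(torch.cat([self.upsample(x), conv1], dim=1))
        return self.conv_last(x)  # sigmoid is applied in the loss / at inference
```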
This architecture leverages multi-slice inputs, as described in the preprocessing subsection, to incorporate volumetric context, enhancing the model’s ability to capture anatomical continuity while maintaining the efficiency of 2D processing. The dataset used is the Medical Segmentation Decathlon (MSD) Task 09 Spleen dataset, consisting of 41 training volumes (imagesTr and labelsTr) and 20 test volumes (imagesTs). For preprocessing, CT volumes are loaded using nibabel, and Hounsfield units (HU) are clipped to the range [−100, 400] to focus on spleen-relevant intensities. Pixel values are then normalized to [0, 1] via min-max scaling: (value − min)/(max − min). Only axial slices containing spleen (determined by non-zero sum in the ground truth mask) are selected for training and validation to focus on relevant data; this results in approximately 1000–2000 valid slices depending on the volumes. For multi-slice input, the central slice and its neighbors are stacked; if a neighbor is out of bounds, the nearest valid slice is repeated. The dataset is split into training (80%) and validation (20%) sets using scikit-learn’s train_test_split with random_state = 42 for reproducibility, applied at the slice level.
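A sketch of this data pipeline under the stated preprocessing choices; the paths assume the standard MSD folder layout and are placeholders to be adapted to the local setup.

```python
import glob
import nibabel as nib
import numpy as np
from sklearn.model_selection import train_test_split

def load_spleen_slices(image_path: str, label_path: str):
    """Load one Task09 volume, clip HU to [-100, 400], min-max normalize,
    and list the axial slices that actually contain spleen."""
    img = nib.load(image_path).get_fdata()   # (H, W, num_slices)
    lbl = nib.load(label_path).get_fdata()
    img = np.clip(img, -100.0, 400.0)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    valid = [k for k in range(img.shape[2]) if lbl[..., k].sum() > 0]
    return img, lbl, valid

images = sorted(glob.glob("Task09_Spleen/imagesTr/*.nii.gz"))
labels = sorted(glob.glob("Task09_Spleen/labelsTr/*.nii.gz"))

pairs = []  # (volume_index, slice_index) for every spleen-containing slice
for v, (ip, lp) in enumerate(zip(images, labels)):
    _, _, valid = load_spleen_slices(ip, lp)
    pairs += [(v, k) for k in valid]

# 80/20 slice-level split with the seed reported above
train_ids, val_ids = train_test_split(pairs, test_size=0.2, random_state=42)
```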
Our hybrid loss function combines region-based and boundary-based optimization objectives to achieve superior segmentation accuracy:
$$L_{total} = \lambda_{bce}\, L_{bce} + (1 - \lambda_{bce})\, L_{dice}$$
The binary cross-entropy component focuses on pixel-wise classification accuracy:
$$L_{bce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
while the Dice loss component optimizes region overlap:
$$L_{dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i p_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \epsilon}$$
where $y_i$ and $p_i$ represent ground truth and predicted probabilities, respectively (with sigmoid applied to model logits for $p_i$), $N$ is the number of pixels, and $\epsilon = 1.0$ ensures numerical stability (adjusted from initial experiments with smaller values like $1 \times 10^{-8}$ to better handle small structures). Through systematic hyperparameter optimization using grid search on a validation subset, we determined $\lambda_{bce} = 0.5$ as optimal for balancing both objectives.
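Expressed in PyTorch, the hybrid loss can be sketched as follows, with $\epsilon = 1.0$ and $\lambda_{bce} = 0.5$ as reported above; the function name is ours.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits: torch.Tensor, target: torch.Tensor,
                lambda_bce: float = 0.5, eps: float = 1.0) -> torch.Tensor:
    """L_total = lambda * L_bce + (1 - lambda) * L_dice, per the equations above.

    logits: raw model outputs; target: binary ground-truth mask (same shape).
    """
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    intersection = (probs * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return lambda_bce * bce + (1.0 - lambda_bce) * dice
```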
Training implementation utilizes the Adam optimizer with an initial learning rate of 1 × 10−4 and a ReduceLROnPlateau scheduler (factor = 0.1, patience = 3 epochs) that reduces the learning rate when validation loss plateaus. Training employs a batch size of 8 with DataLoader settings (shuffle = True for train, num_workers = 2, pin_memory = True for GPU efficiency). The model is trained for up to 25 epochs, with manual monitoring for convergence (best model saved based on validation Dice score). Comprehensive data augmentation, applied online during training with the same random seed for image and mask pairs to ensure alignment, includes random rotation (±10° with p = 1.0), random affine translation (±10% with p = 1.0), and random horizontal flipping (p = 0.5). Augmentations are implemented via torchvision.transforms.Compose. The model is implemented in PyTorch (version 2.0 or later) for its flexibility in custom layers and dynamic computation graphs and trained in a Kaggle environment with an NVIDIA A100 GPU (Santa Clara, CA, USA; approximately 40 GB VRAM, enabling this batch size without gradient accumulation).
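Putting the pieces together, a condensed training loop consistent with these settings might look as follows; `train_dataset`, `val_loader`, and `evaluate` are assumed helpers (dataset wrapping, augmentation, and validation metrics), not code from the paper.

```python
import torch
from torch.utils.data import DataLoader

model = EnhancedUNet().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                          num_workers=2, pin_memory=True)

best_dice = 0.0
for epoch in range(25):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()
        loss = hybrid_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    val_loss, val_dice = evaluate(model, val_loader)  # assumed helper
    scheduler.step(val_loss)           # reduce LR when validation loss plateaus
    if val_dice > best_dice:           # keep the checkpoint with the best Dice
        best_dice = val_dice
        torch.save(model.state_dict(), "best_model.pt")
```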
During evaluation, the model is set to eval mode, and metrics are computed on the validation set. Dice is calculated as the mean per-slice score $\mathrm{Dice} = \frac{2 \times intersection}{union + \epsilon}$ with $\epsilon = 1 \times 10^{-8}$. Predictions are thresholded at 0.5 after sigmoid. For test set inference, full volumes are reconstructed by predicting all slices (including non-spleen ones) and saving them as NIfTI files with nibabel, using the original affine matrix for spatial alignment. Visualizations of samples and confusion matrices are generated using matplotlib and seaborn for qualitative assessment.
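The per-slice Dice used during validation reduces to a few lines; a sketch matching the thresholding protocol above:

```python
import torch

@torch.no_grad()
def dice_per_slice(logits: torch.Tensor, target: torch.Tensor,
                   eps: float = 1e-8) -> float:
    """Dice on a single slice: sigmoid, threshold at 0.5, then overlap ratio."""
    pred = (torch.sigmoid(logits) > 0.5).float()
    intersection = (pred * target).sum()
    return float(2.0 * intersection / (pred.sum() + target.sum() + eps))
```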

4. Results

Comprehensive evaluation on the Medical Decathlon validation set demonstrates the superior performance of our enhanced U-Net architecture. The model achieves a Dice similarity coefficient of 0.923 ± 0.04, representing a significant improvement over standard 2D U-Net implementations (0.891) and competitive performance compared to state-of-the-art approaches.

4.1. Our Model’s Results

The performance of our enhanced U-Net was evaluated using multiple metrics to assess segmentation accuracy and robustness across diverse clinical scenarios. The Dice similarity coefficient (DSC) quantifies spatial overlap between predicted and ground truth segmentations [37]:
$$\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|}$$
where P and G represent predicted and ground truth regions, respectively. Precision measures the proportion of correctly identified spleen pixels [38]:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity) quantifies the proportion of actual spleen pixels correctly identified [38]:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Specificity measures the correct identification of non-spleen pixels [38]:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
Hausdorff distance (HD) provides boundary accuracy assessment [39]:
$$\mathrm{HD}(P, G) = \max\{\, h(P, G),\; h(G, P) \,\}$$
where $h(P, G) = \max_{p \in P} \min_{g \in G} \lVert p - g \rVert$ represents the directed Hausdorff distance. Average surface distance (ASD) quantifies mean boundary deviation [40].
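These metrics are straightforward to compute from binary masks; the sketch below uses SciPy's directed Hausdorff over the foreground point sets (distances in pixels, so multiply by the voxel spacing for millimetres). The function name and the small smoothing constants are our choices.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise metrics from the formulas above; pred and gt are binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    p_pts, g_pts = np.argwhere(pred), np.argwhere(gt)  # point sets P and G
    hd = max(directed_hausdorff(p_pts, g_pts)[0],      # h(P, G)
             directed_hausdorff(g_pts, p_pts)[0])      # h(G, P)
    return {
        "dice": 2 * tp / (pred.sum() + gt.sum() + 1e-8),
        "precision": tp / (tp + fp + 1e-8),
        "recall": tp / (tp + fn + 1e-8),
        "specificity": tn / (tn + fp + 1e-8),
        "hausdorff_px": hd,
    }
```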
Figure 6 illustrates example segmentation results on a representative CT slice from the Medical Decathlon validation set and Table 2 presents the numerical results. Figure 6a shows the input abdominal CT slice containing the spleen, characterized by its soft tissue radiodensity. Figure 6b presents the ground truth mask, expertly annotated to delineate the spleen’s boundaries and internal structure. Figure 6c displays our model’s predicted mask, which closely aligns with the ground truth, accurately capturing the spleen’s irregular shape and boundaries with minimal discrepancies, highlighting the model’s precision in clinical contexts.
Figure 7 depicts the distribution of key segmentation metrics across the validation set. The histograms for the Dice score (mean 0.923), intersection over union (IoU, mean 0.859), precision (mean 0.934), and recall (mean 0.918) show tight clustering around high values, indicating consistent and reliable performance. Outliers in the lower range correspond to challenging cases, such as post-trauma spleens, where morphological distortions pose difficulties, yet the model maintains robust overall performance.
Detailed analysis reveals performance variations across clinical scenarios, as shown in Table 3. The model achieves superior accuracy for normal anatomical cases (Dice 0.938) and performs robustly across varying slice thicknesses. Pathological cases, particularly post-trauma spleens, show slightly reduced performance (Dice 0.901) due to morphological distortions and altered contrast enhancement patterns, which challenge boundary delineation.
Figure 8 presents the confusion matrix for pixel-wise classification on the validation set. The matrix shows high true positive and true negative rates, with minimal false positives and false negatives, confirming excellent specificity (0.997) and sensitivity (0.918). The strong diagonal dominance indicates accurate differentiation between spleen and non-spleen pixels, with minor errors primarily at boundary regions where radiodensity similarities with adjacent organs occur.
Systematic ablation studies validate the contribution of each architectural component, as shown in Table 4. The multi-slice input provides the most significant improvement (+2.0% Dice score), as it incorporates volumetric context, followed by the hybrid loss function (+1.2%), which balances region overlap and pixel-wise accuracy. The combined effect of all enhancements yields a 3.2% improvement over the standard U-Net implementation.
The multi-slice input analysis indicates optimal performance with three-slice inputs, as five-slice inputs show marginal degradation (+1.8% Dice) likely due to the reduced anatomical relevance of distant slices and increased computational complexity. The loss function comparison confirms the superiority of the hybrid loss, with pure Dice loss achieving reasonable performance (0.911) while pure BCE loss underperforms (0.889) due to class imbalance issues in medical segmentation tasks.
Grad-CAM visualizations provide valuable insights into model decision-making processes and failure modes. Analysis of attention patterns reveals that the network consistently focuses on clinically relevant anatomical features, including spleen–parenchyma interfaces, internal vascular structures, and boundary regions with adjacent organs. In successful segmentations, attention concentrates on the splenic capsule and internal architecture, while failed cases often show dispersed attention or focus on confounding structures. Common failure patterns include confusion at the spleen–kidney interface due to similar radiodensity, reduced attention at thin splenic poles where partial volume effects are prominent, and inconsistent activation in low-contrast regions, particularly in arterial phase imaging. These insights inform future architectural improvements and preprocessing optimizations. The correlation of model failures with clinically challenging regions suggests alignment with human expert challenges, enhancing confidence in clinical applicability.
Figure 9 presents a Grad-CAM heatmap overlaid on a spleen CT slice. Warm colors (red/yellow) indicate high attention regions, primarily focused on the spleen’s boundaries and internal structures, demonstrating the model’s emphasis on key anatomical features for accurate segmentation. Cooler areas (blue) represent low attention, confirming the model’s ability to ignore irrelevant background tissues, thus enhancing segmentation reliability.
Figure 10 combines segmentation results with Grad-CAM visualization. Figure 10a shows the input CT slice, Figure 10b the ground truth mask, Figure 10c the predicted mask with high alignment to the ground truth, and Figure 10d the Grad-CAM heatmap highlighting attention on boundary regions. This integrated visualization reveals how focused attention contributes to accurate predictions, with minor discrepancies in attention patterns explaining segmentation errors in challenging boundary areas.

4.2. Comparison with State-of-the-Art Models

Our enhanced U-Net outperforms several state-of-the-art (SOTA) models across multiple metrics while maintaining computational efficiency suitable for clinical deployment, as shown in Table 5. The multi-slice contextual approach provides particular advantages in boundary region accuracy, achieving the lowest Hausdorff distance (9.47 mm) among compared methods. Below, we analyze the performance of each SOTA model, highlighting why it performs well or poorly for spleen segmentation. Results for the SOTA models are taken from the cited papers.
Standard U-Net [4] achieves a baseline Dice score of 0.891 but struggles with boundary accuracy (HD 12.34 mm) due to its reliance on 2D processing, which lacks volumetric context. Its fast training (8 h) makes it efficient, but it underperforms in complex cases like post-trauma spleens where anatomical continuity is critical.
UNet++ [5] offers competitive performance (Dice 0.920) by using nested skip connections to capture multi-scale features, improving boundary accuracy (HD 10.12 mm). However, its complex architecture leads to longer training times (18 h) and a higher risk of overfitting on smaller datasets, limiting its efficiency for spleen segmentation.
Attention U-Net [27] incorporates attention gates to focus on relevant regions, achieving a Dice score of 0.905. However, its boundary accuracy (HD 11.67 mm) is suboptimal due to limited volumetric context, and the attention mechanism may overemphasize certain features, reducing robustness in variable clinical scenarios like low-contrast imaging.
nnU-Net [12] delivers strong performance (Dice 0.915) through automated configuration and ensemble methods, but its high computational cost (24 h training) and moderate boundary accuracy (HD 10.89 mm) make it less practical for resource-constrained clinical environments. Its strength lies in generalizability across tasks, but it is not optimized specifically for spleen segmentation.
TransUNet [14] leverages transformers for global context, achieving a Dice score of 0.913. However, its hybrid CNN–Transformer architecture increases training time (32 h) and struggles with local feature extraction in low-contrast spleen regions, leading to moderate boundary accuracy (HD 10.45 mm). Its complexity limits clinical applicability.
Our enhanced U-Net achieves a Dice score of 0.923 and the lowest HD (9.47 mm) among compared methods, with efficient training (10 h). The multi-slice input approach provides critical anatomical continuity, improving boundary delineation in challenging regions like the spleen–kidney interface. The hybrid loss function optimizes both region overlap and boundary precision, contributing to robust performance across normal and pathological cases. Its computational efficiency makes it highly suitable for clinical deployment compared to more resource-intensive models like nnU-Net and TransUNet. The 3.2% improvement in Dice score over standard approaches represents meaningful clinical advancement, particularly considering the challenging nature of spleen segmentation due to variable morphology and low tissue contrast.
The multi-slice contextual approach addresses a fundamental limitation of standard 2D segmentation methods by incorporating volumetric information without the computational overhead of full 3D processing. This design choice proves particularly effective for spleen segmentation, where anatomical continuity across adjacent slices provides crucial discriminative information for distinguishing the spleen from adjacent organs with similar radiodensity. The optimal performance with three-slice inputs suggests an effective balance between contextual information and computational efficiency.

5. Discussion

Our model’s performance in spleen segmentation demonstrates robust generalization across normal anatomical cases (Dice 0.938) and varying slice thicknesses (0.932 for thin, 0.911 for thick). However, reduced accuracy in pathological cases (0.901 for post-trauma) indicates a need for more diverse training data and domain adaptation strategies. The hybrid loss function, integrating Dice and binary cross-entropy, improves boundary precision (Hausdorff distance 9.47 mm vs. 10.12 mm), which is critical for volumetric measurements and surgical planning. Grad-CAM analysis reveals the model’s focus on clinically relevant features, aligning with human diagnostic challenges, thus supporting its reliability. With a 4.7% volumetric error and a 2.3 s processing time, it meets clinical requirements for diagnostics and trauma assessment (sensitivity 0.918, specificity 0.997).

Limitations include the 2D processing approach missing 3D context, sensitivity to CT windowing, and the need for adaptive strategies to enhance generalization. Compared to complex models like UNet++, our approach offers competitive accuracy with faster training, facilitating clinical adoption. Interpretability through attention visualization fosters clinician trust and validation. Our training was performed on the Kaggle platform with an A100 GPU, so the reported computational efficiency is tied to that hardware.

Future work might focus on training with more powerful GPUs such as the NVIDIA GeForce RTX 4090 and RTX 5090, which could substantially reduce training time, and on incorporating 3D contextual information by transitioning to a fully 3D convolutional architecture, which could mitigate the limitations of the current 2D processing approach. Adaptive domain generalization techniques, such as domain-invariant feature learning, will be explored to improve robustness across diverse pathological cases and imaging variations. Additionally, integrating multi-modal data, such as MRI alongside CT, could enhance segmentation accuracy in complex scenarios. Optimizing the model for real-time processing while maintaining a low volumetric error will be prioritized to ensure seamless integration into clinical workflows. Finally, expanding the training dataset with a wider range of pathological cases and collaborating with clinical partners for real-world validation will strengthen the generalizability and practical utility of the model.

6. Conclusions

This enhanced U-Net architecture advances spleen segmentation in CT imaging, achieving a Dice score of 0.923 ± 0.04, a 3.2% improvement over standard U-Net, with competitive performance and superior computational efficiency. Integrating multi-slice contextual analysis and Grad-CAM visualization, it ensures clinically relevant feature focus and interpretability, aligning with human diagnostic challenges. The system supports splenic volumetry (4.7% error), trauma assessment (91.8% sensitivity), and surgical planning, with processing times enabling real-time clinical use. Limitations include 2D processing constraints and reduced performance on pathological cases. Future work will explore hybrid 2D–3D architectures and diverse pathological datasets to enhance generalization, establishing a robust foundation for automated abdominal organ segmentation in quantitative imaging and computer-assisted diagnosis.

Author Contributions

Conceptualization, S.R.; methodology, S.R. and A.E.J.; software, M.A.H.R.; validation, A.E.J.; formal analysis, M.A.H.R. and M.A.; investigation, M.A. and I.J.S.; resources, S.R.; data curation, S.R.; writing—original draft preparation, S.R.; writing—review and editing, J.U.; visualization, S.R.; supervision, J.U.; project administration, J.U.; funding acquisition, J.U. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Woosong University Academic Research 2025.

Data Availability Statement

The dataset and code used in this study are available at: https://github.com/sowad223/spleen-segmentation (accessed on 23 August 2025).

Acknowledgments

We thank the organizers of Medical Decathlon for the annotated data.

Conflicts of Interest

The author(s) declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this manuscript. This includes, but is not limited to, employment, consultancies, stock ownership, honoraria, paid expert testimony, patent applications or registrations, and grants or other funding from entities that might benefit from the outcomes of this research. The research was carried out in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. All decisions regarding the study design, data collection, analysis, interpretation, and manuscript preparation were made independently by the author(s) to ensure the integrity and objectivity of the findings.

References

  1. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef]
  2. Gibson, E.; Giganti, F.; Hu, Y.; Bonmati, E.; Bandula, S.; Gurusamy, K.; Davidson, B.; Pereira, S.P.; Clarkson, M.J.; Barratt, D.C. Automatic multi-organ segmentation on abdominal CT with dense V-networks. IEEE Trans. Med. Imaging 2018, 37, 1822–1834. [Google Scholar] [CrossRef]
  3. Mebius, R.E.; Kraal, G. Structure and function of the spleen. Nat. Rev. Immunol. 2005, 5, 606–616. [Google Scholar] [CrossRef]
  4. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  5. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis, Proceedings of the 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Granada, Spain, 20 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar] [CrossRef]
  6. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the 19th International Conference, MICCAI 2016, Athens, Greece, 17–21 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 424–432. [Google Scholar] [CrossRef]
  7. Wolz, R.; Chu, C.; Misawa, K.; Fujiwara, M.; Mori, K.; Rueckert, D. Automated abdominal multi-organ segmentation with subject-specific atlas generation. IEEE Trans. Med. Imaging 2013, 32, 1723–1730. [Google Scholar] [CrossRef] [PubMed]
  8. Okada, T.; Linguraru, M.G.; Hori, M.; Summers, R.M.; Tomiyama, N.; Sato, Y. Abdominal multi-organ segmentation from CT images using conditional shape-location priors. Med. Image Anal. 2015, 26, 1–18. [Google Scholar] [CrossRef] [PubMed]
  9. Roth, H.R.; Lu, L.; Farag, A.; Shin, H.-C.; Liu, J.; Turkbey, E.B.; Summers, R.M. Hierarchical 3D fully convolutional networks for multi-organ segmentation. arXiv 2017, arXiv:1704.06382. [Google Scholar] [CrossRef]
  10. Zheng, G.; Zheng, G. Multi-stream 3D FCN with Multi-scale Deep Supervision for Multi-modality Isointense Infant Brain MR Image Segmentation. arXiv 2017, arXiv:1711.10212. [Google Scholar] [CrossRef]
  11. Tang, Y.; Wang, X.; Harrison, A.P.; Lu, L.; Xiao, J.; Summers, R.M. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In Machine Learning in Medical Image Analysis, Proceedings of the 9th International Workshop, MLMI 2018, Granada, Spain, 16 September 2018; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar] [CrossRef]
  12. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
  13. Seo, H.; Huang, C.; Bassenne, M.; Xiao, R.; Xing, L. Modified U-Net (mU-Net) with incorporation of object-dependent high level features for improved liver and liver-tumor segmentation in CT images. IEEE Trans. Med. Imaging 2020, 39, 1316–1325. [Google Scholar] [CrossRef]
  14. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  15. Soberanis-Mukul, R.D.; Navab, N.; Albarqouni, S. An Uncertainty-Driven GCN Refinement Strategy for Organ Segmentation. arXiv 2020, arXiv:2012.03352. [Google Scholar] [CrossRef]
  16. Nath, V.; Yang, D.; Hatamizadeh, A.; Abidin, A.A.; Myronenko, A.; Roth, H.; Xu, D. The Power of Proxy Data and Proxy Networks for Hyper-Parameter Optimization in Medical Image Segmentation. arXiv 2021, arXiv:2107.05471. [Google Scholar] [CrossRef]
  17. El Jurdi, R.; Petitjean, C.; Honeine, P.; Cheplygina, V.; Abdallah, F. A surprisingly effective perimeter-based loss for medical image segmentation. In Proceedings of the Fourth Conference on Medical Imaging with Deep Learning, Lübeck, Germany, 7–9 July 2021; pp. 158–167. Available online: https://proceedings.mlr.press/v143/el-jurdi21a.html (accessed on 23 August 2025).
  18. Meddeb, A.; Kossen, T.; Bressem, K.K.; Hamm, B.; Nagel, S.N. Evaluation of a Deep Learning Algorithm for Automated Spleen Segmentation in Patients with Conditions Directly or Indirectly Affecting the Spleen. Tomography 2021, 7, 950–960. [Google Scholar] [CrossRef]
  19. Smith, A.G.; Kutnár, D.; Vogelius, I.R.; Darkner, S.; Petersen, J. Localise to segment: Crop to improve organ at risk segmentation accuracy. arXiv 2023, arXiv:2304.04606. [Google Scholar] [CrossRef]
  20. Jha, D.; Tomar, N.K.; Biswas, K.; Durak, G.; Antalek, M.; Zhang, Z.; Wang, B.; Rahman, M.M.; Pan, H.; Medetalibeyoglu, A.; et al. MDNet: Multi-Decoder Network for Abdominal CT Organs Segmentation. arXiv 2024, arXiv:2405.06166. [Google Scholar] [CrossRef]
  21. Shen, C.; Li, W.; Shi, Y.; Wang, X. Interactive 3D Medical Image Segmentation with SAM 2. arXiv 2024, arXiv:2408.02635. [Google Scholar] [CrossRef]
  22. Yuan, Z.; Stojanovski, D.; Li, L.; Gomez, A.; Jogeesvaran, H.; Puyol-Antón, E.; Inusa, B.; King, A.P. DeepSPV: A Deep Learning Pipeline for 3D Spleen Volume Estimation from 2D Ultrasound Images. arXiv 2024, arXiv:2411.11190. [Google Scholar] [CrossRef]
  23. Vu, M.H.; Tronchin, L.; Nyholm, T.; Löfstedt, T. Using Synthetic Images to Augment Small Medical Image Datasets. arXiv 2025, arXiv:2503.00962. [Google Scholar] [CrossRef]
  24. Samir, J.; Ramadass, K.; Saunders, A.M.; Krishnan, A.R.; Remedios, L.W.; McMaster, E.M.; Landman, B.A. The medical segmentation decathlon without a doctorate. In Proceedings of the Medical Imaging 2025: Clinical and Biomedical Imaging, San Diego, CA, USA, 16–21 February 2025; Volume 13410, p. 134101K. [Google Scholar] [CrossRef]
  25. Linguraru, M.G.; Sandberg, J.K.; Jones, E.C.; Summers, R.M. Automated segmentation and quantification of liver and spleen from CT images using normalized probabilistic atlases and enhancement estimation. Med. Phys. 2010, 37, 2771–2783. [Google Scholar] [CrossRef]
  26. Roth, H.R.; Lu, L.; Seff, A.; Cherry, K.M.; Hoffman, J.; Wang, S.; Liu, J.; Turkbey, E.; Summers, R.M. DeepOrgan: Multi-level deep convolutional networks for automated pancreas segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar] [CrossRef]
  27. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar] [CrossRef]
  28. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. UNet 3+: A full-scale connected UNet for medical image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar] [CrossRef]
  29. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the Fourth International Conference on 3D Vision, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 565–571. [Google Scholar] [CrossRef]
  30. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Quebec City, QC, Canada, 14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 240–248.
  31. Calisto, M.B.; Lai-Yuen, S.K. AdaEn-Net: An ensemble of adaptive 2D–3D fully convolutional networks for medical image segmentation. Neural Netw. 2021, 126, 76–94.
  32. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2921–2929.
  33. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  34. Jalali, Y.; Fateh, M.; Rezvani, M.; Abolghasemi, V.; Anisi, M.H. ResBCDU-Net: A deep learning framework for lung CT image segmentation. Sensors 2021, 21, 268.
  35. Simpson, A.L.; Antonelli, M.; Bakas, S.; Bilello, M.; Farahani, K.; van Ginneken, B.; Kopp-Schneider, A.; Landman, B.A.; Litjens, G.; Menze, B.; et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv 2019, arXiv:1902.09063.
  36. Vu, M.H.; Grimbergen, G.; Nyholm, T.; Löfstedt, T. Evaluation of multislice inputs to convolutional neural networks for medical image segmentation. Med. Phys. 2020, 47, 6216–6231.
  37. Sørensen, T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. In Kongelige Danske Videnskabernes Selskab; Munksgaard: Copenhagen, Denmark, 1948; Volume 5, pp. 1–34. Available online: https://books.google.com.bd/books?id=rpS8GAAACAAJ (accessed on 23 August 2025).
  38. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
  39. Huttenlocher, D.P.; Klanderman, G.A.; Rucklidge, W.J. Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 850–863.
  40. Maier, O.; Menze, B.H.; von der Goltz, J.; Hänsch, R.; Handels, H.; Hoelter, T.; Jakab, A.; Kalavari, V.; Lancaster, J.L.; Marti-Bonmati, L.; et al. ISLES 2015—A public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Med. Image Anal. 2017, 35, 250–269.
Figure 1. Spleen slice with corresponding mask.
Figure 2. System workflow for spleen segmentation, including dataset splitting, preprocessing, model execution, loss optimization, evaluation, and Grad-CAM visualization.
Figure 3. Proposed U-Net architecture for spleen segmentation, featuring multi-slice input and hybrid loss function.
Figure 4. Encoder of the proposed U-Net for spleen segmentation with multi-slice input.
Figure 5. Decoder of the proposed U-Net for spleen segmentation with hybrid loss function.
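Figures 3 and 4 depict the multi-slice input: each training sample stacks the slice of interest with its two axial neighbours as a three-channel image. The NumPy sketch below illustrates one plausible way to build such inputs; the function name, the edge-replication policy at volume boundaries, and the omission of intensity windowing and resampling are our assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def make_three_slice_input(volume: np.ndarray, i: int) -> np.ndarray:
    """Stack slices i-1, i, i+1 of a (D, H, W) CT volume into a
    three-channel (3, H, W) input, replicating edge slices at the
    volume boundaries. A minimal sketch; windowing and resampling
    steps are intentionally omitted."""
    depth = volume.shape[0]
    below = volume[max(i - 1, 0)]
    above = volume[min(i + 1, depth - 1)]
    return np.stack([below, volume[i], above], axis=0)

# Usage on a dummy 64-slice, 512x512 volume:
vol = np.zeros((64, 512, 512), dtype=np.float32)
x = make_three_slice_input(vol, 0)  # edge slice gets replicated
assert x.shape == (3, 512, 512)
```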
Figure 6. Example segmentation results: (a) input CT slice, (b) ground truth mask, (c) predicted mask.
Figure 7. Distribution of segmentation metrics (IoU, Dice score, precision, recall) on the validation set.
Figure 8. Confusion matrix showing the classification performance of the model on the validation set.
Figure 9. Grad-CAM visualization showing the network’s attention regions on a spleen CT slice.
Figure 10. Example segmentation results: (a) input CT slice, (b) ground truth mask, (c) predicted mask, (d) Grad-CAM heatmap.
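Figures 9 and 10 show Grad-CAM heatmaps over spleen slices. As an illustration only, the PyTorch sketch below implements generic Grad-CAM for a binary segmentation network; using the mean foreground logit as the scalar score to differentiate, the choice of target layer, and all names are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer):
    """Generic Grad-CAM for a binary segmentation network.
    `x` is a (1, C, H, W) input; `target_layer` is any conv module
    inside `model`. The "class score" is taken as the mean
    foreground logit, a common choice for dense prediction."""
    feats, grads = {}, {}

    def fwd_hook(_, __, output):
        feats["a"] = output            # activations of target layer

    def bwd_hook(_, grad_in, grad_out):
        grads["g"] = grad_out[0]       # gradients w.r.t. activations

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(x)              # assumed (1, 1, H, W) mask logits
        model.zero_grad()
        logits.mean().backward()       # scalar score -> gradients
    finally:
        h1.remove()
        h2.remove()

    # Global-average-pool gradients to channel weights, weight the
    # activations, and keep positive evidence only.
    w = grads["g"].mean(dim=(2, 3), keepdim=True)    # (1, K, 1, 1)
    cam = F.relu((w * feats["a"]).sum(dim=1))        # (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```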
Table 1. Comparison of spleen segmentation approaches.

| Ref. | Methods | Architecture | Dataset | Dice | Research Gap |
| --- | --- | --- | --- | --- | --- |
| [7] | Multi-atlas w/ shape priors | Statistical atlas | 38 CT | 0.89 | Computationally intensive |
| [8] | Shape model, deformable reg. | Active shape model | 86 cases | 0.84 | Limited robustness |
| [9] | 3D FCN | 5-layer 3D CNN | BTCV (30) | 0.90 | High memory needs |
| [10] | Multi-scale 3D FCN | 7-layer CNN | LiTS (131) | 0.91 | High GPU memory |
| [11] | Attention U-Net | 6-layer U-Net + attn. | Private (200) | 0.87 | Overfitting risk |
| [5] | UNet++ nested | 5-layer nested U-Net | Med. Decathlon | 0.92 | Slow convergence |
| [12] | nnU-Net auto-config | 6-layer adaptive U-Net | Med. Decathlon | 0.915 | Resource-heavy |
| [13] | Modified U-Net | 5-layer U-Net | TCIA (150) | 0.89 | Boundary issues |
| [14] | Transformer-U-Net | 6-layer CNN-Trans. | Synapse | 0.913 | Low interpretability |
| [15] | Uncertainty-driven GCN refinement | 2D U-Net + GCN | MSD (61 CT) | – | Lesion size imbalance |
| [16] | Proxy data for hyperparameter optimization | CNN-based optimization | MSD proxy | – | Expensive optimization |
| [17] | Perimeter-based loss function | U-Net with perimeter loss | MSD (41 CT) | – | Expensive contour losses |
| [18] | Automated segmentation for abnormal spleens | Enhanced 3D U-Net | MSD (61) + in-house | 0.897 | Small dataset, no cross-validation |
| [19] | Two-stage localization + segmentation | Localization + segmentation networks | MSD multi-organ | – | Poor performance for small organs |
| [20] | Multi-decoder with refinement | MiT-B2 + multi-decoders | MSD (41 CT) | 0.917 | Single decoder poor for heterogeneity |
| [21] | Zero-shot segmentation with SAM 2 | SAM 2 foundation model | MSD CT | – | Gap to supervised methods |
| [22] | Segmentation + VAE for volume estimation | Segmentation network + VAE | MSD (60/40 CT) | – | Limited 2D volume estimation accuracy |
| [23] | Synthetic image augmentation | Conditional StyleGAN2 | MSD (one of ten) | – | Synthetic images not always beneficial |
| [24] | Interactive AI-guided labeling | MONAI Label on 3D Slicer | MSD (34 images) | 0.831 | Lack of prospective studies on active learning |
| Ours | Enhanced U-Net | 4-layer + context + hybrid loss | Med. Decathlon (61) | 0.923 | 2D limits, preprocessing |
Table 2. Quantitative performance metrics on the Medical Decathlon dataset.

| Metric | Mean | Standard Deviation |
| --- | --- | --- |
| Dice Similarity Coefficient | 0.923 | 0.04 |
| Intersection over Union | 0.859 | 0.06 |
| Precision | 0.934 | 0.05 |
| Recall (Sensitivity) | 0.918 | 0.06 |
| Specificity | 0.997 | 0.002 |
| Hausdorff Distance (mm) | 9.47 | 4.36 |
| Average Surface Distance (mm) | 1.21 | 0.83 |
| Volumetric Error (%) | 4.7 | 3.2 |
| Total Parameters (M) | 7.82 | – |
| Trainable Parameters (M) | 7.82 | – |
| Non-Trainable Parameters (M) | 0 | – |
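The overlap metrics in Table 2 follow standard confusion-matrix definitions. The NumPy/SciPy sketch below (hypothetical function names) shows how they can be computed from binary masks; the Hausdorff helper uses all foreground points rather than extracted surfaces and handles 2D masks only, both simplifications relative to typical surface-distance implementations.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def overlap_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Voxel-overlap metrics for boolean masks of identical shape,
    matching the definitions reported in Table 2."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
    }

def hausdorff_mm(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between two non-empty 2D masks,
    scaled by pixel spacing in mm. Simplified: compares all foreground
    points, not just boundary points."""
    p = np.argwhere(pred) * np.asarray(spacing)
    g = np.argwhere(gt) * np.asarray(spacing)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```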
Table 3. Performance analysis by clinical case characteristics.

| Case Category | Dice Score | Hausdorff Distance (mm) |
| --- | --- | --- |
| Normal spleen morphology | 0.938 | 8.12 |
| Splenomegaly (>500 mL) | 0.917 | 10.84 |
| Post-trauma cases | 0.901 | 12.37 |
| Thin slice (<3 mm) | 0.932 | 8.76 |
| Thick slice (>5 mm) | 0.911 | 11.29 |
| Portal phase contrast | 0.928 | 9.03 |
| Arterial phase contrast | 0.912 | 10.52 |
| High contrast enhancement | 0.935 | 8.45 |
| Low contrast enhancement | 0.908 | 11.23 |
Table 4. Comprehensive ablation study results.

| Configuration | Dice Score | Improvement |
| --- | --- | --- |
| Baseline U-Net | 0.891 | – |
| + Multi-slice input | 0.911 | +2.0% |
| + Hybrid loss function | 0.912 | +2.1% |
| + Data augmentation | 0.918 | +2.7% |
| + Optimized preprocessing | 0.923 | +3.2% |
| Single vs. multi-slice: | | |
| Single-slice input | 0.903 | – |
| Three-slice input | 0.923 | +2.0% |
| Five-slice input | 0.921 | +1.8% |
| Loss function comparison: | | |
| Dice loss only | 0.911 | – |
| BCE loss only | 0.889 | −2.2% |
| Hybrid loss (λ = 0.5) | 0.923 | +1.2% |
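The best ablation row combines Dice and BCE terms with λ = 0.5. A minimal PyTorch sketch of such a hybrid loss follows, assuming raw logits of shape (N, 1, H, W) and binary float targets; the exact smoothing constants, reduction, and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits: torch.Tensor, target: torch.Tensor,
                lam: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Hybrid segmentation loss: lam * (soft Dice loss) +
    (1 - lam) * BCE, with lam = 0.5 as in the best ablation row.
    `target` must be a float tensor of 0s and 1s with the same
    shape as `logits`."""
    probs = torch.sigmoid(logits)
    # Per-sample soft Dice over the spatial and channel dimensions.
    inter = (probs * target).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice_loss = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()
    # BCE computed directly on logits for numerical stability.
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return lam * dice_loss + (1.0 - lam) * bce
```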
Table 5. Comparative performance analysis with state-of-the-art approaches.

| Method | Dice Score | HD (mm) | Training Strategy |
| --- | --- | --- | --- |
| Standard U-Net [4] | 0.891 | 12.34 | Encoder–decoder, skip connections, CE loss, basic aug. |
| UNet++ [5] | 0.920 | 10.12 | Nested skip connections, deep supervision, Dice loss |
| Attention U-Net [27] | 0.905 | 11.67 | Attention gates, Dice+CE loss, augmentation |
| nnU-Net [12] | 0.915 | 10.89 | Adaptive preprocessing, Dice+CE loss, heavy aug. |
| TransUNet [14] | 0.913 | 10.45 | Transformer+U-Net, Dice loss, patch-based |
| Our Enhanced U-Net | 0.923 | 9.47 | Multi-slice input, hybrid loss, deep supervision, aug. |