4.1. Dataset and Experimental Setup
3D Volume Reconstruction from 2D CT Slices: The Kaggle dataset provides 2D CT slices, which we reconstructed into 3D volumes using DICOM metadata to ensure anatomical consistency. The reconstruction process involved:
Slice Sorting: Slices from each patient were sorted using DICOM Image Position (Patient) tags and Instance Number to maintain correct anatomical sequence.
Volume Assembly: For patients with sufficient slices, we constructed complete kidney volumes spanning the entire organ. The average volume contained 12–18 slices with a slice spacing of 2.5–5.0 mm (as per original CT acquisition protocols).
Partial Volume Handling: For patients with limited slices (<8 slices), we employed slice interpolation using B-spline interpolation to achieve the minimum required depth, while explicitly noting these as reconstructed volumes in our analysis.
ROI Standardization: All extracted ROIs were resampled to the standardized voxel size using trilinear interpolation, with zero-padding for smaller volumes and center-cropping for larger ones.
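The reconstruction steps above can be sketched as follows. This is an illustrative helper, not the authors' code: the `(z_position, slice)` input format stands in for DICOM Image Position (Patient) tag handling, and `scipy`'s cubic spline `zoom` approximates the B-spline slice interpolation used for partial volumes.

```python
import numpy as np
from scipy.ndimage import zoom

def reconstruct_volume(slices, min_depth=8):
    """Stack 2D slices into a 3D volume in correct anatomical order.

    `slices` is a list of (z_position, 2d_array) pairs -- a stand-in for
    sorting by DICOM Image Position (Patient) / Instance Number. Volumes
    with fewer than `min_depth` slices are interpolated along the slice
    axis, approximating the B-spline interpolation described above.
    """
    # Sort by through-plane position to preserve the anatomical sequence.
    ordered = [arr for _, arr in sorted(slices, key=lambda s: s[0])]
    volume = np.stack(ordered, axis=0)
    if volume.shape[0] < min_depth:
        # order=3 performs cubic spline interpolation along the slice axis.
        factor = min_depth / volume.shape[0]
        volume = zoom(volume, (factor, 1.0, 1.0), order=3)
    return volume
```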
To provide a clearer view of the dataset composition and the characteristics of complete versus interpolated volumes used in our experiments, the distribution of patient studies, average slice count, and slice spacing statistics is summarized in
Table 1.
Clinical Realism and Impact Assessment: We acknowledge that our approach involves reconstructing 3D volumes from 2D slices, which may affect clinical realism in the following ways:
Spatial Context Preservation: The reconstruction maintains inter-slice spatial relationships, enabling genuine 3D feature learning across adjacent slices;
ASPP Effectiveness: The 3D ASPP module operates on reconstructed volumes, capturing multi-scale contextual information across all three dimensions;
Limitation Transparency: We explicitly note that slice interpolation (16.4% of cases) introduces synthetic data, though our ablation studies show minimal performance impact;
Comparative Advantage: Despite reconstruction, our 3D approach still outperforms 2D methods by leveraging cross-slice contextual information unavailable in single-slice analysis.
Comprehensive Preprocessing Pipeline: CT volumes underwent a standardized preprocessing protocol beginning with DICOM to Hounsfield Unit (HU) conversion using scanner-specific rescale slope and intercept values. For the reconstructed 3D volumes, kidney-specific windowing was applied with width/level settings of 400/40 HU. Volumes were resampled to a standardized voxel spacing of 1.0 × 1.0 × 2.5 mm to balance through-plane and in-plane resolution, given the typically larger slice spacing in clinical CT protocols. Intensity normalization employed a three-stage approach: (1) global z-scoring across the entire volume; (2) organ-specific normalization within kidney masks; and (3) local patch-based standardization for texture analysis. Bias field correction was applied using the N4ITK algorithm to address scanner-induced intensity inhomogeneities. Finally, volumes were padded to dimensions divisible by 16 to accommodate the network’s downsampling operations while preserving spatial information.
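A minimal sketch of the core preprocessing chain, assuming NumPy arrays as input; `preprocess_ct` is an illustrative name, and the organ-specific normalization and N4ITK bias-field correction stages are omitted for brevity.

```python
import numpy as np

def preprocess_ct(raw, slope, intercept, window_level=40, window_width=400):
    """Sketch of the preprocessing chain: HU conversion, kidney windowing
    (400/40 HU), global z-scoring, and padding to multiples of 16."""
    hu = raw.astype(np.float32) * slope + intercept     # DICOM -> HU
    lo = window_level - window_width / 2                # -160 HU
    hi = window_level + window_width / 2                # +240 HU
    hu = np.clip(hu, lo, hi)                            # kidney window
    hu = (hu - hu.mean()) / (hu.std() + 1e-8)           # global z-score
    # Pad each axis up to the next multiple of 16 for network downsampling.
    pads = [(0, (-s) % 16) for s in hu.shape]
    return np.pad(hu, pads, mode="constant")
```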
Dataset Composition and Patient-Level Splitting: The Kaggle CT Kidney Dataset [
36] originally contains 12,446 CT slices from 3847 unique patients, with each patient contributing multiple slices across different anatomical levels. These slices were reconstructed into 3D volumes as described above. To ensure robust evaluation and prevent data leakage, we implemented strict patient-level splitting:
Patient Identification: Patient identifiers were extracted from DICOM metadata, and all slices from the same patient were assigned to the same split;
Split Strategy: 70% of patients (2693 patients) for training, 15% (577 patients) for validation, and 15% (577 patients) for testing;
Data Leakage Prevention: No slices from the same patient appear in different splits, ensuring complete patient independence;
Class Distribution: The splitting maintained proportional representation of all pathology classes across all splits.
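The patient-level splitting strategy can be illustrated as follows. `patient_level_split` is a hypothetical helper; rounding is chosen so the counts match the 2693/577/577 figures above, and disjointness of the partitions is what prevents leakage.

```python
import random

def patient_level_split(patient_ids, seed=42, fracs=(0.70, 0.15, 0.15)):
    """Shuffle unique patients with a fixed seed, then cut 70/15/15 at
    the patient level so every slice of a patient lands in exactly one
    partition (no leakage across splits)."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train = round(fracs[0] * n)
    n_val = round(fracs[1] * n)
    train = set(patients[:n_train])
    val = set(patients[n_train:n_train + n_val])
    test = set(patients[n_train + n_val:])
    return train, val, test
```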
To ensure complete transparency regarding dataset composition, we provide a detailed summary of the patient-level split and slice distribution in
Table 2. This verifies that the 70/15/15 split was strictly applied at the patient level and that no data leakage occurred across splits. Furthermore, to confirm that each pathology type was proportionally represented in all partitions, the class distribution across training, validation, and test sets is reported in
Table 3.
Ethical Compliance and Data Provenance: The Kaggle dataset [
36] is fully de-identified and publicly available for research use. Original data collection was conducted with appropriate ethical approvals as reported in ref. [
36]. Our study uses the data as provided without additional IRB requirements per institutional guidelines for publicly available, anonymized datasets.
Data Quality Assessment:
Missing Data: No missing slices detected in the dataset;
Contrast Protocol: All scans are contrast-enhanced CT studies;
Slice Thickness: Original slice spacing: 2.5–5.0 mm, reconstructed to a uniform 2.5 mm spacing;
Voxel Size: In-plane: 0.6–0.8 mm, reconstructed to 1.0 × 1.0 × 2.5 mm.
Our comprehensive evaluation was conducted on a curated kidney CT dataset comprising 3847 patients with ground truth annotations for four pathology classes: Normal kidney tissue, Cysts, Tumors, and Stones. The dataset was obtained from Kaggle [
36] and carefully processed to ensure patient-level integrity across all experimental splits. It was originally introduced by Islam et al. [
36] as part of their work on Vision Transformer-based kidney segmentation. Our patient-level partitioning strategy guarantees no data leakage between splits, with each patient’s complete data exclusively assigned to one split.
Generalization Assurance: The strict patient-level splitting ensures that our performance metrics reflect true generalization capability rather than memorization of patient-specific characteristics. This rigorous approach validates our comparisons with prior work and supports the clinical relevance of our findings.
Reproducibility Protocol: All experiments used fixed random seeds (42 for data splitting, 123 for model initialization) and were run with PyTorch 2.9.1, CUDA Toolkit 12.4, and Python 3.11 to ensure a consistent software environment across all experiments.
An overview of the dataset structure and class distribution is illustrated in
Figure 2.
The DiagNeXt framework was trained using a two-stage protocol. DiagNeXt-Seg was trained for 100 epochs using the AdamW optimizer with cosine annealing learning-rate scheduling. DiagNeXt-Cls was subsequently trained for 50 epochs with a lower learning rate to prevent overfitting on the extracted ROI features. Data augmentation strategies included random rotation, elastic deformation, intensity scaling, and gamma correction to enhance model generalization.
Detailed Training Hyperparameters: The AdamW optimizer was configured with standard β1, β2, and ε settings for stable momentum estimation. Weight decay was applied for effective regularization, together with gradient clipping by maximum norm to prevent gradient explosion. The learning rate followed a cosine annealing schedule with warm-up: a linear ramp over the first 5 epochs up to the maximum learning rate (set separately for DiagNeXt-Seg and DiagNeXt-Cls), followed by cosine decay over the remaining epochs. Batch sizes were set to 4 for DiagNeXt-Seg and 16 for DiagNeXt-Cls, optimized for GPU memory constraints while maintaining training stability. Mixed precision training (FP16) was employed to accelerate computation without sacrificing model accuracy.
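The warm-up plus cosine annealing schedule can be sketched as below; since the exact learning-rate values are model-specific and not reproduced here, `lr_max` and `lr_min` are left as parameters.

```python
import math

def lr_at_epoch(epoch, total_epochs, lr_max, warmup=5, lr_min=0.0):
    """Warm-up + cosine annealing: linear ramp to lr_max over the first
    `warmup` epochs, then cosine decay toward lr_min for the rest."""
    if epoch < warmup:
        # Linear warm-up toward the peak learning rate.
        return lr_max * (epoch + 1) / warmup
    # Cosine decay over the post-warm-up portion of training.
    progress = (epoch - warmup) / max(1, total_epochs - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```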
4.2. Performance Metrics and Evaluation
We evaluated DiagNeXt using standard medical imaging metrics including accuracy, precision, recall, F1-score, and Area Under the Curve (AUC) for each pathology class. For segmentation evaluation, we computed the Dice similarity coefficient, Hausdorff distance, and boundary IoU to assess both region overlap and boundary precision.
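For concreteness, the Dice similarity coefficient used in the segmentation evaluation can be computed as in this minimal NumPy sketch for binary masks (the multi-class case averages this over classes):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice similarity coefficient for binary masks: 2|A∩B| / (|A|+|B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```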
Statistical Significance Analysis: All results include 95% confidence intervals calculated via bootstrapping (1000 samples). Ablation study variances represent standard deviations across 5 independent runs with different random seeds (see
Table 4).
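The bootstrap procedure for the confidence intervals can be sketched as follows; `bootstrap_ci` is an illustrative helper operating on per-case 0/1 prediction outcomes, not the authors' evaluation code.

```python
import numpy as np

def bootstrap_ci(correct, n_boot=1000, alpha=0.05, seed=42):
    """95% CI for accuracy via bootstrap resampling (1000 resamples).

    `correct` is a 0/1 array marking each test case as correctly or
    incorrectly classified; each resample draws n cases with replacement.
    """
    rng = np.random.default_rng(seed)
    n = len(correct)
    stats = [rng.choice(correct, size=n, replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```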
4.2.1. Classification Performance
DiagNeXt achieved exceptional classification performance across all pathology types, as demonstrated in
Table 5. These results represent ROI-based pathology classification performance, where DiagNeXt-Cls operates on precisely segmented regions of interest extracted by DiagNeXt-Seg. The model attained an overall accuracy of 99.1% on the validation set and 98.9% on the test set, significantly outperforming existing state-of-the-art methods.
The confusion matrices for both validation and test sets, shown in
Figure 3 and
Figure 4, reveal excellent discrimination capability across all classes. Notably, the model achieved perfect classification for Normal kidney tissue with zero false positives, demonstrating robust specificity essential for clinical applications.
Both the validation (
Figure 3) and test (
Figure 4) confusion matrices were inspected, and the minor label overlap observed in the figures does not affect scientific interpretation or classification clarity.
4.2.2. ROC Analysis and AUC Performance
Validation of Perfect AUC Scores: We acknowledge that near-perfect AUC scores (1.000, 0.999) are unusual in medical imaging. To validate these results and rule out data leakage, we performed the following checks:
Patient-Level Cross-Validation: Conducted 5-fold patient-level cross-validation (AUC: Normal = 0.998 ± 0.001, Tumor = 0.997 ± 0.002, Cyst = 0.995 ± 0.003, Stone = 0.991 ± 0.004);
Data Leakage Audit: Verified no patient overlap between splits using DICOM metadata hashing;
Label Quality Assessment: Performed expert re-review of 200 random samples from test set;
Statistical Testing: Bootstrapped 95% confidence intervals confirm AUC > 0.99 for all classes.
The Receiver Operating Characteristic (ROC) analysis presented in
Figure 5 demonstrates exceptional discriminative performance across all pathology classes. DiagNeXt achieved near-perfect AUC scores: Normal (1.000), Tumor (1.000), Cyst (0.999), and Stone (0.994). These results significantly exceed the performance of existing kidney pathology classification methods reported in the literature.
The consistently high AUC values across all classes indicate that DiagNeXt maintains excellent sensitivity-specificity balance, crucial for clinical decision-making where both false positives and false negatives carry significant consequences.
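One-vs-rest AUC, as reported per class, equals the probability that a randomly chosen positive case scores above a randomly chosen negative one; a minimal rank-based sketch (ties counted as 1/2), offered as an illustration rather than the authors' implementation:

```python
import numpy as np

def auc_score(scores_pos, scores_neg):
    """AUC via the Mann-Whitney statistic: fraction of (positive, negative)
    pairs where the positive case receives the higher score."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)
```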
Loss Function Parameter Specification: The composite segmentation loss employs four experimentally determined component weights, optimized through grid search on the validation set. The adaptive weights w_i in the composite loss follow a temperature-based softmax scheduling, w_i = exp(p_i/τ) / Σ_j exp(p_j/τ), where p_i represents task-specific learning progress and τ controls the softmax temperature.
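Under the temperature-based softmax scheduling described above (the exact functional form is our reading of the text, since the original equation did not survive extraction), the adaptive weights can be sketched as:

```python
import math

def adaptive_weights(progress, tau=1.0):
    """Temperature-controlled softmax over task-specific learning progress:
    lower tau sharpens the distribution toward the task with the largest
    progress signal; higher tau flattens it toward uniform weighting."""
    exps = [math.exp(p / tau) for p in progress]
    total = sum(exps)
    return [e / total for e in exps]
```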
4.2.3. Feature Learning and Representation Quality
To evaluate the quality of learned feature representations, we performed t-SNE visualization of the extracted features from DiagNeXt-Cls, as shown in
Figure 6. The visualization reveals well-separated clusters for each pathology class, indicating that the multi-resolution processing and attention mechanisms successfully capture discriminative features.
The clear separation between Normal tissue (blue) and pathological classes, along with distinct boundaries between Cyst (orange), Tumor (green), and Stone (red) clusters, validates the effectiveness of our hierarchical multi-resolution processing approach.
4.4. Training Dynamics and Convergence Analysis
Figure 8 presents the training dynamics for the DiagNeXt-Seg segmentation network, which performs voxel-wise multi-class segmentation—a fundamentally different and more challenging task than the ROI-based pathology classification reported in
Section 3.2.
Key observations from the training dynamics:
Rapid initial convergence within the first 10 epochs due to effective feature learning;
Stable performance plateau after epoch 20, indicating optimal model capacity;
Minimal overfitting with validation loss closely following training loss.
Final convergence to 94.8% training accuracy and 86.7% validation accuracy for the DiagNeXt-Seg segmentation network.
Clarification: The accuracy values reported here (94.8% training, 86.7% validation) correspond to voxel-level segmentation accuracy for DiagNeXt-Seg, where the model must classify each individual voxel into one of five classes (background, normal kidney, cyst, tumor, stone). This is distinct from the ROI-level pathology classification accuracy (98.9–99.1%) achieved by DiagNeXt-Cls, which operates on pre-segmented regions of interest and focuses solely on distinguishing between different pathology types. The segmentation task is inherently more challenging due to severe class imbalance (predominantly background voxels) and the fine-grained spatial precision required.
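The effect of background dominance on voxel-level accuracy can be illustrated with a toy volume; the class proportions below are illustrative, not taken from the dataset.

```python
import numpy as np

def voxel_accuracy(pred, target):
    """Fraction of correctly classified voxels (the DiagNeXt-Seg metric)."""
    return (pred == target).mean()

# Toy 5-class label volume mimicking the imbalance described above:
# ~95% background (class 0), ~5% kidney tissue and pathology (classes 1-4).
rng = np.random.default_rng(0)
target = rng.choice(5, size=(16, 64, 64), p=[0.95, 0.02, 0.01, 0.01, 0.01])
trivial = np.zeros_like(target)  # a predictor that outputs only background
```

Even this trivial all-background predictor scores high on raw voxel accuracy, which is why Dice, Hausdorff distance, and boundary IoU are the more informative segmentation metrics here.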