4.1. Experimental Setup
4.1.1. Datasets
We performed a basic screening process on our datasets, confirming that each patient contributed only one case and ensuring that there was no data leakage between the training and test sets. After screening, 434 cases from the UOH and 615 cases from the OTMC remained, totaling 1049 CT scans. Among these, 403 cases from the UOH and 210 cases from the OTMC were retained with PF-ILD labels, as shown in
Table 1.
PM extractor dataset: We randomly select 200 cases from each facility (400 cases in total) from the 1049 available cases, regardless of PF-ILD status. All images are converted from the original DICOM format to 8-bit PNG format via HU windowing with a window width of 700 HU, followed by scanner mask normalization. The dataset is split into 70%, 15%, and 15% for training, validation, and testing, respectively.
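As a concrete illustration of the DICOM-to-PNG conversion, the sketch below maps HU values to 8-bit via intensity windowing. The window level used here (−600 HU) is purely illustrative; this excerpt specifies only the 700 HU width.

```python
import numpy as np

def window_to_uint8(hu, level, width=700.0):
    """Map a HU image to 8-bit via intensity windowing.

    Pixels below level - width/2 clip to 0; pixels above
    level + width/2 clip to 255.
    """
    lo = level - width / 2.0
    hi = level + width / 2.0
    out = np.clip(np.asarray(hu, dtype=float), lo, hi)
    out = (out - lo) / (hi - lo) * 255.0
    return out.astype(np.uint8)

# Example: a synthetic 2x2 HU patch; the -600 HU level is hypothetical.
patch = np.array([[-1000.0, -600.0], [-250.0, 40.0]])
png8 = window_to_uint8(patch, level=-600.0, width=700.0)
```

With level −600 and width 700, the window spans [−950, −250] HU; anything outside saturates to 0 or 255.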
PF-ILD identification dataset: We use all available cases with PF-ILD labels for identification, totaling 613 cases (403 from the UOH and 210 from the OTMC). These cases undergo scanner mask normalization and RGB windowing processing to better highlight fibrotic regions based on clinical knowledge. The dataset is also divided into 70%, 15%, and 15% for training, validation, and testing, respectively.
4.1.2. Implementation Details
Both the PM extractor and the Slider modules are implemented in the PyTorch framework (version 1.13.1) and trained on servers equipped with two NVIDIA RTX 6000 Ada Generation GPUs. For the PM extractor, we use EfficientNet-b4 [
58] as the classification backbone. Input images are resized to
and normalized to the intensity range
. Models are trained for 20 epochs with a batch size of 256 using weighted cross-entropy loss and the Adam optimizer [
59], with an initial learning rate of
. Data augmentation is applied during training. For the Slider, we employ DINOv2 [
13] as the vision foundation model backbone, with the adapter dimension
d set to 192, corresponding to a reduction factor
. Input images are resized to
. Models are trained for 80 epochs with a batch size of 8 using class-weighted cross-entropy loss (weight ratio 1:1.5), scanner mask normalization, and RGB windowing processing. Optimization is performed using the Adam optimizer, with an initial learning rate of
, a cosine annealing learning rate scheduler, and a dropout rate of 0.3.
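A minimal sketch of a residual bottleneck adapter matching the d = 192 setting above; the backbone hidden size of 768 is an assumption for illustration (the excerpt does not state the DINOv2 variant's feature dimension), and this generic adapter is not necessarily the Slider's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project to d, nonlinearity,
    up-project, and add back to the input tokens.

    hidden_dim=768 is an assumption; the paper specifies only d = 192.
    """
    def __init__(self, hidden_dim=768, adapter_dim=192):
        super().__init__()
        self.down = nn.Linear(hidden_dim, adapter_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(adapter_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

tokens = torch.randn(8, 197, 768)  # batch size 8, as in the paper
out = BottleneckAdapter()(tokens)
```

Zero-initializing the up-projection makes the adapter an identity at the start of training, a common choice that keeps the frozen backbone's behavior intact initially.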
4.1.3. Evaluation Metrics
1-Up-Down Accuracy. To evaluate the performance of the PM extractor, we adopt a relaxed evaluation criterion called 1-Up-Down Accuracy. In clinical practice, the slice immediately above or below the ground-truth PM often contains similar anatomical features. Therefore, we consider a prediction correct if the predicted slice index, $\hat{y}$, satisfies the following inequality:
$$|\hat{y} - y| \leq 1,$$
where $y$ denotes the ground-truth PM slice index. The 1-Up-Down Accuracy is then defined as follows:
$$\text{1-Up-Down Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(|\hat{y}_i - y_i| \leq 1\right),$$
where $N$ is the number of evaluated cases and $\mathbb{1}(\cdot)$ is the indicator function, which returns 1 if the condition is true and 0 otherwise.
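The criterion above translates directly into code; a minimal sketch:

```python
import numpy as np

def one_up_down_accuracy(pred, gt):
    """Fraction of cases whose predicted slice index lies within one
    slice of the ground-truth PM index."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    return float(np.mean(np.abs(pred - gt) <= 1))

# Example: 3 of 4 predictions fall within +/-1 slice of the ground truth.
acc = one_up_down_accuracy([10, 25, 41, 57], [10, 26, 43, 56])
```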
For the PF-ILD identification task, we adopt several metrics for evaluation:
AUROC. The Area Under the Receiver Operating Characteristic Curve (AUROC) is widely used for binary classification and measures the model’s ability to distinguish between healthy and diseased samples across various classification thresholds. Given predicted scores, $s_i$, and true labels, $y_i \in \{0, 1\}$, for $i = 1, \dots, N$, the AUROC is defined as follows:
$$\mathrm{AUROC} = \frac{1}{N^{-} N^{+}} \sum_{i:\, y_i = 0} \; \sum_{j:\, y_j = 1} \mathbb{1}\left(s_j > s_i\right),$$
where $N^{-}$ and $N^{+}$ are the number of healthy and diseased samples, respectively.
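The pairwise (Wilcoxon-Mann-Whitney) form of the AUROC can be computed directly; a minimal sketch, with ties counted as half:

```python
import numpy as np

def auroc(scores, labels):
    """Pairwise AUROC: fraction of (healthy, diseased) pairs in which
    the diseased sample receives the higher score; ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# 3 of the 4 (healthy, diseased) pairs are correctly ordered.
val = auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```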
The metrics below are calculated from the following components of the confusion matrix at a specific threshold of 0.5: True Positives (TPs), True Negatives (TNs), False Positives (FPs), and False Negatives (FNs).
Accuracy (Acc.) measures the proportion of all samples that are correctly classified:
$$\text{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Recall (Rec.) measures the proportion of actual positive samples that are correctly identified and is crucial for minimizing missed diagnoses:
$$\text{Rec.} = \frac{TP}{TP + FN}.$$
Precision (Prec.) measures the proportion of positive predictions that are correct, indicating the reliability of a positive diagnosis:
$$\text{Prec.} = \frac{TP}{TP + FP}.$$
Specificity (Spec.) measures the proportion of actual negative samples that are correctly identified, reflecting the model’s ability to rule out the condition:
$$\text{Spec.} = \frac{TN}{TN + FP}.$$
F1-Score (F1) is the harmonic mean of precision and recall, providing a balanced measure of a model’s performance, which is especially useful in cases of class imbalance:
$$\text{F1} = \frac{2 \times \text{Prec.} \times \text{Rec.}}{\text{Prec.} + \text{Rec.}}.$$
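These definitions translate into a small helper, assuming nonzero denominators:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Threshold-based metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    spec = tn / (tn + fp)
    f1 = 2 * prec * rec / (prec + rec)
    return {"acc": acc, "rec": rec, "prec": prec, "spec": spec, "f1": f1}

m = confusion_metrics(tp=30, tn=50, fp=10, fn=10)
```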
AUPRC. The Area Under the Precision–Recall Curve (AUPRC) is another threshold-independent metric. It summarizes the trade-off between precision and recall across all possible thresholds. The AUPRC is particularly informative for imbalanced datasets, as it focuses on the performance of the minority (positive) class and is less influenced by the large number of true negatives than the AUROC.
Statistical comparison. To formally compare the models using each metric, we estimated 95% confidence intervals for the between-model difference ($\Delta$) using a class-stratified, paired bootstrap at the patient level, with the same resampled indices applied to both models. Two-sided p-values were obtained via a within-case score-swapping permutation test for the AUPRC and DeLong’s test for the AUROC. All tests were two-sided.
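A simplified illustration of the paired-resampling idea, using the same bootstrap indices for both models; class stratification and the permutation/DeLong p-values are omitted for brevity, so this is a sketch rather than the exact procedure.

```python
import numpy as np

def paired_bootstrap_ci(metric, scores_a, scores_b, labels, n_boot=1000, seed=0):
    """95% CI for metric(A) - metric(B), resampling patients with
    replacement and applying the SAME indices to both models."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    labels = np.asarray(labels)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # shared resample
        if labels[idx].min() == labels[idx].max():
            continue                           # skip single-class resamples
        diffs.append(metric(scores_a[idx], labels[idx])
                     - metric(scores_b[idx], labels[idx]))
    return np.percentile(diffs, [2.5, 97.5])

# Identical models give a degenerate [0, 0] interval.
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.4, 0.1, 0.6, 0.3])
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
acc = lambda s, y: float(np.mean((s > 0.5) == (y == 1)))
lo, hi = paired_bootstrap_ci(acc, scores, scores, labels)
```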
4.1.4. Comparison Methods
For the PM extractor, we evaluate several model families, including ResNet [
60], DenseNet [
61], and EfficientNet [
58]. For the Slider, we compare our method against several transfer learning baseline methods:
Full fine-tuning: fully updates all parameters of the backbone for PF-ILD identification.
Partial fine-tuning: updates only the last ViT layer while keeping all other layers frozen.
Linear probe: trains only the linear classification layer, keeping all other parameters fixed.
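In PyTorch terms, the three baselines differ only in which parameters receive gradients. The sketch below uses illustrative attribute names (`blocks`, `head`) on a stand-in model, not the paper's actual module layout.

```python
import torch.nn as nn

class TinyViT(nn.Module):
    """Stand-in for a ViT backbone plus linear head (names illustrative)."""
    def __init__(self, dim=8, depth=3, n_classes=2):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        self.head = nn.Linear(dim, n_classes)

def configure_baseline(model, mode):
    """Freeze/unfreeze parameters for the three transfer-learning baselines."""
    for p in model.parameters():
        p.requires_grad = False
    if mode == "full":            # full fine-tuning: everything trainable
        for p in model.parameters():
            p.requires_grad = True
    elif mode == "partial":       # only the last ViT layer is updated
        for p in model.blocks[-1].parameters():
            p.requires_grad = True
    elif mode == "linear":        # linear probe: classifier only
        for p in model.head.parameters():
            p.requires_grad = True
    return model

probe = configure_baseline(TinyViT(), "linear")
n_trainable = sum(p.numel() for p in probe.parameters() if p.requires_grad)
```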
4.2. Results of PSM-Based PM Extractor
We evaluate the performance of various backbone models for the PSM-based PM extractor on the same training and testing datasets using the 1-Up-Down Accuracy metric. The models span several architecture families, including ResNet [
60], DenseNet [
61], and EfficientNet [
58], and the best-performing model from each family is summarized in
Table 2.
Among all the models, EfficientNet-b4 achieves the highest overall performance, with an average 1-Up-Down Accuracy of 98.33%. Its class-wise performance is also strong, reaching 100% for the upper PM, 98.33% for the middle PM, and 96.67% for the lower PM extraction. Within the DenseNet family, DenseNet-169 performs the best, achieving an average 1-Up-Down Accuracy of 97.78%. Its accuracy for the upper and middle PM classes is comparable to that of EfficientNet-b4. These results suggest that the upper PM is a relatively easy prediction target, likely because it corresponds to the slice just before the lung fields become visible, a visually distinct and consistent anatomical feature. In contrast, identifying the lower PM is more challenging as there is greater variability in lung morphology near the diaphragm across patients.
To further validate the effectiveness of the PSM-based PM extractor, we analyze the deviation of predicted slices from ground-truth PMs using EfficientNet-b4, as shown in
Figure 4. We find that the majority of predictions are either exactly correct or within one slice of the ground truth. Notably, even incorrect predictions remain within two slices above or below the reference PM, demonstrating the robustness and reliability of the proposed PSM-based approach.
4.3. Results on Slider for PF-ILD Identification
Table 3 summarizes the PF-ILD identification performance of the proposed Slider model under three different RS configurations. In the 5-RS setting, Slider achieves an AUPRC of 0.790 (95% CI [0.652, 0.901]) and an AUROC of 0.847 (95% CI [0.760, 0.921]), improving over full fine-tuning (AUPRC 95% CI [0.645, 0.894]; AUROC 95% CI [0.706, 0.907]). As detailed in the Methods (statistical comparison) section, we compared the AUROC using DeLong’s test; the resulting 95% CI for the difference included zero (upper bound 0.115), and the p-value did not reach significance. We compared the AUPRC using a class-stratified, paired bootstrap of the difference to obtain 95% CIs, with ΔAUPRC = 0.008 (95% CI including zero, upper bound 0.086), and computed a within-case score-swapping permutation p-value, which was likewise non-significant. All tests were two-sided. Overall, the point estimates favor Slider, but the differences are not statistically significant; notably, Slider demonstrates comparable performance with substantially fewer trainable parameters (3.56 M), signifying its computational efficiency and deployability.
Across all configurations, 5-RS yields the best overall performance, followed by 9-RS and then 3-RS. Notably, in the 5-RS setting, Slider outperforms full fine-tuning on nearly all metrics except recall, highlighting its strong parameter efficiency and effectiveness. Partial fine-tuning achieves the second-highest AUROC (0.832), confirming that lightweight adaptation remains competitive. Linear probe achieves relatively high recall (0.730) but suffers from low precision and specificity, resulting in a lower AUPRC (0.724) and AUROC (0.774). These results demonstrate that Slider achieves the best trade-off between model complexity and diagnostic accuracy, making it particularly well suited for realistic clinical deployment.
4.4. Domain Shift Analysis
In practical applications, domain shifts frequently occur, representing variations between the training data distribution and the target environment. These discrepancies reduce performance when models are applied beyond their original training distribution. Such shifts are prevalent across datasets and are widely used as benchmarks to evaluate the robustness of machine learning models [62]. To investigate ILD-Slider’s resilience to domain shifts, we train Slider models on datasets from different facilities and evaluate them using the AUROC. The results are summarized in
Table 4.
In general, datasets from different facilities exhibit distinct data regimes. The OTMC dataset is less heterogeneous, so Slider can learn it more easily, whereas the UOH dataset is more heterogeneous. Models trained on a single facility fail to generalize consistently across facilities. For example, the Slider trained on the UOH dataset achieves an AUROC of 0.750 on UOH data and 0.896 on OTMC data, whereas the model trained on the OTMC dataset attains 0.921 on OTMC data yet drops to 0.759 on UOH data. In contrast, the model trained on both the UOH and OTMC datasets demonstrates the most balanced and robust behavior, achieving AUROCs of 0.823 on UOH data, 0.875 on OTMC data, and 0.847 on the combined UOH and OTMC test set. These findings underscore the importance of multi-facility training for Slider, which is critical for reliable PF-ILD identification across different scanners and institutions.
4.7. Further Analyses on ILD-Slider
This section further investigates the capabilities of ILD-Slider through a series of experiments.
The dimension on which the Slider should be applied. We set the adapter dimension, d, to 192 in Equation (5), which corresponds to a fixed scale factor relative to the backbone feature dimension. To systematically investigate the impact of the scale factor, we evaluate a range of smaller and larger adapter dimensions. The results are summarized in
Table 5.
We found that the best performance is achieved at the default scale factor, with an AUPRC of 0.790 and an AUROC of 0.847. Moreover, this setting provides the most favorable balance between recall and precision. In contrast, both smaller and larger scale factors yield decreased performance. Notably, the default setting (3.56 M tunable parameters) outperforms both higher-capacity and lower-capacity settings, highlighting an effective trade-off between model capacity and diagnostic accuracy in Slider.
The impact of RGB windowing processing. RGB windowing applies different window levels and widths to emphasize tissue-specific features, enabling the Slider model to capture a richer set of visual cues.
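A minimal sketch of RGB windowing, stacking three windowed renderings of the same HU image into one RGB input; the (level, width) pairs below are illustrative only, not the values used in the paper.

```python
import numpy as np

def rgb_windowing(hu, windows=((-600, 1500), (40, 400), (-160, 240))):
    """Stack three HU windows, given as (level, width) pairs, into an
    RGB image. The default triplet is hypothetical; the paper selects
    windows to highlight fibrotic regions."""
    hu = np.asarray(hu, dtype=float)
    chans = []
    for level, width in windows:
        lo, hi = level - width / 2.0, level + width / 2.0
        c = (np.clip(hu, lo, hi) - lo) / (hi - lo)
        chans.append((c * 255).astype(np.uint8))
    return np.stack(chans, axis=-1)

rgb = rgb_windowing(np.full((512, 512), -600.0))
```

Each channel then emphasizes a different tissue range of the same slice, giving the backbone complementary views in one 3-channel input.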
Table 6 compares Slider’s performance with and without RGB windowing. Without RGB windowing, performance drops markedly across most metrics: the AUROC decreases from 0.847 to 0.808, the AUPRC from 0.790 to 0.762, and specificity from 0.840 to 0.446, indicating a sharp increase in false positives. Although recall increases from 0.730 to 0.919 because the model predicts the positive class far more often, this comes at the cost of reduced precision (0.750 to 0.523) and overall accuracy (0.796 to 0.634).
These results demonstrate that RGB windowing yields more balanced and robust diagnostic performance by enhancing the visibility of fibrotic regions while preserving discriminative power for both positive and negative classes in PF-ILD identification.
The impact of using representative slices. We support the use of RSs for PF-ILD identification because they capture anatomically consistent lung parenchyma regions determined by PMs. To evaluate their effectiveness, we compare the proposed Slider model using RSs with randomly selected slices across three runs with different random seeds, as shown in
Table 7.
The results indicate a clear and consistent advantage when using RSs. Compared to random slices, RSs improve the AUROC from 0.817 to 0.847 and the AUPRC from 0.739 to 0.790, reflecting stronger overall discrimination and more reliable positive class predictions. Similarly, the F1-Score increases from 0.689 to 0.740. The specificity also increases from 0.744 to 0.840, indicating that RSs help reduce false positives. These improvements demonstrate that PM-guided RS selection not only enhances sensitivity to disease-relevant regions but also minimizes noise from non-informative slices, leading to more accurate and robust PF-ILD identification with Slider.
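One plausible reading of PM-guided RS selection is slices evenly spaced between the upper and lower PM; the sketch below is an assumption for illustration, not the paper's exact rule.

```python
import numpy as np

def representative_slices(upper_pm, lower_pm, n_rs=5):
    """Pick n_rs slice indices evenly spaced between the upper and
    lower PM indices (hypothetical selection rule)."""
    return np.linspace(upper_pm, lower_pm, n_rs).round().astype(int)

idx = representative_slices(12, 48, n_rs=5)
```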
Effect of kernel shape in Slider. To assess the impact of incorporating slice-level information in Slider for PF-ILD identification, we evaluated different 3D convolution kernel shapes (Table 8). The results show that slice-level modeling plays a crucial role in achieving strong diagnostic performance. When the kernel captures only spatial context without inter-slice information, the AUROC drops to 0.797 and the AUPRC to 0.718, indicating reduced discriminatory ability. Conversely, purely slice-wise kernels without spatial aggregation (our default setting) achieve the best results, with an AUROC of 0.847 and an AUPRC of 0.790, suggesting that inter-slice context is more critical than additional spatial filtering for PF-ILD identification.
The kernel that combines both spatial and slice-level information yields a competitive AUROC (0.843) but underperforms compared to the purely slice-wise kernel, possibly because fine-grained slice-level patterns relevant to disease progression are over-smoothed. The kernel lacking both spatial and slice context performs worst, confirming that contextual cues, particularly along the slice dimension, are indispensable for PF-ILD identification with Slider.
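To make the capacity differences among kernel shapes concrete, the helper below counts 3D-convolution parameters for illustrative (depth, height, width) kernels; the exact sizes compared in Table 8 are not reproduced in this excerpt.

```python
def conv3d_param_count(c_in, c_out, kernel):
    """Parameter count of a 3D convolution (bias included) for a given
    (kd, kh, kw) kernel; the shapes below are illustrative."""
    kd, kh, kw = kernel
    return c_out * (c_in * kd * kh * kw + 1)

slice_wise = conv3d_param_count(64, 64, (3, 1, 1))  # inter-slice context only
spatial    = conv3d_param_count(64, 64, (1, 3, 3))  # in-plane context only
pointwise  = conv3d_param_count(64, 64, (1, 1, 1))  # no context
```

A slice-wise kernel is far cheaper than an in-plane one at the same channel width, consistent with favoring inter-slice context when parameters are constrained.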