This section presents the outcomes of the proposed segmentation framework, emphasizing the role of annotation refinement, anatomically aware augmentations, and empirical model performance. To ensure robust evaluation, various configurations and pathological classes are considered. The refined annotations serve as the foundation for improved training signals, ultimately contributing to more precise and consistent model predictions.
4.1. Annotation Refinement and Label Quality Enhancement
The quality and granularity of annotations play a pivotal role in the training and evaluation of deep learning-based medical segmentation models. The original NIH Chest X-ray dataset used in this research provides only image-level labels without pixel-wise delineation of abnormalities. These coarse annotations are insufficient for supervised semantic segmentation, particularly when the objective is to accurately localize and classify overlapping thoracic pathologies. To address this limitation, a multi-stage annotation refinement protocol was employed to generate clinically validated segmentation masks. This process involved expert radiologists using OncoDocAI (ai.oncodoc.id), a web-based annotation platform that supports pixel-wise, multi-label correction. The platform enables precise boundary marking and the assignment of multiple overlapping labels within the same region, improving both spatial accuracy and label specificity. An example of the annotation interface and its multi-label capabilities is shown in
Figure 2.
Initially, a subset of 1061 frontal-view chest X-ray images was extracted from the NIH repository. These images covered nine major thoracic pathology classes, namely Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, and Consolidation, and were prioritized for expert annotation and mask refinement. This subset enabled controlled evaluation of annotation accuracy and inter-observer agreement. Building on this, a larger curated dataset of 2152 images was compiled, incorporating both the refined subset and additional samples that met quality and class distribution criteria. The complete dataset was divided into 1932 training and 220 validation samples. Each image underwent pixel-level annotation correction to ensure multi-label segmentation fidelity. The distribution of abnormality classes across both sets is detailed in
Table 2, while
Table 3 provides further breakdowns of class-specific label statistics. These curated labels serve as a reliable foundation for training segmentation models capable of capturing co-occurring pulmonary abnormalities.
To address the pronounced class imbalance and expand the dataset’s diversity, a targeted data augmentation strategy was implemented. In the final curated dataset, 2152 annotated chest X-ray images were augmented using geometric transformations, including discrete-angle rotations (±5°, ±10°) and scaling. These operations were selectively applied to underrepresented pathology classes such as Nodule, Mass, and Pneumothorax, thereby enriching the training set with varied yet clinically plausible representations. Importantly, the augmentation pipeline preserved the integrity of the pixel-level segmentation masks, ensuring that multi-label anatomical structures remained correctly aligned with their corresponding abnormalities. Following augmentation, the training dataset increased to 5556 samples, with augmented samples distributed proportionally to mitigate label sparsity. The augmentation details and parameter settings are summarized in
Table 4, while
Table 5 illustrates the resulting increase in image count and per-class label occurrences. This strategy not only enhanced generalization capacity but also ensured that less frequent classes contributed meaningfully to the training process.
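One way to keep a segmentation mask aligned with its rotated image is to apply the same rotation to every mask vertex. The sketch below assumes that masks are stored as polygons in pixel coordinates (x to the right, y downward) and that the image is rotated about its center without resizing the canvas. The names `rotate_polygon` and `all_inside` and the example polygon are illustrative and are not taken from the authors' pipeline.

```python
import math

def rotate_polygon(points, angle_deg, width, height):
    """Rotate (x, y) pixel vertices about the image center.

    Assumes image coordinates (x right, y down). A positive angle turns the
    content counter-clockwise as it appears on screen.
    """
    cx, cy = width / 2.0, height / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # y grows downward, so the sine terms carry the opposite sign
        # compared with the textbook (y-up) rotation matrix.
        rotated.append((cx + dx * cos_t + dy * sin_t,
                        cy - dx * sin_t + dy * cos_t))
    return rotated

def all_inside(points, width, height):
    """True if every vertex still lies within the image frame."""
    return all(0 <= x <= width and 0 <= y <= height for x, y in points)

# Hypothetical square lesion mask on a 1024 x 1024 image.
square = [(100.0, 100.0), (200.0, 100.0), (200.0, 200.0), (100.0, 200.0)]
for angle in (-10, -5, 5, 10):  # the discrete angles named in the text
    moved = rotate_polygon(square, angle, 1024, 1024)
    print(angle, all_inside(moved, 1024, 1024))
```

A check such as `all_inside` is one simple way to flag masks that a rotation pushes partly out of frame, which is more likely at ±10° than at ±5° for lesions near the image border.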
To ensure high-fidelity segmentation masks essential for training clinically reliable models, the annotation process incorporated expert pixel-level labeling. Unlike the original NIH image-level tags, which often fail to represent the actual extent and co-localization of thoracic abnormalities, the refined labels were manually delineated by experienced radiologists using the OncoDocAI platform. This approach enabled the generation of anatomically accurate, pixel-precise masks that reflect the true spatial distribution and morphology of pulmonary abnormalities. Importantly, the expert annotations captured multi-label complexity; cases where multiple conditions such as Effusion and Consolidation co-occurred within a single image were explicitly labeled without ambiguity.
Table 6 and
Table 7 illustrate the difference between original NIH labels and the enriched, multi-label annotations produced during this refinement phase. This enhancement not only improves segmentation realism but also provides the necessary granularity for training models capable of robust multi-class and multi-region inference.
4.2. Segmentation Performance and Comparative Evaluation
This section presents a detailed analysis of the semantic segmentation performance of the NCT-CXR framework under four experimental configurations that varied the discrete rotation component of the augmentation strategy. All four models were trained using a shared base augmentation pipeline, which included intensity-based (brightness, contrast, Gaussian noise) and geometric (scaling, translation) transformations validated for anatomical plausibility. The only variable across the models was the inclusion and magnitude of discrete-angle rotation, enabling controlled assessment of how spatial perturbations affect segmentation outcomes.
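The ablation design can be summarized as a small configuration table in which only the rotation angles differ. The sketch below is a hypothetical representation: the class name `AugmentationConfig` and its fields are assumptions, the model numbering follows the comparison reported later in this section, and the parameter ranges of the base transformations are omitted because the text does not state them.

```python
from dataclasses import dataclass

# Base pipeline shared by all configurations, as named in the text.
BASE_TRANSFORMS = ("brightness", "contrast", "gaussian_noise",
                   "scaling", "translation")

@dataclass(frozen=True)
class AugmentationConfig:
    name: str
    rotation_angles: tuple  # discrete angles in degrees; empty = no rotation
    base_transforms: tuple = BASE_TRANSFORMS

CONFIGS = [
    AugmentationConfig("Model 1", ()),                 # baseline, no rotation
    AugmentationConfig("Model 2", (-10, 10)),
    AugmentationConfig("Model 3", (-5, 5)),
    AugmentationConfig("Model 4", (-10, -5, 5, 10)),   # mixed rotations
]

# Sanity check for the ablation: the base pipeline must be identical,
# so that rotation is the only factor that varies.
only_rotation_varies = len({c.base_transforms for c in CONFIGS}) == 1
print(only_rotation_varies)
```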
The purpose of this ablation design was to evaluate whether small-angle geometric rotation, when applied in conjunction with clinically constrained augmentation, contributes to measurable gains in segmentation precision and generalization. Previous studies often apply compound augmentations without considering their impact on anatomical alignment or label consistency. In contrast, our approach isolates the effect of rotation magnitude within a robust, expert-validated augmentation framework. This ensures that any observed performance differences can be attributed primarily to rotation, rather than to uncontrolled variation in other augmentation parameters. As illustrated in
Figure 3, applying ±10° rotations introduces noticeable shifts in the spatial distribution of segmented pathologies, such as infiltration, effusion, and nodules, demonstrating that larger angular deviations can alter anatomical context and impact the alignment between pathological features and image landmarks.
To assess the balance between anatomical label fidelity and augmentation effectiveness, we further analyzed the model’s performance under the ±5° rotation component, within a broader augmentation strategy that also included translation, scaling, brightness adjustment, and Gaussian noise. This mild geometric transformation preserved anatomical relationships while introducing sufficient spatial variability, leading to measurable improvements in segmentation quality. As shown in
Figure 4, the augmented images maintain clear boundary alignment and anatomical realism, particularly in regions affected by multi-label pathologies. Compared to both the unaugmented baseline and the more extreme ±10° augmentation, the ±5° approach yielded smoother contours, improved lesion localization, and enhanced generalization across diverse patient presentations.
These results underscore the nuanced impact of augmentation magnitude: while ±10° rotations contribute broader positional diversity, they carry a higher risk of annotation drift. In contrast, ±5° rotations offer controlled variability that strengthens the model’s sensitivity to subtle abnormalities without sacrificing spatial coherence. The optimal augmentation strategy therefore depends on clinical priorities (robustness to extreme imaging conditions versus precision in fine-grained abnormality detection) and may benefit from a hybrid approach that combines both rotation levels in training.
The results of the chest X-ray segmentation experiments highlight the significant influence of both data augmentation strategies and hyperparameter tuning on the performance of the YOLOv8 model.
Figure 5 compares the performance of four model variants: Model 1 (baseline without the rotation component), Model 2 (trained with ±10° rotation), Model 3 (trained with ±5° rotation), and Model 4 (trained with mixed ±5° and ±10° rotations). All models shared a consistent base augmentation pipeline that included brightness, contrast, Gaussian noise, translation, and scaling, ensuring that only the effect of rotation was varied across configurations. Among these, Models 2 and 3 yielded markedly higher precision scores of 0.519 and 0.517, respectively, compared to Model 1 (0.346) and Model 4 (0.180). This improvement in precision is especially meaningful in clinical contexts, where reducing false positives can minimize unnecessary diagnostic procedures and alleviate patient anxiety.
Recall measures the proportion of actual positive cases correctly identified and is especially critical in clinical imaging where missed diagnoses can have serious consequences.
Table 8 shows that Model 4 (mixed ±5° and ±10° rotations) achieved the highest overall recall (0.2610), outperforming the baseline (0.2130). While this improvement reflects greater sensitivity to thoracic abnormalities, it is accompanied by a trade-off in precision (as seen in
Table 9), which is a common challenge in high-sensitivity systems. Class-specific analysis reveals that Model 3 (±5° rotation) showed relatively strong recall for Infiltration and Effusion, while Model 2 performed better on Cardiomegaly and Atelectasis. These patterns suggest that different rotation magnitudes help the model generalize across anatomical variations. However, the limited recall for Pneumonia and Nodule, despite augmentation, underscores the need for more advanced sampling strategies or focal loss functions to address underrepresented classes. The recall analysis affirms that augmentation strategies contribute positively, but their gains must be balanced with class imbalance mitigation and clinical risk tolerance.
The precision values across model configurations, as shown in
Table 9, provide insight into the model’s ability to avoid false positives. Among all classes, Pneumothorax detection achieved the highest precision, with Model 2 (±10° rotation) and Model 3 (±5° rotation) reaching 0.829 and 0.804, respectively. These values are clinically significant, as pneumothorax often presents as a well-demarcated pathology, making it more amenable to precise segmentation with minimal misclassification. In contrast, other conditions exhibited lower and more variable precision scores, particularly Infiltration and Pneumonia. Infiltration detection suffered from consistently poor performance, which is expected given its diffuse, low-contrast appearance and high inter-observer variability, even among radiologists. The variation in class-wise precision highlights a central issue in CXR segmentation: augmentation improves precision for clearly defined abnormalities but is insufficient for those with ambiguous boundaries. These findings indicate that augmentation alone cannot address all sources of error and should be complemented by structural priors or context-aware modeling for better pathology-specific precision.
The F1-scores across the four model configurations, as presented in
Table 10, reflect the balance between precision and recall achieved for each class of thoracic abnormality. Overall, Model 4, which incorporated a combined augmentation strategy using both ±10° and ±5° discrete rotations, achieved the highest mean F1-score (0.3840), outperforming the baseline Model 1 (0.2637), Model 2 (0.2760), and Model 3 (0.2513). While these values may appear modest when compared to F1-scores in single-label or high-resolution segmentation tasks, they are consistent with prior studies addressing multi-label pixel-wise segmentation under noisy and imbalanced datasets. The class-specific analysis reveals that Model 2 performed best for pneumothorax, achieving an F1-score of 0.5442, which demonstrates a clinically meaningful balance between precision and recall for this well-defined pathology. Similarly, Model 4 showed notable improvements in detecting effusion (0.4320) and atelectasis (0.4120), further supporting the effectiveness of moderate augmentation in preserving spatial fidelity during training. In contrast, performance for pneumonia and especially infiltration remained low across all configurations. The F1-score for infiltration detection was 0.0000 in every model, highlighting the substantial difficulty in segmenting diffuse abnormalities with ill-defined boundaries and scarce training examples. Pneumonia also yielded poor results, with the best F1-score reaching only 0.0980 (Model 4). These findings align with known challenges in the CXR segmentation literature, where subtle and overlapping abnormalities are notoriously difficult to model using conventional augmentation.
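For reference, the per-class F1-score is the harmonic mean of precision and recall, and a class with no true positives receives an F1-score of zero. The sketch below assumes that true positives, false positives, and false negatives have already been counted at a fixed confidence and IoU threshold. The function names, the convention of returning 0.0 for undefined ratios, and the example counts are assumptions for illustration and are not the study's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts.

    Undefined ratios (zero denominators) are reported as 0.0, which is a
    common convention and an assumption here.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_f1(counts_by_class):
    """Unweighted mean of the per-class F1-scores."""
    scores = [precision_recall_f1(*c)[2] for c in counts_by_class.values()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical (tp, fp, fn) counts for two classes.
counts = {
    "well_defined_class": (8, 2, 8),   # some correct detections
    "diffuse_class": (0, 3, 5),        # no true positives -> F1 of 0.0
}
for name, c in counts.items():
    print(name, precision_recall_f1(*c))
print("macro F1:", macro_f1(counts))
```

Because the mean F1-score averages per-class values, it generally differs from the F1-score computed from the overall precision and overall recall.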
The mAP@0.5 values across the four model configurations highlight the model’s ability to accurately localize and detect thoracic abnormalities at an intersection over union threshold of 0.5. As shown in
Table 11, Model 3, which was trained with discrete rotations of 5 degrees in both directions, achieved the highest overall mAP@0.5 score of 0.2800. This indicates that mild rotational augmentation provided the most effective enhancement in spatial localization and detection precision. Model 2, which used 10-degree rotations, followed closely with a mAP@0.5 of 0.2520, reflecting the benefit of moderate variation in improving the model’s adaptability. Interestingly, Model 4, which combined both 5-degree and 10-degree rotations, demonstrated a slight reduction in performance with a mAP@0.5 of 0.2150, only marginally higher than the baseline Model 1, which scored 0.2020. This suggests that while discrete augmentation at individual angles introduces helpful variability, excessive or combined rotations may lead to inconsistencies in spatial patterns, thereby affecting the model’s localization capability. These results emphasize that the effectiveness of augmentation strategies depends not only on increasing variability but also on maintaining anatomical consistency, which is critical for precise detection and segmentation in clinical imaging tasks.
The mAP@0.5:0.95 values across the four model configurations indicate the model’s performance over a range of intersection over union thresholds, from 0.5 to 0.95, providing a more comprehensive evaluation of localization precision across varying degrees of overlap. As presented in
Table 12, Model 2, which utilized discrete rotations of 10 degrees in both directions, achieved the highest overall mAP@0.5:0.95 score of 0.1510. This model outperformed the baseline Model 1, which scored 0.1110, as well as Models 3 and 4, suggesting that moderate rotational augmentation was particularly effective in enhancing detection robustness for classes with complex positional and spatial variability. The improved performance at higher IoU thresholds indicates better bounding box alignment with ground truth annotations, highlighting the potential of rotation-based augmentation to support fine-grained localization accuracy in chest X-ray segmentation tasks.
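The difference between the two mAP variants lies in the IoU threshold used to decide whether a prediction matches the ground truth. The following sketch illustrates only that matching criterion for binary masks; it is not a full average-precision computation, which additionally ranks predictions by confidence and integrates the precision-recall curve. The mask representation (nested lists of 0 and 1) and the function names are assumptions made for illustration.

```python
def mask_iou(pred, truth):
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0

# Thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]

# Toy example: a prediction shifted by one pixel column from the truth.
truth = [[1, 1, 1, 0],
         [1, 1, 1, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 1, 1]]
iou = mask_iou(pred, truth)
print(iou, thresholds_passed(iou))
```

In this toy case the prediction counts as a match at the 0.5 threshold only, so it would contribute to mAP@0.5 but fail at every stricter threshold that enters mAP@0.5:0.95.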
The results of chest X-ray segmentation using the enhanced NCT-CXR framework are presented in
Figure 6, with
Figure 6a showing single-label segmentation outcomes and
Figure 6b displaying multi-label segmentation results. In
Figure 6a, the model demonstrates its effectiveness in detecting individual pathological conditions with high accuracy. The first image illustrates fibrosis segmentation, supported by strong confidence scores that confirm the reliability of the prediction. The second image highlights pneumonia detection, where the model clearly delineates the affected region. The third image shows accurate identification of nodules, further validated by high confidence values. In
Figure 6b, the model’s capability for multi-label segmentation is demonstrated through its successful identification of multiple co-occurring abnormalities within a single image. The first example shows the segmentation of both fibrosis and pneumonia, while the second image depicts fibrosis and pneumothorax. The third example highlights the detection of effusion and pneumothorax, each with clearly defined masks and confidence scores. These outcomes reflect not only the efficacy of the YOLOv8 segmentation backbone but also the contribution of the compound, spatially aware augmentation strategy in supporting multi-label generalization across diverse thoracic conditions.
4.3. Statistical Evaluation
The performance evaluation phase incorporated statistical analyses to assess the significance of the observed differences across model configurations. Given the relatively small sample size, the potentially non-normal distribution of the performance metrics, and the presence of outliers, non-parametric statistical tests were employed. Specifically, the Kruskal–Wallis test was used to assess whether significant differences existed among the four model configurations: (1) the baseline model without the rotation component, (2) the model with discrete rotations at (−10°, +10°), (3) the model with discrete rotations at (−5°, +5°), and (4) the mixed rotation model. This analysis was conducted across all key performance metrics (precision, recall, F1-score, mAP@0.5, and mAP@0.5:0.95) using a significance level of 0.05.
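To make the test concrete, the following is a minimal pure-Python sketch of the Kruskal–Wallis H statistic with tie correction. The text does not state which software was used, so this is not the authors' implementation. It assumes each configuration contributes one sample of metric values (for example, one value per class), uses the chi-square approximation with three degrees of freedom (valid only for four groups, and only approximate for small samples), and runs on illustrative numbers that are not the study's data.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 d.o.f."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Illustrative metric samples for four configurations (hypothetical values).
samples = [
    [0.31, 0.40, 0.28, 0.35],
    [0.52, 0.61, 0.47, 0.55],
    [0.50, 0.58, 0.49, 0.53],
    [0.15, 0.22, 0.18, 0.20],
]
h_stat = kruskal_wallis_h(samples)
p_value = chi2_sf_df3(h_stat)
print(h_stat, p_value, p_value < 0.05)
```

A significant result only indicates that at least one configuration differs; identifying which pairs differ requires a post hoc procedure such as the Nemenyi test.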
Following a significant Kruskal–Wallis result for precision, as shown in
Table 13, a Nemenyi post hoc test was performed to identify specific pairwise differences.
Table 14 presents the outcomes of this analysis, revealing statistically significant differences between Model 2 (discrete rotations at (−10°, +10°)) and Model 4 (mixed rotation) (
p = 0.005602), as well as between Model 3 (discrete rotations at (−5°, +5°)) and Model 4 (
p = 0.013806). The baseline Model 1 did not differ significantly from any of the other models; its comparison with Model 4 also remained above the 0.05 threshold (
p = 0.153177). Furthermore, Model 2 and Model 3 showed no significant difference in precision performance (
p = 0.992827), indicating that both moderate-angle rotation strategies yielded comparable precision improvements.
These statistical findings suggest that the rotation configuration used in data augmentation influences model precision. The single-magnitude rotation strategies (Models 2 and 3) achieved significantly higher precision than the mixed rotation strategy (Model 4), possibly because a single rotation magnitude preserves anatomical structure more consistently during augmentation. Neither differed significantly from the baseline Model 1, however, so the precision gains over the baseline reported in Section 4.2 are not statistically confirmed. For the other evaluation metrics, no significant differences were detected; given the small sample size, this means the data provide no evidence of a difference in recall, F1-score, or mAP across models, rather than demonstrating that performance is equivalent.