Article

Boundary-Enhanced YOLO-Based Instance Segmentation with Background-Only Negative Samples for Three-Level Scoliosis Severity Screening in Whole-Spine Radiography

1 Machine Intelligence Convergence System, Seongnam-si 13135, Gyeonggi-do, Republic of Korea
2 Department of Medical Artificial Intelligence, Eulji University, Seongnam-si 13135, Gyeonggi-do, Republic of Korea
3 Department of Radiological Science and Medical Artificial Intelligent, Eulji University, Seongnam-si 13135, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5492; https://doi.org/10.3390/app16115492
Submission received: 24 April 2026 / Revised: 23 May 2026 / Accepted: 26 May 2026 / Published: 1 June 2026

Abstract

Clinical evaluation of scoliosis primarily relies on the Cobb angle measured on standing whole-spine radiographs. However, manual measurement is affected by intra- and inter-observer variability caused by differences in end-vertebra selection, endplate definition, and vertebral boundary interpretation. In addition, low radiographic contrast and anatomical overlap can hinder accurate identification of the spinal contour. In clinical screening, rapid three-level severity classification with reduced false negatives complements precise quantitative measurement by supporting case triage and missed-detection prevention. This study proposes a boundary-enhanced YOLO-based instance segmentation framework for three-level scoliosis severity screening using clinician-assigned severity labels derived from Cobb angle measurements. Here, ‘boundary-enhanced’ refers to the reinforcement of spinal contour boundary representation through the DeepLabV3+-based segmentation head. Unlike semantic segmentation, which may cause class fragmentation within a single spine, the proposed method defines the entire spine as one anatomical instance and predicts a single severity label based on the global contour structure. Class-balanced offline augmentation, background-only negative samples, attention modules, and segmentation heads were comparatively evaluated. Results showed that background-only negative samples reduced false negatives, and CBAM improved accuracy while maintaining a practical model size and near-real-time inference speed under the tested environment. DeepLabV3+ provided the most stable contour reconstruction. The final model improved both contour extraction and three-level severity screening performance, suggesting that the proposed framework may be useful for assisting scoliosis screening. However, further external validation and prospective evaluation are required before clinical deployment.

1. Introduction

Scoliosis is a complex anatomical deformity characterized by lateral curvature and axial rotation of the spine, and adolescent idiopathic scoliosis (AIS) is the most commonly encountered form in clinical practice [1]. AIS is generally defined as a spinal curvature of unknown etiology that occurs in adolescents between 10 and 18 years of age and is diagnosed when the Cobb angle measured on standing whole-spine radiographs is 10° or greater. The prevalence of AIS in the adolescent population has been reported to be approximately 1–3%, and because the curvature may progress during growth, early screening and continuous follow-up are essential [2,3].
The clinical evaluation of scoliosis can be divided into primary screening and subsequent image-based quantitative assessment [4]. In the primary screening stage, the forward bend test and scoliometer may be used. Definitive diagnosis and severity assessment, however, rely on radiographic measurement of the Cobb angle. The Cobb angle is defined as the angle formed by two lines parallel to the endplates of the most tilted superior and inferior end vertebrae on standing whole-spine radiographs. It is a key metric throughout scoliosis management, including diagnosis, assessment of progression, evaluation of brace treatment indications, and surgical planning. Furthermore, standing whole-spine radiography serves as the fundamental imaging modality for scoliosis evaluation because it provides information not only on curve magnitude but also on the overall morphology and alignment of the spinal deformity [5].
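The geometric definition above can be made concrete with a short sketch. It assumes the two end vertebrae have already been selected and that each endplate is given as a pair of (x, y) points; the function names and coordinates are hypothetical and only illustrate the angle computation, not any measurement procedure used in this study.

```python
import math

def line_angle_deg(p1, p2):
    """Orientation of the line through p1 and p2, in degrees within [0, 180)."""
    (x1, y1), (x2, y2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

def cobb_angle_deg(upper_endplate, lower_endplate):
    """Acute angle between two endplate lines, each given as a pair of (x, y) points."""
    diff = abs(line_angle_deg(*upper_endplate) - line_angle_deg(*lower_endplate))
    return min(diff, 180.0 - diff)

# Hypothetical endplate coordinates in pixels (illustrative values only).
upper = ((100.0, 200.0), (180.0, 186.0))
lower = ((110.0, 520.0), (190.0, 541.0))
print(f"Cobb angle: {cobb_angle_deg(upper, lower):.1f} degrees")
```

Because the result depends entirely on which endplates are chosen and where the points are placed, small changes in those inputs translate directly into changes in the reported angle.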
However, Cobb angle measurement is inherently observer-dependent and vulnerable to multiple sources of error during image interpretation. Reliable identification of the spinal contour and overall spinal morphology substantially affects the reliability of quantitative evaluation in scoliosis diagnosis [6,7]. Specifically, end-vertebra selection, endplate line placement, and interpretation of vertebral boundaries and corners depend heavily on the clinician’s experience and judgment. These tasks are particularly challenging for less experienced readers because delineation of spinal boundaries is hindered by the low contrast of radiographs and the overlap of anatomical structures. Previous studies have reported intra- and inter-observer variability of approximately 4–8° in Cobb angle measurement, and recent automated measurement studies have also reported manual measurement error ranges of approximately 3–10° [6,7,8]. Moreover, patient posture is itself an important source of variation: differences in curve magnitude have been observed between standing and recumbent or supine images, and prior studies have shown that the standing Cobb angle may be approximately 7–10° greater than the supine Cobb angle [9,10]. These findings underscore an inherent limitation of scoliosis diagnosis, in which a difference of only a few degrees may alter the diagnostic category or treatment decision.
To address these limitations, artificial intelligence (AI), particularly deep learning-based medical image analysis, has increasingly been applied to scoliosis assessment. Many previous studies have focused on automated Cobb angle measurement using vertebral detection, landmark estimation, end-vertebra selection, or endplate orientation analysis [8,11,12,13,14]. Although these approaches are valuable for quantitative assessment, clinical screening also requires rapid severity stratification and false-negative reduction. Therefore, this study focuses on three-level severity screening based on clinician-assigned labels derived from Cobb angle criteria, rather than direct Cobb angle estimation. In this context, morphology-preserving segmentation remains important because the overall spinal contour provides structural information relevant to scoliosis severity [8,15,16,17].
However, when scoliosis severity classification is performed within a segmentation framework, semantic segmentation may produce a class fragmentation problem if severity is learned as a pixel-level class [18,19]. In such cases, different regions within a single spinal object may be assigned different severity classes, because semantic segmentation inherently treats each pixel independently and cannot enforce a single consistent label across an entire object instance [18]. In this context, the YOLO (You Only Look Once) family offers a practical architecture for scoliosis screening. YOLO was originally proposed as a single-stage framework that predicts object location and class labels end-to-end within a single network, and it has been widely adopted in real-world applications that require rapid inference and relatively low computational burden [20]. More recently, YOLO has been extended from object detection to instance segmentation, enabling simultaneous prediction of object-specific masks and classes. YOLO-based medical imaging studies have also demonstrated strong performance and rapid processing speed for tumor detection, lesion detection, and segmentation tasks, making this family a suitable candidate for clinical decision-support systems [21,22]. Because standing whole-spine radiographs typically contain one principal spinal structure, there is little need for a panoptic segmentation framework that jointly handles multiple object categories and background classes. Instead, it is more appropriate to define the entire spine as a single instance and simultaneously predict its mask and severity class using a YOLO-based instance segmentation architecture that balances both accuracy and efficiency [14].
Accordingly, this study proposes a YOLO-based instance segmentation framework for three-level scoliosis severity screening (normal, mild, and severe) using standing whole-spine radiographs. By learning clinician-assigned severity labels derived from Cobb angle measurements, the proposed method may provide a useful framework for assisting clinical screening, although further external validation and prospective evaluation are required to confirm its clinical applicability.
Specifically, this study has three aims: (1) to reformulate scoliosis severity classification from a pixel-level semantic segmentation problem into a whole-spine instance-level structural classification problem; (2) to improve data-level reliability by applying class-balanced augmentation and background-only negative samples to mitigate class imbalance and augmentation-induced false negatives; and (3) to evaluate whether previously established attention modules and segmentation heads can improve spinal contour restoration and severity classification when integrated into a YOLO-based instance segmentation framework.
The contribution of this study therefore lies not in a fundamentally new network architecture or algorithmic module, but in the task-specific integration and empirical evaluation of a dataset reconfiguration strategy and architectural adaptations for screening-oriented scoliosis assessment. The contributions are summarized as follows:
  • First, we reformulated scoliosis severity screening from a pixel-level classification problem into a whole-spine instance-level structural classification problem.
  • Second, as a dataset reconfiguration strategy, we introduced background-only negative samples to mitigate data imbalance and reduce false negatives, an empirical approach validated here specifically for standing whole-spine radiography.
  • Third, we experimentally evaluated the task-specific integration of CBAM and a DeepLabV3+-based segmentation head within a YOLO-based instance segmentation framework. The results suggest that this configuration can improve spinal contour representation and screening-oriented severity classification performance in standing whole-spine radiographs.
The remainder of this paper is organized as follows. Section 2 reviews studies related to scoliosis diagnosis, automated Cobb angle measurement, medical image segmentation, YOLO-based instance segmentation, and attention modules. Section 3 describes the dataset, spinal contour annotation, class-balanced augmentation, construction of background-only negative samples, the YOLO-based segmentation architecture, improvements to attention modules and segmentation heads, and the training conditions. Section 4 presents comparisons between semantic and instance segmentation, the effects of augmentation and background-only negative samples, comparisons of attention modules and segmentation heads, and the ablation results and class-wise screening performance of the final proposed model. Section 5 discusses the clinical implications of the findings, the limitations of the study, and potential extensions toward automated Cobb angle measurement. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Deep Learning-Based Scoliosis Analysis and Cobb Angle Estimation

Recent advances in deep learning have substantially improved performance across a wide range of medical image analysis tasks, including classification, detection, segmentation, and quantitative measurement, and have accelerated research on automated Cobb angle measurement and spinal structure analysis in scoliosis imaging. Because the Cobb angle remains the principal quantitative metric for scoliosis assessment, many previous AI studies have focused on identifying end vertebrae on radiographs and automatically estimating the Cobb angle from vertebral landmarks or segmentation outputs. Zhu et al. analyzed 50 studies in a systematic review and meta-analysis of deep learning-based automated Cobb angle measurement from radiographs and reported an overall circular mean absolute error (CMAE) of 2.99 based on 17 studies included in the meta-analysis [8]. They further reported a CMAE of 2.40 for segmentation-based methods and 3.31 for landmark-based methods, while also highlighting inter-study heterogeneity and the need for standardization prior to clinical deployment.
Maeda et al. proposed a CNN-based method for automated vertebral detection and Cobb angle measurement using 1021 full-length standing whole-spine radiographs from patients with adolescent idiopathic scoliosis [13]. Their study compared AI-derived measurements with those obtained by six physicians and reported an intraclass correlation coefficient (ICC) of 0.973 for the major curve in the standing position, thereby demonstrating the clinical feasibility of automated whole-spine radiographic analysis. Their method employed a multi-stage pipeline consisting of region-of-interest detection, feature point detection for 17 vertebral bodies, and subsequent Cobb angle calculation, ultimately deriving major and minor curves from the corner points of individual vertebrae.
More recently, Li et al. further validated an automated measurement algorithm against assessments from multiple spine specialists, reporting clinically acceptable agreement and supporting the feasibility of AI-assisted quantitative scoliosis evaluation [15]. İlkhan et al. proposed an integrated deep learning platform combining vertebra segmentation with automated Cobb angle calculation, representing a recent example of end-to-end segmentation-based clinical decision support for scoliosis [17].
Overall, prior studies have demonstrated the feasibility of automated Cobb angle measurement, but their objectives differ from that of the screening-oriented task addressed in this study. Automated measurement models generally require vertebral localization, landmark estimation, end-vertebra selection, and angle calculation, whereas the present study focuses on clinician-assigned three-level severity labels. Accordingly, this study frames scoliosis assessment as an instance-level screening problem using the overall spinal contour as structural information for classifying normal, mild, and severe cases.

2.2. Semantic Segmentation and Segmentation Heads

The importance of segmentation has also been increasingly emphasized in automated scoliosis analysis. A meta-analysis of automated Cobb angle measurement reported that segmentation-based approaches yielded lower errors than landmark-based approaches, and many segmentation-based studies adopted U-Net-like architectures followed by minimum bounding rectangle analysis or vertebral corner estimation to compute the Cobb angle [23]. These findings suggest that explicitly modeling the morphology of the spine or vertebral bodies is important in scoliosis image analysis. In addition, the architecture of the segmentation head directly affects mask quality and boundary restoration performance. U-Net is a representative encoder–decoder architecture for precise localization in biomedical image segmentation [24,25]. DeepLabV3+ combines atrous convolution-based multi-scale contextual modeling with decoder-based boundary refinement, making it particularly suitable for elongated and thin anatomical structures with clinically important boundaries, such as the spinal contour [23,25,26,27].
Nevertheless, the direct application of semantic segmentation to scoliosis severity classification has inherent limitations. Semantic segmentation assigns a class label independently to each pixel, making it structurally suited to tasks where class membership is a local property of image regions [18]. However, severity labels such as normal, mild, and severe are not attributes of individual pixels or local regions; rather, they are object-level labels determined by the global curvature pattern and structural alignment of the entire spinal contour. This mismatch between pixel-level prediction and object-level labeling can cause a class fragmentation problem, in which a single spinal structure is partially predicted as normal and partially as mild or severe [19]. Instance segmentation, by contrast, assigns a single class label per detected object instance, making it structurally more appropriate for whole-spine severity classification [26].
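The class fragmentation problem described above can be illustrated with a small sketch. It assumes a semantic segmentation output stored as a 2D list of pixel labels with hypothetical ids (0 background, 1 normal, 2 mild, 3 severe); the fragmentation check and the majority-vote fallback are illustrative constructions, not procedures used in this study.

```python
from collections import Counter

BACKGROUND = 0  # assumed id for non-spine pixels (hypothetical labeling scheme)

def severity_classes_present(pred_mask):
    """Count the non-background class ids in a 2D list of pixel labels."""
    counts = Counter(label for row in pred_mask for label in row)
    counts.pop(BACKGROUND, None)
    return counts

def is_fragmented(pred_mask):
    """True when the predicted spine region carries more than one severity class."""
    return len(severity_classes_present(pred_mask)) > 1

def instance_label(pred_mask):
    """Collapse pixel labels to one label by majority vote (None if no spine pixels)."""
    counts = severity_classes_present(pred_mask)
    return counts.most_common(1)[0][0] if counts else None

# Toy prediction: the same spine is labeled partly normal (1) and partly mild (2).
toy_mask = [
    [0, 1, 1],
    [0, 2, 1],
    [0, 2, 0],
]
print(is_fragmented(toy_mask), instance_label(toy_mask))
```

A majority vote can force a single label after the fact, but the pixel-level model is still trained on an objective that does not match the object-level label, which is the mismatch that instance segmentation avoids by construction.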
To address these limitations, the present study adopts an instance segmentation-based approach in which the entire spine is defined as a single anatomical instance and its severity is classified on the basis of the global contour structure of that instance. At the same time, the advantages of established segmentation architectures are incorporated through the integration of advanced segmentation heads. The aim is not merely to increase architectural complexity, but rather to apply segmentation heads suited to scoliosis radiographs while preserving the lightweight and practical characteristics of a YOLO-based clinical screening model.

2.3. Object Detection and Instance Segmentation

Object detection is a core task in computer vision that simultaneously estimates the location and category of objects in an image and is generally categorized into two-stage and one-stage detectors. Two-stage methods first generate candidate regions and then perform classification and localization regression for each proposal. In contrast, the YOLO family reformulates object detection as a single regression problem and uses a single neural network to predict bounding boxes and class probabilities for the entire image in one stage [28]. Owing to this structural characteristic, YOLO has been widely adopted in real-world applications that require a balance between detection accuracy and inference speed.
More recently, YOLO has expanded beyond bounding-box detection to segmentation tasks, with instance segmentation enabling simultaneous prediction of object-specific masks or contours, class labels, and confidence scores. YOLO-based medical imaging studies have demonstrated strong performance and rapid processing speed in tasks such as tumor detection, lesion detection, lesion segmentation, and microscopic image analysis, thereby supporting its utility in clinical decision-support systems [20,21,22]. A 2025 systematic review of YOLO-based medical imaging studies likewise summarized the broad applicability of the YOLO family in medical imaging owing to its real-time efficiency and robust performance, while also noting limitations related to low-contrast images, insufficient annotation, generalization challenges, and missed detection of small objects [20].
In scoliosis screening based on standing whole-spine radiographs, a single major spinal structure is present in each image, and the diagnostic label is associated with the global contour of the spine. Therefore, rather than panoptic segmentation, which is designed to jointly handle multiple objects and background classes, an instance segmentation strategy that defines the entire spine as a single instance and jointly predicts its mask and severity class is more appropriate. Moreover, because practical clinical screening environments require rapid inference and low hardware burden for high-resolution radiographs, a YOLO-based one-stage segmentation architecture represents a more practical alternative than two-stage instance segmentation. On this basis, YOLO segmentation was selected as the baseline framework in this study and extended to improve both spinal contour preservation and severity classification.

2.4. Attention Modules and Segmentation Head Architectures

A considerable body of research has explored the application of attention modules to improve the performance of deep learning-based detection and segmentation models. Attention modules enhance feature representations by emphasizing task-relevant channels or spatial regions within input feature maps. CBAM (Convolutional Block Attention Module) is a lightweight attention module that sequentially applies channel and spatial attention to enhance feature representation, and can be integrated into existing CNN architectures with minimal structural modification and computational overhead [28].
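For reference, the sequential channel and spatial attention can be written as follows, using the formulation of the original CBAM paper. Here F is an input feature map, σ is the sigmoid function, MLP is a shared two-layer perceptron, f^{7×7} is a 7×7 convolution, and ⊗ denotes element-wise multiplication with broadcasting. How the module is placed within the YOLO backbone in this study is not specified by these equations.

```latex
\begin{aligned}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), & F' &= M_c(F) \otimes F,\\
M_s(F') &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big), & F'' &= M_s(F') \otimes F'.
\end{aligned}
```

The channel map M_c pools over spatial positions, whereas the spatial map M_s pools over channels, so the two steps reweight "which features" and "where" in turn.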
Transformer-based architectures have also demonstrated strong performance in vision tasks. Swin Transformer, a hierarchical Transformer based on shifted windows, has been shown to serve as a general-purpose backbone for image classification, object detection, and semantic segmentation [29]. It achieves computational efficiency by performing self-attention within non-overlapping local windows and compensates for limited inter-window interaction through shifted windowing. However, compared with lightweight CNN-based modules, Transformer-based attention mechanisms generally impose greater computational and memory demands. In screening environments that require both rapid inference and hardware efficiency, the trade-off between performance gain and computational overhead must therefore be carefully considered. In this study, CBAM and Swin Transformer blocks were each applied to the YOLO backbone and comparatively evaluated.
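The efficiency argument for window-based self-attention can be stated with the complexity estimates given in the Swin Transformer paper, for a feature map of h × w patches with C channels and a window of M × M patches:

```latex
\begin{aligned}
\Omega(\mathrm{MSA}) &= 4hwC^{2} + 2(hw)^{2}C,\\
\Omega(\text{W-MSA}) &= 4hwC^{2} + 2M^{2}hwC.
\end{aligned}
```

Global self-attention is quadratic in the number of patches hw, whereas the windowed variant is linear in hw for a fixed M. Even so, the shared 4hwC² term and the additional parameters remain, which is consistent with the higher cost relative to lightweight CNN-based modules noted above.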

2.5. Data Augmentation and Background-Only Negative Samples for Medical Images

In medical image AI research, it is often difficult to secure a sufficient quantity of high-quality annotated data, and class imbalance frequently arises because of disease prevalence characteristics. To alleviate these problems, various augmentation strategies have been explored, including rotation, translation, scaling, cropping, intensity variation, image mixing, and generative model-based synthesis. Kim and Bae reviewed data augmentation in medical image analysis and described how limited data availability and class imbalance can hinder deep learning performance, while also introducing approaches such as rotation, translation, intensity variation, generative adversarial network-based methods, and image property mixing [30].
Data augmentation has also been an important factor in improving generalization performance in the YOLO family. YOLOv4 improved the balance between speed and accuracy by combining several bag-of-freebies techniques, including Mosaic data augmentation, DropBlock regularization, and CIoU loss [31].
In medical imaging, augmentation must be applied cautiously so as not to distort diagnostically meaningful features. Based on the prior evidence summarized above, the present study applied standard geometric transformations reflecting clinically realistic radiographic variations, while preserving the morphological characteristics of the spinal contour. Detailed augmentation procedures are described in Section 3.3.
Previous studies on object detection have shown that hard-example or hard-negative learning can improve detector training by emphasizing difficult background or confusing negative samples [32,33,34]. In medical image analysis, hard-negative mining has also been used to improve feature representation, reduce training inefficiency, and suppress detection errors under clinically relevant non-target conditions [35,36]. Based on these findings, the present study incorporated background-only negative samples from standing whole-spine radiographs as empty-label images to help the model distinguish non-spinal regions from true spinal instances and mitigate augmentation-associated screening false negatives.
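As a minimal sketch of how empty-label images can be prepared, the snippet below writes one empty label file per background-only image. It assumes a YOLO-style dataset layout in which each image has a same-named `.txt` label file and a file with no annotation lines denotes an image without objects; the function name and file names are hypothetical.

```python
from pathlib import Path
import tempfile

def write_empty_labels(image_names, label_dir):
    """Create one empty .txt label per background-only image; return the created paths."""
    label_dir = Path(label_dir)
    label_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in image_names:
        label_path = label_dir / (Path(name).stem + ".txt")
        label_path.write_text("")  # zero annotation lines: no spinal instance in this image
        created.append(label_path)
    return created

# Usage with hypothetical file names, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    created = write_empty_labels(["background_0001.png", "background_0002.png"], tmp)
    print([p.name for p in created])
```

During training, any detection predicted on such an image counts as a false positive, which is what pushes the model to separate non-spinal regions from true spinal instances.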

3. Materials and Methods

The aim of this study was not to develop an automated measurement model that directly computes the Cobb angle, but rather to develop a three-level AI screening model that classifies cases as normal, mild, or severe scoliosis based on the overall spinal contour visible on standing whole-spine radiographs.

3.1. Dataset

This study used standing whole-spine radiographs obtained from Gyeonggi-do Medical Center Pocheon Hospital. All images were acquired for the diagnosis or follow-up of scoliosis, and each image contained a single principal spinal structure. The study protocol was approved by the Institutional Review Board of Eulji University (IRB No. EU24-77).
During the initial data screening process, radiographs with inconsistent Cobb angle measurements or unclear severity assignments were excluded before construction of the final dataset. Because this exclusion was performed prior to final dataset compilation, the number of excluded images was not separately recorded.
The final dataset used in this study comprised 8999 standing whole-spine radiographs. Each image was assigned to one of three severity classes (normal, mild, or severe) according to the Cobb angle measured by clinicians or documented in the corresponding radiology reports. The class definitions used in this study were as follows: normal (Cobb angle < 10°), mild (10° ≤ Cobb angle < 25°), and severe (Cobb angle ≥ 25°) [37].
To prevent data leakage and ensure unbiased model evaluation, the dataset was partitioned at the patient level. Each image corresponded to a unique patient, and no patient was included in more than one subset. The dataset was divided into training, validation, and test sets at a ratio of 6:2:2, respectively. The original dataset is summarized in Table 1.
The 10° and 25° class thresholds are consistent with widely used clinical criteria in scoliosis diagnosis and management, in which a Cobb angle of 10° defines the diagnostic threshold for scoliosis and 25° serves as a key inflection point guiding brace treatment decisions.
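The class definitions and the patient-level 6:2:2 partition described in this section can be sketched as follows. The class ids (0 normal, 1 mild, 2 severe) follow the label construction in Section 3.2; the shuffling procedure, the seed, and the function names are assumptions made for illustration, since the exact splitting procedure is not described.

```python
import random

def severity_class_id(cobb_deg):
    """Map a Cobb angle in degrees to a class id: 0 normal, 1 mild, 2 severe."""
    if cobb_deg < 10.0:
        return 0
    if cobb_deg < 25.0:
        return 1
    return 2

def split_patients(patient_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle unique patient ids and split them into train, validation, and test lists."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # assumed: a seeded random shuffle
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Hypothetical patient ids; each patient lands in exactly one subset.
train, val, test = split_patients([f"P{i:04d}" for i in range(10)])
print(len(train), len(val), len(test))
print([severity_class_id(a) for a in (9.9, 10.0, 24.9, 25.0)])
```

Splitting on patient ids rather than on images is what prevents leakage when a patient has more than one radiograph; in this dataset each image corresponds to a unique patient, so the two coincide.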

3.2. Spine Contour Annotation and Label Construction

The spine contour annotations used in this study were performed by two board-certified clinical specialists and reviewed by one professor of rehabilitation medicine. The two specialists independently delineated the outer boundaries of the spinal structure on all standing whole-spine radiographs. Cases showing disagreement between the two specialists regarding the spinal contour annotation were excluded before construction of the final dataset. In addition, radiographs with inconsistent Cobb angle measurements or unclear severity labels were also excluded during the pre-dataset screening stage. Because these exclusion procedures were conducted before final dataset compilation, the number of excluded cases was not separately recorded. After this screening process, segmentation masks of the spinal region were generated based on the accepted annotations. The annotations were subsequently reviewed and validated by the professor of rehabilitation medicine. The resulting spine contour masks were then converted into polygon annotations for YOLO segmentation training. Figure 1 presents examples of standing whole-spine radiographs and their corresponding segmentation masks.
For YOLO-based instance segmentation training, the spine contour masks were converted into polygon annotation format. First, the outer contour of the spine was extracted from each binary mask, and the closed contour was resampled into points using spline-based interpolation. To determine the appropriate number of polygon contour points, a preliminary quantitative comparison was conducted using three point settings: 240, 360, and 480 points. For each setting, YOLOv12-based instance segmentation was trained under identical conditions, and mAP@0.5 was evaluated on the validation set. The results are summarized in Table 2.
As shown in Table 2, the 480-point setting achieved the highest mAP@0.5 (0.835), whereas the 360-point setting yielded the lowest (0.828). The 240-point setting produced a result comparable to 480 points (0.834); however, visual inspection by the two clinical specialists and the professor of rehabilitation medicine indicated that the 240-point contour exhibited insufficient resolution in regions of high spinal curvature, where boundary detail is clinically meaningful. The 480-point setting was therefore selected as it provided the most consistent balance between contour fidelity and annotation complexity. Accordingly, each spinal instance was converted into a polygon annotation consisting of 480 contour points, and each coordinate was normalized by the image width and height before being saved in YOLO segmentation label format.
The final converted annotations were stored in the format <class_id> <x1> <y1> <x2> <y2>… <xn> <yn>, where class_id represents the scoliosis severity class and (xi, yi) represents the normalized polygon contour coordinates. Figure 2 shows an example of a polygon mask applied to the original mask.
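The conversion from a closed contour to a normalized label line can be sketched as below. This is a simplified stand-in under stated assumptions: the outer contour is assumed to be already extracted from the binary mask as a list of pixel coordinates, and resampling uses linear interpolation along the perimeter rather than the spline-based interpolation used in this study. Function names are hypothetical.

```python
import math

def resample_closed_contour(points, n_points=480):
    """Resample a closed polygon to n_points equally spaced along its perimeter."""
    closed = list(points) + [points[0]]
    seg_lengths = [math.dist(a, b) for a, b in zip(closed, closed[1:])]
    step = sum(seg_lengths) / n_points
    resampled, seg_idx, seg_start = [], 0, 0.0
    for k in range(n_points):
        target = k * step
        # Advance to the segment that contains the target arc length.
        while seg_idx < len(seg_lengths) - 1 and seg_start + seg_lengths[seg_idx] < target:
            seg_start += seg_lengths[seg_idx]
            seg_idx += 1
        (x0, y0), (x1, y1) = closed[seg_idx], closed[seg_idx + 1]
        length = seg_lengths[seg_idx]
        t = (target - seg_start) / length if length else 0.0
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled

def yolo_seg_label(class_id, contour_px, img_w, img_h, n_points=480):
    """Build '<class_id> <x1> <y1> ... <xn> <yn>' with coordinates normalized to [0, 1]."""
    pts = resample_closed_contour(contour_px, n_points)
    coords = " ".join(f"{x / img_w:.6f} {y / img_h:.6f}" for x, y in pts)
    return f"{class_id} {coords}"

# Toy rectangular "contour" in pixel coordinates on a hypothetical 1000 x 3000 image.
toy_contour = [(400, 200), (600, 200), (600, 2800), (400, 2800)]
label_line = yolo_seg_label(1, toy_contour, img_w=1000, img_h=3000)
print(len(label_line.split()))  # one class id followed by 2 * 480 coordinates
```

Normalizing x by the image width and y by the image height keeps the label independent of the resolution at which the image is later resized for training.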
Because each image contained a single spinal instance, each YOLO segmentation label consisted of one polygon object and one severity class. The class IDs were assigned as 0, 1, and 2 for normal, mild, and severe cases, respectively. In other words, the YOLO segmentation labels used in this study were designed not only to predict the location and mask of the spine, but also to simultaneously predict the severity class of the spinal instance. During training, whenever augmentations such as image resizing, rotation, cropping, and scaling were applied, the same geometric transformations were also applied to the masks and polygon coordinates to maintain image–label consistency.

3.3. Class-Balanced Offline Augmentation

The scoliosis dataset used in this study exhibited an imbalance in the number of samples among the normal, mild, and severe classes. Such class imbalance may bias the model toward the majority class and, particularly in screening settings, may increase false negatives, in which actual scoliosis cases are misclassified as normal. To address this issue, offline data augmentation was applied to balance the distribution of the training data across classes.
As described in Section 2.5, the augmentation strategy was restricted to geometric transformations that simulate real-world radiographic acquisition conditions. Rotation was limited to ±5° and scaling to 0.8–1.2 so that neither the overall spinal contour nor the Cobb angle interpretation was materially affected. Cropping was applied within a range that did not result in excessive loss of the spinal contour. Consequently, misclassifications near the 10° and 25° severity thresholds are more likely attributable to the inherent difficulty of borderline cases in real radiographs than to augmentation-induced label distortion. Representative examples of the applied augmentations are shown in Figure 3.
During all augmentation processes, parameter values were randomly assigned to avoid overrepresentation of specific ranges. The same geometric transformations were applied to both the original images and the segmentation masks to prevent label mismatch and maintain image-label consistency.
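The requirement that images and labels share the same transformation can be sketched for the polygon coordinates as follows. The rotation and scaling ranges are those stated above; uniform sampling, rotation about the image center, and the function names are assumptions for illustration. The image itself would need to be warped with the same sampled parameters, which is not shown here.

```python
import math
import random

def sample_params(rng):
    """Draw a rotation (degrees) and a scale factor from the ranges stated in the text."""
    return rng.uniform(-5.0, 5.0), rng.uniform(0.8, 1.2)

def transform_points(points, angle_deg, scale, center):
    """Rotate and scale pixel-coordinate (x, y) points about the given center."""
    cx, cy = center
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + scale * (c * dx - s * dy), cy + scale * (s * dx + c * dy)))
    return out

# Hypothetical 1000 x 3000 image and a toy polygon in pixel coordinates.
rng = random.Random(0)
angle, scale = sample_params(rng)
polygon = [(400, 200), (600, 200), (600, 2800), (400, 2800)]
augmented = transform_points(polygon, angle, scale, center=(500, 1500))
print(round(angle, 2), round(scale, 2), len(augmented))
```

Working in pixel coordinates and normalizing afterwards avoids the distortion that would arise from rotating width- and height-normalized coordinates on a non-square radiograph.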
After offline augmentation, the number of training images in the normal class remained unchanged, whereas the mild and severe classes were each increased to 2400 images. This strategy enabled the model to learn the differences in spinal contours among normal, mild, and severe cases in a more balanced manner while minimizing bias toward any specific class. It should be noted that, while overall accuracy improved following augmentation, an increase in false negatives was also observed in some experimental settings. This is interpreted not as augmentation-induced label distortion, but as a consequence of the model encountering a more balanced class distribution: as the model was exposed to a greater proportion of mild and severe cases, the decision boundary near the severity thresholds became more sensitive, which may have increased misclassifications in borderline cases. This observation motivated the subsequent introduction of background-only negative samples to further stabilize screening performance.

3.4. Preliminary Experiment with Semantic Segmentation

In the initial stage of this study, a preliminary semantic segmentation-based approach was explored to examine whether scoliosis severity could be represented as pixel-level classes. The semantic segmentation model was trained to predict the normal, mild, and severe classes at the pixel level. However, scoliosis severity is determined not by the attributes of individual pixels or local regions, but by the degree of curvature and structural alignment of the overall spinal contour.
In this preliminary observation, the semantic segmentation-based approach showed a class fragmentation problem, whereby different severity classes were predicted in different regions of a single spinal structure, as illustrated in Figure 4. This finding was used as a methodological motivation for reformulating scoliosis severity screening as an instance-level structural classification problem.
However, this preliminary comparison was qualitative in nature and was not intended as a systematic quantitative comparison between semantic segmentation and instance segmentation baselines. A comprehensive quantitative evaluation of semantic segmentation models, including class fragmentation frequency, class-wise recall, and mask-based metrics, would require additional experiments with multiple semantic segmentation baselines and a clearly defined pixel-level error criterion.
Accordingly, this study redefined scoliosis assessment not as a pixel-level classification problem, but as a whole-spine, instance-level structural classification problem. Based on this reformulation, subsequent experiments focused on a YOLO-based instance segmentation framework in which the entire spine was treated as a single anatomical instance.
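The class fragmentation problem can be made concrete with a small check on a pixel-level prediction map. The sketch below assumes an integer label map in which 0 is background and 1, 2, and 3 denote normal, mild, and severe; this encoding and the helper names are assumptions for illustration. The majority vote is shown only as a contrast to instance-level prediction and is not the method adopted in this study.

```python
from collections import Counter


def severity_pixel_counts(pred, background=0):
    """Count the non-background class labels in a 2D label map."""
    return Counter(v for row in pred for v in row if v != background)


def is_fragmented(pred, background=0):
    """True when more than one severity class appears within the spine."""
    return len(severity_pixel_counts(pred, background)) > 1


def majority_label(pred, background=0):
    """Collapse a fragmented map to one label by pixel majority (illustrative)."""
    counts = severity_pixel_counts(pred, background)
    return counts.most_common(1)[0][0] if counts else None


# Toy prediction: the upper spine is labeled 1 and the lower part 2
toy_prediction = [[0, 1, 0],
                  [0, 1, 0],
                  [0, 2, 0]]
fragmented = is_fragmented(toy_prediction)
```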

3.5. YOLO-Based Instance Segmentation

In this study, a YOLO-based segmentation model was employed to detect the entire spinal contour as a single object instance while simultaneously predicting the normal, mild, and severe classes. YOLO is a single-stage detection framework that predicts bounding boxes, class probabilities, and objectness scores directly from the input image. The basic architecture is illustrated in Figure 5.
For the segmentation task, an additional mask prediction branch was incorporated to generate an instance mask for each object. The baseline models evaluated in this study were YOLOv8, YOLOv9, YOLO11, and YOLOv12.
The YOLO segmentation model consists of a backbone, neck, detection head, and segmentation head. The backbone extracts multi-scale features from the input radiographs, whereas the neck fuses features at different resolutions. The detection and segmentation heads use feature maps at each scale to predict bounding boxes, class labels, confidence scores, and segmentation masks.
Because standing whole-spine radiographs generally contain a single major spinal structure, a panoptic segmentation framework designed to simultaneously handle multiple object categories and background stuff classes was considered unnecessary in this study. Instead, the entire spine was defined as a single anatomical instance, and an instance segmentation approach was adopted to simultaneously predict both the mask and the severity class of that instance.

3.6. Construction of Background-Only Negative Samples

After the application of class-balanced augmentation, the overall accuracy of the model improved; however, an increase in false negatives was observed in some experiments. This finding suggests that the model became more sensitive to positional information or background patterns than to the global structural contour of the spine, likely owing to variations in spinal position, image centering, and background distribution introduced by cropping.
The primary objective of this study was three-level scoliosis screening, in which false negatives (cases in which actual scoliosis patients are incorrectly classified as normal) represent a clinically critical error. To address this issue, background-only negative samples were constructed by extracting regions without the spine from standing whole-spine radiographs and incorporating them into the training process as empty-label samples. This approach was intended to mitigate positional bias and background-dependent feature learning that may arise during augmentation. Using this strategy, we investigated whether the model could more reliably distinguish non-spinal regions from true spinal instances and thereby reduce clinically important false negatives in screening.
To generate background-only negative samples, background regions that did not overlap with the spinal object were automatically extracted from the entire training dataset. First, the bounding region of the spinal object was obtained from the label file of each image. A prohibited region was then defined by applying padding around both the image center and the spinal bounding box. Random crops were repeatedly generated, and any crop overlapping the spinal region by even a single pixel was excluded. In addition, excessively dark regions or regions with extremely low contrast were removed so that only background samples containing a minimal amount of image information were retained.
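The extraction procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the label file is assumed to follow the YOLO segmentation convention of a class index followed by normalized polygon coordinates, the image is represented as a 2D list of grayscale values, and the padding, brightness, and contrast thresholds are hypothetical values rather than those used in the study.

```python
import random
import statistics


def polygon_to_bbox(label_line, width, height):
    """Parse 'cls x1 y1 x2 y2 ...' (normalized) into a pixel bounding box."""
    values = [float(v) for v in label_line.split()[1:]]
    xs = [v * width for v in values[0::2]]
    ys = [v * height for v in values[1::2]]
    return (min(xs), min(ys), max(xs), max(ys))


def pad_box(box, pad, width, height):
    """Grow a box by `pad` pixels on every side, clipped to the image."""
    x1, y1, x2, y2 = box
    return (max(x1 - pad, 0), max(y1 - pad, 0),
            min(x2 + pad, width), min(y2 + pad, height))


def overlaps(a, b):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_background_crops(image, forbidden, crop, n, rng,
                            min_mean=20.0, min_std=5.0, max_tries=1000):
    """Draw random square crops that avoid every forbidden box and are
    neither too dark nor too flat (thresholds are assumed values)."""
    height, width = len(image), len(image[0])
    crops = []
    for _ in range(max_tries):
        if len(crops) == n:
            break
        x = rng.randint(0, width - crop)
        y = rng.randint(0, height - crop)
        box = (x, y, x + crop, y + crop)
        if any(overlaps(box, f) for f in forbidden):
            continue  # reject any overlap with the prohibited region
        pixels = [image[r][c] for r in range(y, y + crop)
                  for c in range(x, x + crop)]
        if statistics.fmean(pixels) < min_mean or statistics.pstdev(pixels) < min_std:
            continue  # reject excessively dark or low-contrast crops
        crops.append(box)
    return crops
```

Each accepted crop would then be saved as an image with an empty label file, so that it contributes no object target during training.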
The selected crops were then saved as empty-label images and added to the original training dataset. This process enabled the model to more clearly distinguish background regions without the spine from true spinal objects. Background-only negative samples were added only to the training set and were not included in the validation or test sets. Ultimately, 1200 background-only negative samples were generated, corresponding to approximately 15% of the non-background augmented training set. The number of background-only negative samples was empirically selected during model development to provide sufficient negative-context exposure without excessively increasing empty-label samples. A systematic ratio-ablation study of background-only samples was not conducted in the present revision and is therefore acknowledged as an important topic for future work. These samples, shown in Figure 6, were used together with the original and augmented images during training.
This dataset-level reconstruction was designed to complement class-balanced augmentation. While augmentation increased the diversity of scoliosis-related samples, background-only negative samples provided negative contextual information from the same radiographic domain. Therefore, the model was encouraged not only to learn class-balanced spinal contour variations, but also to distinguish true spinal instances from non-spinal anatomical or background regions. This complementary data strategy was particularly important for reducing screening false negatives, which are clinically more critical than false positives in scoliosis screening.
The final dataset, including the background-only negative samples, is presented in Table 3.

3.7. Integration of Attention Modules

To improve the performance of YOLO segmentation, attention modules were incorporated into the backbone in this study. Attention modules enhance the representational capacity of the model by emphasizing diagnostically relevant channels or spatial regions within feature maps. In particular, they allow the model to focus more effectively on spinal contours and curvature patterns that are critical for diagnosis.
In scoliosis diagnosis, overall spinal curvature, lateral deviation, and contour morphology are important factors. Therefore, it is necessary not only to capture local features but also to stably represent the global structural characteristics of the entire spine. In this study, two representative attention modules, CBAM and the Swin Transformer block, both of which have demonstrated effectiveness in YOLO-based object detection, were incorporated and comparatively evaluated.

3.7.1. CBAM

CBAM is a lightweight attention module that sequentially applies channel attention and spatial attention. Channel attention learns the relative importance of each channel within the feature maps, whereas spatial attention identifies the regions within the image on which the model should focus. In this study, CBAM was inserted after the major feature extraction blocks of the YOLO backbone to emphasize features associated with the spinal contour.
CBAM was adopted from its original formulation without structural modification [28]. The CBAM-integrated architecture was implemented by inserting the attention blocks into the YOLO backbone while preserving the original backbone, neck, and head structure. The contribution of incorporating CBAM in this study lies in the experimental confirmation that it can improve spinal contour representation and three-level severity screening performance within a whole-spine instance segmentation framework applied to scoliosis radiographs.
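For reference, the original CBAM formulation [28] can be summarized as follows, where F is the input feature map, σ is the sigmoid function, ⊗ denotes element-wise multiplication with broadcasting, and f^{7×7} is a convolution with a 7 × 7 kernel. The symbols follow the original CBAM paper rather than any notation defined elsewhere in this article, and the exact insertion points within the YOLO backbone are not specified by these equations.

```latex
\begin{aligned}
M_c(F)  &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \\
F'      &= M_c(F) \otimes F, \\
M_s(F') &= \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big), \\
F''     &= M_s(F') \otimes F'.
\end{aligned}
```

Channel attention M_c pools over the spatial dimensions, whereas spatial attention M_s pools over the channel dimension, which is why the two are applied sequentially.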

3.7.2. Swin Transformer Block

To explore the potential of Transformer-based attention mechanisms, the Swin Transformer block was incorporated into the YOLO backbone. The Swin Transformer, originally proposed by Liu et al. [29], partitions input features into local windows and performs self-attention within each window, while the shifted-window mechanism enables interaction across regions. This module was not newly proposed in this study; it was included solely to investigate whether a Transformer-based approach could offer additional performance gains over CNN-based attention in the scoliosis screening task.
However, Transformer-based architectures generally require greater computational cost and memory consumption than CNN-based attention modules. Therefore, in practical clinical screening environments, the trade-off between performance improvement and computational efficiency must be carefully considered.
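The windowed self-attention described above can be written compactly using the notation of the original Swin Transformer paper [29]. Here z^l denotes the features after block l, LN is layer normalization, W-MSA and SW-MSA are the regular and shifted window multi-head self-attention modules, d is the query/key dimension, and B is the relative position bias. How these blocks were wired into the YOLO backbone in this study is not specified by these equations.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V, \\
\hat{z}^{l}   &= \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
\end{aligned}
```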

3.8. Comparison of Segmentation Head Architectures

Segmentation head architectures previously established in semantic segmentation were incorporated into the YOLO segmentation framework to improve restoration of spinal contour boundaries. Specifically, U-Net- and DeepLabV3+-based heads were adopted from prior studies and compared without claiming them as newly designed modules [23]. The effects of each architecture on spinal mask reconstruction and severity classification performance were analyzed in the specific context of whole-spine radiographic scoliosis screening.
The U-Net-based head facilitates boundary refinement through skip connections. The DeepLabV3+-based head combines multi-scale contextual feature extraction using atrous convolution with decoder-based boundary refinement, which may be particularly advantageous for preserving the morphological characteristics of the long, continuous spinal contour.
These architectures were compared within the same YOLO-based segmentation framework to identify the most suitable configuration for the scoliosis dataset. The detection and classification branches of the baseline YOLO segmentation model were preserved, whereas only the segmentation head responsible for mask prediction was modified according to each architecture. This design allowed the model to retain severity class prediction for each spinal instance while improving boundary representation during mask decoding.
The two segmentation heads were compared under identical YOLO backbone conditions, and not only segmentation performance but also classification performance for the normal, mild, and severe classes, as well as the reduction in false negatives, were comprehensively evaluated.

3.9. Proposed Network Architecture

The final network configuration used in this study was constructed by integrating CBAM [28] and a DeepLabV3+-based segmentation head [25] into the YOLOv12 segmentation baseline, which showed the best baseline segmentation performance among the evaluated YOLO models. These components were not newly designed in this study; rather, they were selected and combined to adapt the YOLO-based instance segmentation framework to the characteristics of standing whole-spine radiographs and three-level scoliosis severity screening.
The overall architecture consists of the following stages. First, the input standing whole-spine radiograph is fed into the YOLO backbone. CBAM is applied to the major feature extraction blocks of the backbone to recalibrate channel and spatial features. Multi-scale feature fusion is then performed in the neck. The detection and classification branch predicts the bounding box, confidence score, and severity class of the spinal instance, whereas the DeepLabV3+-based segmentation head reconstructs the spinal instance mask. Ultimately, the model simultaneously outputs the spinal mask and one of the normal, mild, and severe classes. The resulting model preserves the single-stage segmentation structure of YOLO while incorporating CBAM to enhance features related to the spinal contour and a DeepLabV3+-based head to strengthen multi-scale contextual representation and boundary restoration. The purpose of this configuration is not to introduce a fundamentally new network module, but to evaluate whether these established components can improve contour extraction and severity screening performance when applied to whole-spine radiographs. Figure 7 illustrates the architecture of the proposed model, in which CBAM and the DeepLabV3+-based head are integrated into the baseline YOLO segmentation framework. Figure 8 shows the differences between the baseline YOLO model and the proposed model.
In this study, the term “boundary-enhanced” refers specifically to the enhancement of spinal contour boundary representation. It does not refer to Cobb angle threshold boundaries between severity classes. The DeepLabV3+-based head was incorporated to improve mask quality and boundary restoration of the elongated spinal contour, whereas CBAM was used to emphasize features related to the spinal contour and spatial structure.

3.10. Training Design

The standing whole-spine radiographs used for training varied in size and contrast depending on the imaging equipment, patient body habitus, degree of spinal curvature, and image resolution. Therefore, to improve training stability, all images were resized to a uniform input resolution of 640 × 640. During the resizing process, a simple resizing method was applied while minimizing distortion of the spinal aspect ratio.
To evaluate the effects of each component on model performance, all experiments were conducted under identical training conditions. The same training, validation, and test split was used throughout the study to ensure a fair comparison.
The hardware environment consisted of an Intel Core Ultra 9 285K processor, 128 GB of RAM, and an NVIDIA GeForce RTX 5090 GPU with 32 GB of memory. The software environment included Python 3.10, PyTorch 2.11.0, and CUDA 12.8. The initial learning rate was set to 0.01 and decayed linearly to 0.0001 over the course of training (lrf = 0.01). The batch size was 32, and the maximum number of training epochs was 300. The stochastic gradient descent (SGD) optimizer was used with a momentum of 0.937 and a weight decay of 0.0005. A linear warmup was applied over the first 3 epochs.
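The learning-rate settings above can be checked with a short sketch. It assumes the linear decay takes the form lr0 × ((1 − e/E)(1 − lrf) + lrf), which is consistent with the stated start and end values (0.01 and 0.0001), and it simplifies the warmup to a per-epoch linear ramp; the actual warmup in the training framework may operate per iteration, so the first three values should be read as an approximation.

```python
def learning_rate(epoch, lr0=0.01, lrf=0.01, epochs=300, warmup_epochs=3):
    """Linear decay from lr0 to lr0 * lrf with a simplified linear warmup."""
    decay = (1.0 - epoch / epochs) * (1.0 - lrf) + lrf
    scheduled = lr0 * decay
    if epoch < warmup_epochs:
        # Assumed per-epoch ramp; real frameworks often ramp per iteration.
        return scheduled * (epoch + 1) / warmup_epochs
    return scheduled


# Learning rate at the start, just after warmup, and at the end of the schedule
schedule_points = {e: learning_rate(e) for e in (0, 3, 150, 300)}
```

Under these assumptions, the value at epoch 300 equals lr0 × lrf = 0.0001, matching the final learning rate reported above.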
For YOLO-based instance segmentation training, the total loss consisted of bounding-box localization loss, classification loss, distribution focal loss (DFL), and segmentation mask loss, as follows:
$$L_{total} = L_{box} + L_{cls} + L_{dfl} + L_{mask}$$
The box loss and DFL were used for bounding-box localization, the classification loss was used for severity class prediction, and the segmentation mask loss was used to optimize the predicted spinal instance mask. The box, classification, and DFL loss weights were set to 7.5, 0.5, and 1.5, respectively, following the YOLOv12 configuration. The segmentation mask loss was computed by the segmentation branch during mask prediction. Automatic mixed precision (AMP) training was enabled to reduce memory usage. Mosaic augmentation was disabled during the final 10 epochs (close_mosaic = 10) to stabilize training convergence. All models were trained from scratch without pre-trained weights, and early stopping was applied with a patience of 100 epochs.
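If the reported loss weights enter as multiplicative gains on the individual terms, which is an assumption based on common YOLO-style configurations rather than something stated explicitly here, the total loss can be written in weighted form. The mask weight λ_mask is not reported in this section and is therefore left symbolic.

```latex
L_{total} = \lambda_{box} L_{box} + \lambda_{cls} L_{cls} + \lambda_{dfl} L_{dfl} + \lambda_{mask} L_{mask},
\qquad \lambda_{box} = 7.5,\quad \lambda_{cls} = 0.5,\quad \lambda_{dfl} = 1.5
```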
The computational efficiency of the proposed model was evaluated on the test set using the NVIDIA GeForce RTX 5090 GPU. The end-to-end inference speed, including preprocessing, model inference, and postprocessing, was approximately 83.3 images per second (approximately 12.0 ms per image), indicating that the proposed framework is capable of near-real-time processing under the tested environment, which may be suitable for clinical screening workflows. The number of trainable parameters for each model is summarized in Table 4. Because FLOPs and peak GPU memory usage were not measured in this study, the computational analysis should be interpreted as partial profiling rather than a complete efficiency benchmark, and we avoid claiming comprehensive computational superiority over all baseline models. Detailed FLOPs and GPU memory profiling are reserved for future work.
For reference, according to the official Ultralytics benchmarks, the baseline YOLOv12m-seg architecture requires approximately 87.3 GFLOPs. Although detailed multi-GPU server-level memory profiling and sample-specific FLOPs variations were reserved for future work owing to institutional runtime constraints, the proposed network maintained a lightweight profile (27.9 M parameters) comparable to standard mid-sized models, supporting its practical feasibility without excessive hardware overhead. The random seed was set to 0 to improve the reproducibility of the experiments.

3.11. Evaluation Metrics

The primary objective of this study was to accurately extract the spinal contour from standing whole-spine radiographs and to perform three-level scoliosis screening for normal, mild, and severe cases. Accordingly, model evaluation was conducted with consideration of not only detection performance but also false-negative reduction, particularly the clinically significant misclassification of mild and severe cases as normal.

3.11.1. Detection and Segmentation Evaluation

To evaluate the object detection and mask prediction performance of the YOLO segmentation model, mean Average Precision (mAP) was used as the primary evaluation metric [38].
Recall denotes the proportion of actual spinal objects correctly detected by the model, whereas precision denotes the proportion of predicted spinal objects that corresponded to true spinal objects.
$$\text{Recall}\ (R) = \frac{TP}{TP + FN}$$
$$\text{Precision}\ (P) = \frac{TP}{TP + FP}$$
Average Precision (AP) is defined as the area under the precision–recall curve, and mAP is computed as the mean AP across all classes.
$$\text{Average Precision (AP)} = \int_{0}^{1} P \, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
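The AP definition can be illustrated with a small sketch that integrates the precision-recall curve for one class. It uses all-point integration with a monotone precision envelope, which is one common convention; evaluation toolkits may instead sample the curve at fixed recall points, so values can differ slightly from those produced by the framework used in this study. The function name and the toy detections are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for a single class."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Rank detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing as recall grows (precision envelope)
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP as the arithmetic mean of the class-wise AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: four detections, four ground-truth objects
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
```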
Intersection over Union (IoU) quantifies the degree of overlap between the predicted mask and the ground truth mask. In this study, both bounding box-based mAP and mask-based mAP were calculated. IoU is defined as follows:
$$\text{Intersection over Union (IoU)} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
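A minimal sketch of the IoU computation for both boxes and binary masks is given below. Boxes are assumed to be (x1, y1, x2, y2) tuples and masks are assumed to be equally sized 2D lists of 0/1 values; these representations are chosen for illustration only.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a, b):
    """IoU of two binary masks stored as equally sized 2D lists."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union > 0 else 0.0


example_box_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
example_mask_iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

Because a mask occupies only part of its bounding box, the two IoU values for the same prediction need not coincide, even when both exceed a given threshold.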

3.11.2. Severity Classification Evaluation

In addition, given the screening-oriented purpose of this study, false negatives were adopted as a major evaluation metric. In particular, cases in which patients with actual mild or severe scoliosis were classified as normal or were not detected at all were defined as screening false negatives.
FN = Actual Mild or Severe, Predicted Normal or No Detection
This metric allowed not only overall average performance to be assessed, but also clinically important missed detections and underestimation errors occurring during the actual screening process to be quantitatively evaluated.
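The screening false-negative definition can be expressed as a short counting routine. The sketch assumes one prediction per image, with `None` standing for the case in which no spinal instance was detected; the label strings and the function name are illustrative.

```python
def count_screening_false_negatives(actual, predicted):
    """Count mild/severe cases predicted as normal or not detected at all.

    `predicted` holds one label per image, or None when no instance was found.
    """
    return sum(
        1
        for a, p in zip(actual, predicted)
        if a in ("mild", "severe") and p in ("normal", None)
    )


# Toy labels for illustration only
actual = ["normal", "mild", "severe", "mild", "severe"]
predicted = ["normal", "normal", None, "severe", "severe"]
screening_fn = count_screening_false_negatives(actual, predicted)
```

Note that a mild case predicted as severe is a classification error but not a screening false negative under this definition.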

4. Results

In this study, the primary objective was not simply to localize the spine in standing whole-spine radiographs, but to stably extract the entire spinal contour and perform three-level scoliosis screening (normal, mild, and severe) based on this structural information. In scoliosis screening, false negatives, in which actual mild or severe cases are incorrectly classified as normal, constitute a clinically important error. Accordingly, model evaluation included not only mask mAP and overall accuracy but also the number of false negatives. Representative detection results are presented in Figure 9 and Figure 10.
First, semantic segmentation was qualitatively compared with YOLO-based instance segmentation to examine the need for an instance-level approach in scoliosis severity classification. In this preliminary comparison (Section 3.4), semantic segmentation produced class fragmentation, in which different severity classes were assigned to different regions of a single spine. In contrast, the YOLO-based instance segmentation framework consistently predicted a single severity label for the entire spinal structure.
Subsequently, using the baseline YOLO segmentation model as the reference, class-balanced offline augmentation, background-only negative samples, attention modules, and segmentation head architectures were sequentially evaluated. Finally, the proposed model integrating CBAM with a DeepLabV3+-based segmentation head was compared with the baseline model and with variants incorporating individual modules.
The proposed model stably extracted the entire spinal contour from standing whole-spine radiographs and classified each spinal instance into one of three severity categories: normal, mild, or severe. To reduce the influence of random initialization, each experiment was repeated three times using a fixed random seed. The model achieving the best validation performance among the three runs was selected, and its test-set performance is reported in Table 5. In particular, the class fragmentation observed in semantic segmentation was substantially reduced in the YOLO-based instance segmentation approach. In the present evaluation, the box-based mAP@0.5 and mask-based mAP@0.5 were numerically identical under the adopted evaluation setting. However, this should not be interpreted as a general property of single-instance datasets, because mask-level overlap and bounding-box overlap may differ depending on contour quality.
It should be noted that, in all experimental settings, the mask-based mAP@0.5 and the bounding-box-based mAP@0.5 yielded numerically identical results. This is likely related to the single-instance nature of the dataset, in which each standing whole-spine radiograph contains exactly one spinal object. Under this condition, detection success or failure is determined at the instance level, and in the present experiments the mask IoU threshold produced outcomes consistent with the bounding-box IoU threshold. Accordingly, the mAP@0.5 values reported in Table 5 represent both box-level and mask-level performance under the adopted evaluation setting.
The detailed training process of the proposed model is presented in Figure 11.
Training and validation losses decreased steadily with increasing epochs, and validation performance exhibited convergence after a certain number of epochs. These results suggest that the proposed model was able to learn spinal contour features and severity classes from standing whole-spine radiographs in a stable and consistent manner.
The confusion matrix is presented in Figure 12.
Class-wise analysis showed that the proposed model achieved stable recall not only for the normal class but also for the mild and severe classes. In particular, errors in which actual mild or severe cases were misclassified as normal were reduced compared with the baseline model. This improvement is likely attributable to the proposed model more effectively reflecting the overall spinal contour structure and curvature information.
Analysis of the confusion matrix showed that misclassifications occurred primarily between adjacent severity classes. For example, some mild cases were classified as normal or severe, whereas some severe cases were classified as mild. Such errors are likely to arise in cases located near the class boundaries defined by the Cobb angle criteria. Therefore, future studies should consider the continuous relationship between actual Cobb angle values and predicted classes.
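The class-wise analysis described above can be reproduced from predicted and actual labels with a few lines of code. The sketch below builds a 3 × 3 confusion matrix (rows are actual classes, columns are predicted classes), computes class-wise recall, and measures the share of errors that fall between adjacent severity classes. The toy labels are hypothetical and do not correspond to the results in Figure 12; undetected cases are not handled here.

```python
CLASSES = ("normal", "mild", "severe")


def confusion_matrix(actual, predicted):
    """Rows index the actual class, columns the predicted class."""
    index = {name: i for i, name in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def class_recall(matrix):
    """Recall per class: diagonal entry divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


def adjacent_error_fraction(matrix):
    """Share of misclassifications that fall between neighboring classes."""
    n = len(matrix)
    errors = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    adjacent = sum(matrix[i][j] for i in range(n) for j in range(n)
                   if abs(i - j) == 1)
    return adjacent / errors if errors else 0.0


# Toy labels for illustration only
actual = ["normal", "normal", "mild", "mild", "mild", "severe", "severe"]
predicted = ["normal", "mild", "mild", "normal", "severe", "severe", "mild"]
matrix = confusion_matrix(actual, predicted)
```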
The precision–recall curve is presented in Figure 13, and the loss curves for each training run are shown in Figure 14.
Overall, the proposed model maintained high precision and recall, with recall showing particular improvement for the mild and severe classes compared with the baseline model. The class-wise AP@0.5 of the proposed model was as follows: normal = 0.902, mild = 0.756, severe = 0.854. These values confirm that the model maintained balanced detection performance across all three severity categories.
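As a consistency check, applying the mAP definition from Section 3.11.1 to the three class-wise AP@0.5 values above gives their arithmetic mean. This is a worked calculation from the numbers stated in this paragraph and should be compared against the value reported in Table 5.

```latex
mAP@0.5 = \frac{1}{3}\,(0.902 + 0.756 + 0.854) = \frac{2.512}{3} \approx 0.837
```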
To assess training stability, each experiment was repeated three times using fixed random seeds, and the best-performing run on the validation set was selected for test-set evaluation. Across all three runs, the maximum observed difference in mAP@0.5 did not exceed 0.002, indicating that the reported results are stable and reproducible under the given training conditions.

5. Discussion

This study aimed to develop a screening-oriented AI model for assisting three-level scoliosis assessment (normal, mild, and severe) based on standing whole-spine radiographs.
Building upon clinician-assigned severity labels derived from established Cobb angle criteria, which are clinically used to guide management decisions such as observation, follow-up, bracing, and surgical referral, the proposed framework treats the entire spinal contour as a single anatomical instance. This approach is not intended to replace precise Cobb angle measurement or vertebra-level analysis, which remain essential for definitive diagnosis, curve monitoring, and treatment planning. Rather, it is designed to serve as a complementary screening aid for rapid case triage and missed-detection prevention. Accordingly, this formulation supports screening-oriented severity classification while aiming to reduce clinically important false negatives.
Initial experiments confirmed that a semantic segmentation-based approach was not well suited to scoliosis severity classification. Although semantic segmentation showed some ability to extract the spinal contour itself, it produced a class fragmentation problem in which different severity classes were simultaneously predicted within a single spinal structure. This limitation arises because scoliosis severity is determined not by the features of specific pixels or local regions, but by the overall curvature pattern and structural deformity of the spine. In other words, semantic segmentation operates at the pixel level, whereas actual clinical judgment is made at the whole-spine level. Owing to this fundamental mismatch, semantic segmentation may be useful for contour extraction but remains limited for stable severity classification. In contrast, the proposed YOLO-based instance segmentation framework alleviated this inconsistency by defining the entire spine as a single anatomical instance and assigning one severity class to it.
Two-stage instance segmentation models, such as Mask R-CNN [39], can provide high-quality masks; however, their structural complexity and relatively higher computational cost may limit their suitability for rapid screening workflows. By contrast, YOLO-based one-stage segmentation performs detection, classification, and segmentation within a unified network, providing a favorable balance between accuracy and computational efficiency. This structural advantage has been highlighted in recent reviews of YOLO-based medical imaging applications, which emphasize its potential suitability for clinical decision-support tasks requiring rapid inference. In addition, because standing whole-spine radiographs typically contain one principal spinal structure, treating the entire spine as a single anatomical instance is more consistent with the clinical decision unit of this study than using a panoptic segmentation framework designed to jointly address instance-level objects and background regions [40]. The findings of this study also suggest that YOLO-based instance segmentation can preserve the overall spinal contour while performing three-level severity screening.
From a data perspective, class imbalance had an important influence on model reliability. The collected dataset was not evenly distributed across the normal, mild, and severe classes, and such imbalance can cause the model to become biased toward majority classes. This issue is particularly important in a screening-oriented model because decreased recall for disease-related classes directly increases false negatives, in which actual patients are incorrectly judged to be normal. To alleviate this problem, class-balanced offline augmentation based on physical transformations, including cropping, rotation, scaling and Mosaic augmentation, was applied. These strategies increase training diversity while preserving the diagnostic meaning of medical images. In practice, overall accuracy and several classification metrics improved after augmentation. However, some experiments also showed a simultaneous increase in false negatives.
These findings indicate that data augmentation does not always translate into clinically meaningful performance improvement. In scoliosis radiographs, spinal position, acquisition range, patient posture, and surrounding background structures all influence feature learning. Excessive geometric augmentation may cause the model to become more sensitive to positional information or background patterns than to the structural curvature of the spine itself. In such cases, overall accuracy may improve, yet clinically important abnormal cases may still be missed, thereby increasing false negatives. Thus, improvements in classification metrics do not necessarily imply improved clinical reliability.
To address this limitation, background-only negative samples were introduced as empty-label negative-context samples. Unlike conventional hard negative mining, which mainly focuses on suppressing false positives, our background-only negative samples were specifically designed to reduce false negatives caused by augmentation-induced positional bias in screening-oriented radiographs. These samples were generated by extracting non-spinal regions from standing whole-spine radiographs and were not treated as an additional class among normal, mild, and severe, but rather as training samples without object labels. Through this strategy, the model learned to distinguish more clearly between regions containing the spine and those without the spine, thereby reducing excessive dependence on positional information or background patterns. In practice, false negatives decreased after the introduction of background-only negative samples, suggesting that negative-context learning can be a highly effective strategy for screening-oriented medical image AI. By learning background information explicitly, the model was prevented from overfitting to the specific position of the spine or surrounding contextual cues and instead focused more directly on the spinal contour itself, thereby preserving recall. From a clinical perspective, because overlooking actual mild or severe cases is a critical error, reducing false negatives is more meaningful than merely increasing overall accuracy. The background-only negative samples used in this study were generated in a manner tailored to the characteristics of standing whole-spine radiographs, informed by prior studies on negative or hard-negative sample strategies in object detection. The experimental results showed a reduction in screening false negatives after incorporating these samples, suggesting that they may help the model distinguish true spinal instances from non-spinal background regions. 
However, a detailed analysis of how these samples contribute to individual loss components, such as objectness-related loss, classification loss, and mask loss, was not conducted in this study, as it was beyond the primary scope of the present work. This loss-level mechanistic analysis should be addressed in future research.
In the comparison of attention modules, CBAM showed greater practical utility than the Swin Transformer block. Although Swin Transformer can effectively model long-range dependencies through self-attention, it substantially increases computational cost and memory requirements. CBAM, by contrast, is a considerably lighter module and was nevertheless effective in improving accuracy and reducing false negatives. This may be because, in standing whole-spine radiographs, the spine already dominates the image as the principal anatomical structure; consequently, more precise emphasis on channels and spatial regions relevant to the spinal contour may be more beneficial than additional global long-range dependency modeling. In other words, for scoliosis screening, boundary preservation and spatial refinement appear to be more important than further expansion of global contextual modeling.
A similar interpretation was supported by the comparison of segmentation heads. Among the evaluated segmentation heads, the U-Net-based head achieved slightly higher mAP@0.5, whereas the DeepLabV3+-based head showed a lower number of screening false negatives and provided stable contour-oriented representation. Therefore, DeepLabV3+ was selected as the final segmentation head because the present study prioritized screening false-negative reduction and boundary restoration rather than mAP alone. This result is likely attributable to its ability to provide both multi-scale contextual information and boundary refinement simultaneously. Scoliosis diagnosis depends not simply on the morphology of individual vertebrae, but on the continuous curvature pattern of the entire spine, including both thoracic and lumbar segments. Atrous spatial pyramid pooling can effectively capture such multi-scale structural information, while the decoder facilitates fine restoration of spinal contour boundaries. Although U-Net also improved segmentation performance to some extent, DeepLabV3+ best preserved both contour continuity and the stability of severity classification.
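The selection rule described above (fewest screening false negatives first, mAP second) can be made explicit with a short sketch. The numbers are copied from the "Final" dataset rows of Table 5; the `Result` type and the two ranking helpers are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    f1: float
    map50: float
    false_negatives: int


# Values taken from the "Final" dataset rows of Table 5.
RESULTS = [
    Result("YOLOv12 + CBAM", 0.78, 0.837, 211),
    Result("YOLOv12 + Swin", 0.77, 0.824, 212),
    Result("YOLOv12 + DeepLabV3+", 0.76, 0.826, 206),
    Result("YOLOv12 + U-Net", 0.78, 0.838, 223),
    Result("YOLOv12 + CBAM + DeepLabV3+", 0.78, 0.837, 189),
]


def screening_rank(results):
    """Sort by fewest false negatives, breaking ties by highest mAP@0.5."""
    return sorted(results, key=lambda r: (r.false_negatives, -r.map50))


def map_rank(results):
    """Sort by highest mAP@0.5 only."""
    return sorted(results, key=lambda r: -r.map50)


best_for_screening = screening_rank(RESULTS)[0]
best_by_map = map_rank(RESULTS)[0]
print("screening-first:", best_for_screening.name)
print("mAP-first:", best_by_map.name)
```

The two criteria pick different configurations, which is why the choice of DeepLabV3+ over U-Net depends on treating missed cases as the primary cost.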
Ultimately, the proposed model combining CBAM and a DeepLabV3+-based segmentation head showed the best overall balance of performance. CBAM mainly contributed to accuracy improvement and false-negative reduction, whereas DeepLabV3+ contributed to improved mask quality and boundary restoration. By integrating both components, spinal contour reconstruction and screening stability were simultaneously improved, and the model produced the fewest screening false negatives among all experimental settings (189, compared with 238 for the baseline YOLOv12 model). These findings emphasize that a clinically useful scoliosis screening system should not merely aim for high overall accuracy, but should also reliably detect abnormal cases.
This study has several limitations. First, it performed three-level classification based on clinician-assigned severity labels rather than direct regression of the Cobb angle. Although this approach is appropriate for screening, it cannot fully replace the quantitative measurements required for actual surgical planning or treatment decisions. Direct Cobb angle regression requires precise vertebral landmark localization, end-vertebra selection, and endplate orientation analysis, all of which remain highly sensitive to image quality, anatomical overlap, and observer variability. In contrast, the present study prioritized robust screening performance and false-negative reduction in practical clinical workflows.
Second, because the dataset was collected from a limited number of institutions, generalizability to diverse imaging protocols and patient populations may be restricted. Additional external validation using multi-center datasets will therefore be necessary to confirm the robustness and clinical applicability of the proposed framework.
Third, misclassifications still occurred in some cases located near the boundaries between normal and mild or between mild and severe. Although Cobb angle measurements were performed by clinicians according to standard clinical criteria, continuous Cobb angle values were converted into discrete severity categories in this study. Therefore, cases located close to the 10° and 25° thresholds may still involve inherent label uncertainty. In addition, low radiographic contrast and anatomical overlap may have contributed to false detections or misclassifications by making spinal contour delineation more difficult.
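The label uncertainty near the thresholds can be illustrated with a small sketch. The 10° and 25° cut points come from the text; which side of each boundary is inclusive, the `margin_deg` value, and the function names are assumptions made for illustration only.

```python
def severity_from_cobb(angle_deg, mild_threshold=10.0, severe_threshold=25.0):
    """Map a Cobb angle in degrees to a three-level label.

    Assumption: each threshold belongs to the more severe class.
    """
    if angle_deg < mild_threshold:
        return "normal"
    if angle_deg < severe_threshold:
        return "mild"
    return "severe"


def is_borderline(angle_deg, margin_deg=3.0, thresholds=(10.0, 25.0)):
    """Flag angles within a hypothetical margin of any class threshold."""
    return any(abs(angle_deg - t) <= margin_deg for t in thresholds)


# Two readings that differ by only 2 degrees receive different labels.
for angle in (9.0, 11.0, 17.0, 24.0, 26.0):
    print(angle, severity_from_cobb(angle), is_borderline(angle))
```

A flag of this kind could be used to separate borderline cases from clear-cut ones when analysing misclassifications.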
Fourth, the comparison between semantic segmentation and instance segmentation in this study was qualitative rather than quantitative. A systematic quantitative evaluation involving multiple semantic segmentation baselines with class-wise recall and mask-based metrics is recommended in future work.
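As a sketch of the class-wise recall mentioned above, the function below computes recall per true class from a table of counts. The counts are toy values invented for illustration and are not results from this study; the "background" column is assumed to hold cases with no matching detection, and the function name is hypothetical.

```python
CLASSES = ("normal", "mild", "severe")


def class_wise_recall(counts):
    """Recall per true class: correct predictions / all cases of that class."""
    recalls = {}
    for true_cls in CLASSES:
        row = counts[true_cls]
        total = sum(row.values())
        recalls[true_cls] = row.get(true_cls, 0) / total if total else float("nan")
    return recalls


# Toy counts (NOT from the paper): outer key = true class, inner key = prediction.
toy_counts = {
    "normal": {"normal": 90, "mild": 8, "severe": 0, "background": 2},
    "mild": {"normal": 12, "mild": 35, "severe": 2, "background": 1},
    "severe": {"normal": 0, "mild": 3, "severe": 16, "background": 1},
}

recalls = class_wise_recall(toy_counts)
for cls in CLASSES:
    print(cls, round(recalls[cls], 3))
```

Reporting recall separately for mild and severe makes missed abnormal cases visible even when the normal class dominates the test set.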
Future work should extend the present whole-spine instance segmentation framework to vertebra-level instance segmentation, in which individual vertebrae are separated as distinct anatomical instances. Because the Cobb angle is measured using the endplate orientation of the most tilted superior and inferior end vertebrae, accurate automated measurement requires vertebral separation, vertebral endplate detection, end-vertebra selection, and angle calculation to be performed together. Accordingly, future studies should expand the framework toward a multi-task learning structure that integrates vertebra-level instance segmentation with endplate-based angle estimation, thereby enabling both screening and quantitative Cobb angle regression in an end-to-end manner. Such an approach could evolve into a more comprehensive scoliosis diagnosis system that supports not only early screening but also practical clinical treatment planning.
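The final angle-calculation step can be sketched as follows, assuming the two end vertebrae have already been selected and each endplate is available as a pair of image points. The function names and the example coordinates are hypothetical, and the earlier steps (vertebral separation, endplate detection, end-vertebra selection) are not covered.

```python
import math


def line_angle_deg(p, q):
    """Orientation of the line through points p and q, in degrees within [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def cobb_angle_deg(upper_endplate, lower_endplate):
    """Acute angle between two endplate lines, each given as a pair of (x, y) points."""
    a = line_angle_deg(*upper_endplate)
    b = line_angle_deg(*lower_endplate)
    diff = abs(a - b)
    return min(diff, 180.0 - diff)


# Hypothetical endplates tilted +15 and -15 degrees from horizontal.
tilt = math.radians(15.0)
upper = ((0.0, 0.0), (math.cos(tilt), math.sin(tilt)))
lower = ((0.0, 0.0), (math.cos(-tilt), math.sin(-tilt)))
angle = cobb_angle_deg(upper, lower)
print(round(angle, 6))
```

Because the result depends only on the two line orientations, errors in endplate detection or end-vertebra selection propagate directly into the angle, which is why the text argues that these steps should be learned together.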
Regarding the clinical interpretation of model confidence scores, the present study reports confidence scores alongside predicted classes in the detection results; however, a systematic investigation of the relationship between confidence score ranges and clinical outcomes was beyond the scope of this work. The degree to which confidence scores correlate with diagnostic certainty, borderline case identification, or referral decision thresholds may vary depending on clinical context. The research group is currently conducting a dedicated study of the clinical significance of confidence score distributions across severity classes, and the results will be reported in future work.
Mechanistically, the background-only negative samples were used as empty-label images that provided negative contextual exposure from the same radiographic domain. Because these samples did not contain foreground spine annotations, they were intended to help the model distinguish non-spinal radiographic regions from true spinal instances and suppress erroneous foreground predictions in background regions. The observed reduction in false negatives suggests that this negative-context learning strategy may have reduced augmentation-induced positional or background bias and encouraged the model to rely more strongly on spinal contour information. However, because a detailed loss-level decomposition was not performed in this study, the contribution of background-only negative samples to individual loss components should be interpreted qualitatively rather than as direct evidence from separate loss-branch analyses.

6. Conclusions

This study presents a YOLO-based instance segmentation framework for three-level scoliosis screening (normal, mild, and severe) using standing whole-spine radiographs. In conventional semantic segmentation-based approaches, scoliosis severity labels are learned as pixel-level classes, which can lead to class fragmentation, whereby different severity classes are predicted within the same spine. In the present study, the entire spinal contour was defined as a single anatomical instance, and severity was predicted on the basis of its global structural information, thereby alleviating this limitation.
To address class imbalance, class-balanced offline augmentation was applied. In addition, background-only negative samples constructed from non-spinal regions in radiographs were incorporated during training to mitigate the increase in false negatives observed after augmentation. Experimental results demonstrated that these background-only negative samples were effective in reducing false negatives, which is particularly important in screening-oriented applications.
At the network level, CBAM and the Swin Transformer block were compared to adapt YOLO-based segmentation to the characteristics of scoliosis radiographs, and U-Net and DeepLabV3+-based segmentation heads were also evaluated. The results showed that CBAM effectively improved accuracy and reduced false negatives with low computational overhead, whereas the DeepLabV3+-based head improved spinal contour restoration through multi-scale contextual modeling and boundary refinement. Ultimately, the proposed model combining CBAM with a DeepLabV3+-based head improved both spinal contour extraction and three-level severity screening performance relative to the baseline YOLO segmentation model.
Although the present study did not directly estimate the Cobb angle, it provides preliminary evidence that whole-spine instance segmentation may be useful for assisting three-level scoliosis screening. Future work should include external validation with multi-center datasets, prospective workflow evaluation, quantitative inter-observer agreement analysis, systematic comparison with semantic segmentation baselines, and extension toward vertebra-level segmentation and Cobb angle estimation.

Author Contributions

Conceptualization, H.H. and H.K.; resources, H.K.; validation, Y.H.; data curation, Y.H.; writing—original draft preparation, H.H.; writing—review and editing, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Eulji University in 2025 (Grant No. EJRG-25-13; A Study on the Development of AI-Based Evaluation Metrics for Scoliosis).

Institutional Review Board Statement

The study protocol was approved by the Institutional Review Board of Eulji University (IRB No. EU24-77, 2025.02.18).

Informed Consent Statement

This study used retrospectively collected de-identified whole-spine X-ray images, and informed consent was waived under the approval of the Institutional Review Board.

Data Availability Statement

The data used in this study are de-identified medical imaging data; however, they are not publicly available in a repository owing to patient privacy protection and the conditions of Institutional Review Board approval. Upon reasonable request, inquiries may be directed to the corresponding author in accordance with relevant regulations and institutional approval procedures.

Conflicts of Interest

Author Hochul Kim is affiliated with MiCS Platforms Co., Ltd. However, this study was conducted independently without any commercial or financial support from the company, and the authors declare no conflicts of interest.

References

  1. Cobb, J. Outline for the study of scoliosis. Instr. Course Lect. 1948, 5, 261–275. [Google Scholar]
  2. Kuznia, A.L.; Hernandez, A.K.; Lee, L.U. Adolescent idiopathic scoliosis: Common questions and answers. Am. Fam. Physician 2020, 101, 19–23. [Google Scholar]
  3. Menger, R.; Sin, A. Adolescent idiopathic scoliosis. In StatPearls; StatPearls: Treasureland, FL, USA, 2023. [Google Scholar]
  4. Horne, J.P.; Flannery, R.; Usman, S. Adolescent idiopathic scoliosis: Diagnosis and management. Am. Fam. Physician 2014, 89, 193–198. [Google Scholar]
  5. Malfair, D.; Flemming, A.K.; Dvorak, M.F.; Munk, P.L.; Vertinsky, A.T.; Heran, M.K.; Graeb, D.A. Radiographic evaluation of scoliosis. Am. J. Roentgenol. 2010, 194, S8–S22. [Google Scholar] [CrossRef] [PubMed]
  6. Morrissy, R.T.; Goldsmith, G.; Hall, E.; Kehl, D.; Cowie, G. Measurement of the Cobb angle on radiographs of patients who have scoliosis. Evaluation of intrinsic error. J. Bone Jt. Surg. Am. 1990, 72, 320–327. [Google Scholar] [CrossRef]
  7. Langensiepen, S.; Semler, O.; Sobottke, R.; Fricke, O.; Franklin, J.; Schönau, E.; Eysel, P. Measuring procedures to determine the Cobb angle in idiopathic scoliosis: A systematic review. Eur. Spine J. 2013, 22, 2360–2371. [Google Scholar] [CrossRef] [PubMed]
  8. Zhu, Y.; Yin, X.; Chen, Z.; Zhang, H.; Xu, K.; Zhang, J.; Wu, N. Deep learning in Cobb angle automated measurement on X-rays: A systematic review and meta-analysis. Spine Deform. 2025, 13, 19–27. [Google Scholar] [CrossRef] [PubMed]
  9. Keenan, B.E.; Izatt, M.T.; Askin, G.N.; Labrom, R.D.; Pearcy, M.J.; Adam, C.J. Supine to standing Cobb angle change in idiopathic scoliosis: The effect of endplate pre-selection. Scoliosis 2014, 9, 16. [Google Scholar] [CrossRef]
  10. Vavruch, L.; Tropp, H. A comparison of Cobb angle: Standing versus supine images of late-onset idiopathic scoliosis. Pol. J. Radiol. 2016, 81, 270. [Google Scholar] [CrossRef]
  11. Lee, W.; Shin, K.; Lee, J.; Yoo, S.-J.; Yoon, M.A.; Choi, Y.W.; Hong, G.-S.; Kim, N.; Paik, S. Diagnosis of scoliosis using chest radiographs with a semi-supervised generative adversarial network. J. Korean Soc. Radiol. 2022, 83, 1298. [Google Scholar] [CrossRef]
  12. Shahid, A.; Kim, J.; Byon, S.S.; Hong, S.; Lee, I.; Lee, B.-D. An end-to-end pipeline for automated scoliosis diagnosis with standardized clinical reporting using SNOMED CT. Sci. Rep. 2025, 15, 17274. [Google Scholar] [CrossRef]
  13. Maeda, Y.; Nagura, T.; Nakamura, M.; Watanabe, K. Automatic measurement of the Cobb angle for adolescent idiopathic scoliosis using convolutional neural network. Sci. Rep. 2023, 13, 14576. [Google Scholar] [CrossRef]
  14. Yang, J.; Zhang, K.; Fan, H.; Huang, Z.; Xiang, Y.; Yang, J.; He, L.; Zhang, L.; Yang, Y.; Li, R. Development and validation of deep learning algorithms for scoliosis screening using back images. Commun. Biol. 2019, 2, 390. [Google Scholar] [CrossRef]
  15. Li, K.; Gu, H.; Colglazier, R.; Lark, R.; Hubbard, E.; French, R.; Smith, D.; Zhang, J.; McCrum, E.; Catanzano, A. Deep learning automates Cobb angle measurement compared with multi-expert observers. BJR|Artif. Intell. 2025, 2, ubaf009. [Google Scholar] [CrossRef]
  16. Li, L.; Zhang, T.; Lin, F.; Li, Y.; Wong, M.-S. Automated 3d cobb angle measurement using u-net in ct images of preoperative scoliosis patients. J. Imaging Inform. Med. 2025, 38, 309–317. [Google Scholar] [CrossRef] [PubMed]
  17. İlkhan, İ.H.; Gümüşkaya, H.; Turgut, F. Vertebra Segmentation and Cobb Angle Calculation Platform for Scoliosis Diagnosis Using Deep Learning: SpineCheck. Informatics 2025, 12, 140. [Google Scholar] [CrossRef]
  18. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  19. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  20. Cai, Z.; Zhou, K.; Liao, Z. A systematic review of YOLO-based object detection in medical imaging: Advances, challenges, and future directions. Comput. Mater. Contin. 2025, 85, 2255. [Google Scholar] [CrossRef]
  21. Xiong, M.; Wu, A.; Yang, Y.; Fu, Q. Efficient Brain Tumor Segmentation for MRI Images Using YOLO-BT. Sensors 2025, 25, 3645. [Google Scholar] [CrossRef]
  22. Sapkota, R.; Karkee, M. Ultralytics YOLO evolution: An overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 object detectors for computer vision and pattern recognition. arXiv 2025, arXiv:2510.09653. [Google Scholar]
  23. Liu, Y.; Bai, X.; Wang, J.; Li, G.; Li, J.; Lv, Z. Image semantic segmentation approach based on DeepLabV3 plus network with an attention mechanism. Eng. Appl. Artif. Intell. 2024, 127, 107260. [Google Scholar] [CrossRef]
  24. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  25. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 14 September 2018; pp. 801–818. [Google Scholar]
  26. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  27. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  28. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Kim, M.; Bae, H.J. Data Augmentation Techniques for Deep Learning-Based Medical Image Analyses. J. Korean Soc. Radiol. 2020, 81, 1290–1304. [Google Scholar] [CrossRef]
  31. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  32. Zhang, P.; Lai, Z.; Chen, W.; Wu, X.; Kong, H. FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention. Proc. AAAI Conf. Artif. Intell. 2026, 40, 12681–12689. [Google Scholar] [CrossRef]
  33. Huang, W.; Hu, X.; Abousamra, S.; Prasanna, P.; Chen, C. Hard negative sample mining for whole slide image classification. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 144–154. [Google Scholar]
  34. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2980–2988. [Google Scholar]
  35. Seesawad, N.; Ittichaiwong, P.; Sudhawiyangkul, T.; Sawangjai, P.; Thuwajit, P.; Boonsakan, P.; Sripodok, S.; Veerakanjana, K.; Charngkaew, K.; Pongpaibul, A. PseudoCell: Hard negative mining as pseudo labeling for deep learning-based centroblast cell detection. IEEE Open J. Eng. Med. Biol. 2024, 5, 514–523. [Google Scholar] [CrossRef] [PubMed]
  36. Obaido, G.; Mienye, I.D.; Aruleba, K.; Chukwu, C.W.; Esenogho, E.; Modisane, C. A Systematic Review of Contrastive Learning in Medical AI: Foundations, Biomedical Modalities, and Future Directions. Bioengineering 2026, 13, 176. [Google Scholar] [CrossRef]
  37. Weinstein, S.L.; Dolan, L.A.; Wright, J.G.; Dobbs, M.B. Effects of bracing in adolescents with idiopathic scoliosis. N. Engl. J. Med. 2013, 369, 1512–1521. [Google Scholar] [CrossRef]
  38. Li, Y.-T.; Chan, Y.-C.; Huang, C.-C.; Hsu, Y.-C.; Chen, S.-H. YOLOSeg with applications to wafer die particle defect segmentation. Sci. Rep. 2025, 15, 2311. [Google Scholar] [CrossRef]
  39. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  40. Chen, T.; Li, L.; Saxena, S.; Hinton, G.; Fleet, D.J. A generalist framework for panoptic segmentation of images and videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 909–919. [Google Scholar]
Figure 1. Examples of standing whole-spine radiographs (a) and their corresponding segmentation masks (b).
Figure 2. Example of a polygon mask (b) applied to the original mask (a).
Figure 3. Examples of offline data augmentation applied to spine radiographs. The original image (right) is transformed by rotation (±5°), scaling (0.8–1.2×), cropping, and their combination, simulating real-world acquisition conditions. All transformations preserve the spinal contour and Cobb angle interpretation, and the same geometric transformations are applied to the segmentation masks to maintain image–label consistency.
Figure 4. Predicted example of semantic segmentation.
Figure 5. Architectural overview of the baseline YOLO instance-segmentation pipeline. The network consists of four stages. (i) Input: an H × W × 3 image is fed to the model. (ii) Backbone: a stem (Conv 3 × 3, s = 2) followed by three CSP/C2f stages extracts multi-scale features, yielding a high-resolution map P3 (80 × 80), a mid-resolution map P4 (40 × 40), and a low-resolution map P5 (20 × 20). (iii) Neck: FPN/PAN-style top-down and bottom-up fusion refines these features into N3, N4, and N5. (iv) YOLO Segment Head: the refined features are processed by a mask branch that predicts prototype masks and per-instance mask coefficients, and a detection branch that regresses bounding boxes and predicts objectness and class scores. Each instance mask is reconstructed as a linear combination of the prototype masks and its mask coefficients, producing the final instance-level prediction—class label (Normal, Mild, Severe), bounding box, and pixel-wise segmentation mask.
Figure 6. Randomly generated background-only negative samples.
Figure 7. Architectural overview of the proposed YOLO-based segmentation network with DeepLabV3+ feature enhancement and CBAM attention. (i) Input: an H × W × 3 image. (ii) Backbone (layers 0–8): alternating Conv and C3k2 blocks progressively downsample the image and produce hierarchical features P2/4, P3/8, P4/16, and P5/32, with A2C2f blocks at deeper stages strengthening representational capacity. (iii) YOLO Neck (layers 9–20): a multi-scale feature aggregation path combines upsampling, concatenation, A2C2f, and Conv operations to produce P3, P4, and P5 head features. (iv) DeepLabV3+ Feature Enhancement Head (layers 21–26): an ASPP module captures multi-scale context from P5, while P3 is projected by a 1 × 1 Conv and upsampled by 4× bilinear sampling; the two are concatenated and refined by 3 × 3 Convs to yield an enhanced P3 with both deep semantic and shallow spatial cues. (v) CBAM Attention Modules (layers 27–29): channel and spatial attention are applied in parallel to the enhanced P3, the refined P4, and the P5 feature, suppressing background noise and emphasizing object regions. (vi) Final YOLO Segment Head (layer 30): the attended multi-scale features feed Segment [nc, 32, 256], producing instance segmentation masks together with class predictions over nc = 3 categories (Normal, Mild, Severe).
Figure 8. Differences from existing models.
Figure 9. Representative correct detection examples of the proposed model. Each image shows a standing whole-spine radiograph with the predicted spinal contour mask and the predicted severity class (normal/mild/severe) with confidence score.
Figure 10. Representative false detection and misclassification examples of the proposed model. Borderline cases near the 10° and 25° clinical thresholds exhibit inherent diagnostic uncertainty, which is compounded by low radiographic contrast and anatomical overlap. Compared with the baseline YOLOv12 model, which frequently missed these structural variations entirely, the proposed network reduces total false negatives from 238 to 189 cases, although localized contour fragmentation under suboptimal contrast remains a shared radiological challenge.
Figure 11. Training process of the proposed model.
Figure 12. Confusion matrix of the proposed model. The “background” row and column follow the YOLO detection-evaluation convention and indicate no-detection or unmatched false-positive bins, rather than an additional clinical severity class in the test set.
Figure 13. Precision-recall curve of the proposed model after three training sessions.
Figure 14. Loss value graph by epoch.
Table 1. Original Dataset.

| Class | Total Numbers | Train | Validation | Test |
|---|---|---|---|---|
| normal | 5464 | 3278 | 1092 | 1094 |
| mild | 3128 | 1876 | 625 | 627 |
| severe | 407 | 244 | 81 | 82 |
Table 2. Comparison of mAP@0.5 across polygon contour point settings using YOLOv12-based instance segmentation. All experiments were conducted under identical training conditions.

| Point | mAP@0.5 |
|---|---|
| 240 | 0.834 |
| 360 | 0.828 |
| 480 | 0.835 |
Table 3. Final dataset after augmentation.

| Class | Total Numbers | Train | Validation | Test |
|---|---|---|---|---|
| normal | 5464 | 3278 | 1092 | 1094 |
| mild | 3652 | 2400 | 625 | 627 |
| severe | 2563 | 2400 | 81 | 82 |
| background-only | 1200 | 1200 | 0 | 0 |
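The arithmetic linking Table 1 to Table 3 can be checked with a short sketch. The training counts are taken from Table 1; the per-class target of 2400 is read off the Train column of Table 3 rather than stated as a rule in the text, and the function name is hypothetical.

```python
# Training-split counts from Table 1.
TRAIN_COUNTS = {"normal": 3278, "mild": 1876, "severe": 244}
TOTAL_COUNTS = {"normal": 5464, "mild": 3128, "severe": 407}

# Assumption: minority classes are augmented up to the Train value seen in Table 3.
TARGET_TRAIN = 2400


def augmentation_plan(train_counts, target):
    """Number of augmented images needed per class; classes above target get none."""
    return {cls: max(0, target - n) for cls, n in train_counts.items()}


plan = augmentation_plan(TRAIN_COUNTS, TARGET_TRAIN)
new_totals = {cls: TOTAL_COUNTS[cls] + plan[cls] for cls in TOTAL_COUNTS}
print(plan)
print(new_totals)
```

Only the training split grows, so the validation and test counts are identical in both tables, and the severe class needs by far the most augmented images relative to its original size.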
Table 4. Model parameter comparison.

| Model | Parameters |
|---|---|
| YOLOv8m-seg | 27.3 M |
| YOLOv9m-seg | 20.0 M |
| YOLO11m-seg | 22.4 M |
| YOLOv12m-seg | 20.2 M |
| Proposed Network Architecture | 27.9 M |
Table 5. Comparison results by model (Bold text indicates peak performance).

| Base Model | Attention Module | Segmentation Head | Dataset | F1-Score | mAP@0.5 (Box and Mask) (<±0.001) | FN |
|---|---|---|---|---|---|---|
| YOLOv8 | - | - | Original | 0.76 | 0.802 | 207 |
| YOLOv9 | - | - | Original | 0.75 | 0.799 | 219 |
| YOLO11 | - | - | Original | 0.75 | 0.703 | 208 |
| YOLOv12 | - | - | Original | 0.76 | 0.824 | 238 |
| YOLOv8 | - | - | Class-Balanced | 0.77 | 0.834 | 259 |
| YOLOv9 | - | - | Class-Balanced | 0.76 | 0.830 | 280 |
| YOLO11 | - | - | Class-Balanced | 0.76 | 0.834 | 271 |
| YOLOv12 | - | - | Class-Balanced | 0.76 | 0.835 | 253 |
| YOLOv12 | CBAM | - | Final | 0.78 | 0.837 | 211 |
| YOLOv12 | Swin | - | Final | 0.77 | 0.824 | 212 |
| YOLOv12 | - | DeepLabV3+ | Final | 0.76 | 0.826 | 206 |
| YOLOv12 | - | U-Net | Final | 0.78 | 0.838 | 223 |
| YOLOv12 | CBAM | DeepLabV3+ | Final | 0.78 | 0.837 | 189 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Hwang, H.; Hyun, Y.; Kim, H. Boundary-Enhanced YOLO-Based Instance Segmentation with Background-Only Negative Samples for Three-Level Scoliosis Severity Screening in Whole-Spine Radiography. Appl. Sci. 2026, 16, 5492. https://doi.org/10.3390/app16115492
