1. Introduction
Scoliosis is a complex anatomical deformity characterized by lateral curvature and axial rotation of the spine, and adolescent idiopathic scoliosis (AIS) is the most commonly encountered form in clinical practice [1]. AIS is generally defined as a spinal curvature of unknown etiology that occurs in adolescents between 10 and 18 years of age and is diagnosed when the Cobb angle measured on standing whole-spine radiographs is 10° or greater. The prevalence of AIS in the adolescent population has been reported to be approximately 1–3%, and because the curvature may progress during growth, early screening and continuous follow-up are essential [2,3].
The clinical evaluation of scoliosis can be divided into primary screening and subsequent image-based quantitative assessment [4]. In the primary screening stage, the forward bend test and scoliometer may be used. Definitive diagnosis and severity assessment, however, rely on radiographic measurement of the Cobb angle. On standing whole-spine radiographs, the Cobb angle is defined as the angle between a line drawn along the superior endplate of the most tilted upper end vertebra and a line drawn along the inferior endplate of the most tilted lower end vertebra. It is a key metric throughout scoliosis management, including diagnosis, assessment of progression, evaluation of brace treatment indications, and surgical planning. Furthermore, standing whole-spine radiography serves as the fundamental imaging modality for scoliosis evaluation because it provides information not only on curve magnitude but also on the overall morphology and alignment of the spinal deformity [5].
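As an illustration of the definition above, the following minimal sketch computes the angle between two endplate lines. It assumes that each endplate has already been marked by two (x, y) points; the function names are hypothetical, and this calculation is not part of the screening model proposed in this study.

```python
import math


def line_angle_deg(p1, p2):
    """Inclination of the line through p1 and p2, in degrees from the horizontal."""
    (x1, y1), (x2, y2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def cobb_angle_deg(upper_endplate, lower_endplate):
    """Acute angle between two endplate lines, each given as a pair of (x, y) points."""
    diff = abs(line_angle_deg(*upper_endplate) - line_angle_deg(*lower_endplate)) % 180.0
    # Lines have no direction, so the angle is folded into the range [0, 90].
    return min(diff, 180.0 - diff)


# Hypothetical endplates tilted by +10 degrees and -15 degrees.
upper = ((0.0, 0.0), (100.0, 100.0 * math.tan(math.radians(10.0))))
lower = ((0.0, 200.0), (100.0, 200.0 - 100.0 * math.tan(math.radians(15.0))))
angle = cobb_angle_deg(upper, lower)
```

Because the result depends only on the two line orientations, a small change in end-vertebra selection or endplate placement translates directly into a change in the measured angle.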
However, Cobb angle measurement is inherently observer-dependent and vulnerable to multiple sources of error during image interpretation. How accurately the spinal contour and overall spinal morphology are identified substantially affects the reliability of quantitative evaluation in scoliosis diagnosis [6,7]. Specifically, end-vertebra selection, endplate line placement, and interpretation of vertebral boundaries and corners depend heavily on the clinician’s experience and judgment. These tasks are particularly challenging for less experienced readers because delineation of spinal boundaries is hindered by the low contrast of radiographs and the overlap of anatomical structures. Previous studies have reported intra- and inter-observer variability of approximately 4–8° in Cobb angle measurement, and recent automated measurement studies have also reported manual measurement error ranges of approximately 3–10° [6,7,8]. Moreover, patient posture is itself an important source of variation: differences in curve magnitude have been observed between standing and recumbent or supine images, and prior studies have shown that the standing Cobb angle may be approximately 7–10° greater than the supine Cobb angle [9,10]. These findings underscore an inherent limitation of scoliosis diagnosis, in which a difference of only a few degrees may alter the diagnostic category or treatment decision.
To address these limitations, artificial intelligence (AI), particularly deep learning-based medical image analysis, has increasingly been applied to scoliosis assessment. Many previous studies have focused on automated Cobb angle measurement using vertebral detection, landmark estimation, end-vertebra selection, or endplate orientation analysis [8,11,12,13,14]. Although these approaches are valuable for quantitative assessment, clinical screening also requires rapid severity stratification and false-negative reduction. Therefore, this study focuses on three-level severity screening based on clinician-assigned labels derived from Cobb angle criteria, rather than direct Cobb angle estimation. In this context, morphology-preserving segmentation remains important because the overall spinal contour provides structural information relevant to scoliosis severity [8,15,16,17].
However, when scoliosis severity classification is performed within a segmentation framework, semantic segmentation may produce a class fragmentation problem if severity is learned as a pixel-level class [18,19]. In such cases, different regions within a single spinal object may be assigned different severity classes, because semantic segmentation assigns a label to each pixel and does not enforce a single consistent label across an entire object instance [18]. In this context, the YOLO (You Only Look Once) family offers a practical architecture for scoliosis screening. YOLO was originally proposed as a single-stage framework that predicts object location and class labels end-to-end within a single network, and it has been widely adopted in real-world applications that require rapid inference and relatively low computational burden [20]. More recently, YOLO has been extended from object detection to instance segmentation, enabling simultaneous prediction of object-specific masks and classes. YOLO-based medical imaging studies have also demonstrated strong performance and rapid processing speed for tumor detection, lesion detection, and segmentation tasks, making this family a suitable candidate for clinical decision-support systems [21,22]. Because standing whole-spine radiographs typically contain one principal spinal structure, there is little need for a panoptic segmentation framework that jointly handles multiple object categories and background classes. Instead, it is more appropriate to define the entire spine as a single instance and simultaneously predict its mask and severity class using a YOLO-based instance segmentation architecture that balances accuracy and efficiency [14].
Accordingly, this study proposes a YOLO-based instance segmentation framework for three-level scoliosis severity screening (normal, mild, and severe) using standing whole-spine radiographs. By learning clinician-assigned severity labels derived from Cobb angle measurements, the proposed method may provide a useful framework for assisting clinical screening, although further external validation and prospective evaluation are required to confirm its clinical applicability.
Specifically, this study has three aims: (1) to reformulate scoliosis severity classification from a pixel-level semantic segmentation problem into a whole-spine instance-level structural classification problem; (2) to improve data-level reliability by applying class-balanced augmentation and background-only negative samples to mitigate class imbalance and augmentation-induced false negatives; and (3) to evaluate whether previously established attention modules and segmentation heads can improve spinal contour restoration and severity classification when integrated into a YOLO-based instance segmentation framework. Therefore, the contribution of this study lies not in proposing a fundamentally new algorithmic module, but in the task-specific integration and empirical evaluation of dataset-level refinement and network-level adaptation for screening-oriented scoliosis assessment.
The major contributions of this study are summarized as follows:
First, we reformulated scoliosis severity screening from a pixel-level classification problem into a whole-spine instance-level structural classification problem.
Second, as a dataset reconfiguration strategy, we introduced background-only negative samples to mitigate data imbalance and reduce false negatives, an empirical approach evaluated specifically on standing whole-spine radiographs.
Third, we experimentally evaluated the task-specific integration of CBAM and a DeepLabV3+-based segmentation head within a YOLO-based instance segmentation framework. The results suggest that this configuration can improve spinal contour representation and screening-oriented severity classification performance in standing whole-spine radiographs.
The remainder of this paper is organized as follows.
Section 2 reviews studies related to scoliosis diagnosis, automated Cobb angle measurement, medical image segmentation, YOLO-based instance segmentation, and attention modules.
Section 3 describes the dataset, spinal contour annotation, class-balanced augmentation, construction of background-only negative samples, the YOLO-based segmentation architecture, improvements to attention modules and segmentation heads, and the training conditions.
Section 4 presents comparisons between semantic and instance segmentation, the effects of augmentation and background-only negative samples, comparisons of attention modules and segmentation heads, and the ablation results and class-wise screening performance of the final proposed model.
Section 5 discusses the clinical implications of the findings, the limitations of the study, and potential extensions toward automated Cobb angle measurement. Finally,
Section 6 concludes the paper.
3. Materials and Methods
The aim of this study was not to develop an automated measurement model that directly computes the Cobb angle from standing whole-spine radiography, but rather to develop a three-level AI screening model that classifies normal, mild, and severe scoliosis based on the overall spinal contour visible on standing whole-spine radiography.
3.1. Dataset
This study used standing whole-spine radiographs obtained from Gyeonggi-do Medical Center Pocheon Hospital. All images were acquired for the diagnosis or follow-up of scoliosis, and each image contained a single principal spinal structure. The study protocol was approved by the Institutional Review Board of Eulji University (IRB No. EU24-77).
During the initial data screening process, radiographs with inconsistent Cobb angle measurements or unclear severity assignments were excluded before construction of the final dataset. Because this exclusion was performed prior to final dataset compilation, the number of excluded images was not separately recorded.
The final dataset used in this study comprised 8999 standing whole-spine radiographs. Each image was assigned to one of three severity classes (normal, mild, or severe) according to the Cobb angle measured by clinicians or documented in the corresponding radiology reports. The class definitions used in this study were as follows: normal (Cobb angle < 10°), mild (10° ≤ Cobb angle < 25°), and severe (Cobb angle ≥ 25°) [37].
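The class definitions above amount to a simple threshold rule. The sketch below is illustrative only: the function name is hypothetical, and the class IDs 0, 1, and 2 follow the assignment described in Section 3.2.

```python
def severity_class(cobb_angle_deg):
    """Map a Cobb angle in degrees to (class_id, class_name) using the 10 and 25 degree thresholds."""
    if cobb_angle_deg < 0:
        raise ValueError("Cobb angle must be non-negative")
    if cobb_angle_deg < 10.0:
        return 0, "normal"
    if cobb_angle_deg < 25.0:
        return 1, "mild"
    return 2, "severe"
```

Both thresholds are inclusive on the upper class, so an angle of exactly 10° is labeled mild and an angle of exactly 25° is labeled severe.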
To prevent data leakage and ensure unbiased model evaluation, the dataset was partitioned at the patient level. Each image corresponded to a unique patient, and no patient was included in more than one subset. The dataset was divided into training, validation, and test sets at a ratio of 6:2:2, respectively. The original dataset is summarized in
Table 1.
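A patient-level 6:2:2 partition of the kind described above could be implemented along the following lines. This is a minimal sketch rather than the procedure actually used: the function name, the use of a shuffled list of patient identifiers, and the seed value are assumptions made for illustration.

```python
import random


def split_by_patient(patient_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign each unique patient ID to exactly one of train/val/test."""
    unique_ids = sorted(set(patient_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n = len(unique_ids)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),  # remainder goes to the test set
    }


# Hypothetical identifiers; real IDs would come from the hospital records.
splits = split_by_patient([f"P{i:03d}" for i in range(100)])
```

Splitting on patient identifiers rather than on image files ensures that no patient can contribute images to more than one subset.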
The 10° and 25° class thresholds are consistent with widely used clinical criteria in scoliosis diagnosis and management, in which a Cobb angle of 10° defines the diagnostic threshold for scoliosis and 25° serves as a key inflection point guiding brace treatment decisions.
3.2. Spine Contour Annotation and Label Construction
The spine contour annotations used in this study were performed by two board-certified clinical specialists and reviewed by one professor of rehabilitation medicine. The two specialists independently delineated the outer boundaries of the spinal structure on all standing whole-spine radiographs. Cases showing disagreement between the two specialists regarding the spinal contour annotation were excluded before construction of the final dataset. In addition, radiographs with inconsistent Cobb angle measurements or unclear severity labels were also excluded during the pre-dataset screening stage. Because these exclusion procedures were conducted before final dataset compilation, the number of excluded cases was not separately recorded. After this screening process, segmentation masks of the spinal region were generated based on the accepted annotations. The annotations were subsequently reviewed and validated by the professor of rehabilitation medicine. The resulting spine contour masks were then converted into polygon annotations for YOLO segmentation training.
Figure 1 presents examples of standing whole-spine radiographs and their corresponding segmentation masks.
For YOLO-based instance segmentation training, the spine contour masks were converted into polygon annotation format. First, the outer contour of the spine was extracted from each binary mask, and the closed contour was resampled into points using spline-based interpolation. To determine the appropriate number of polygon contour points, a preliminary quantitative comparison was conducted using three point settings: 240, 360, and 480 points. For each setting, YOLOv12-based instance segmentation was trained under identical conditions, and mAP@0.5 was evaluated on the validation set. The results are summarized in
Table 2.
As shown in
Table 2, the 480-point setting achieved the highest mAP@0.5 (0.835), whereas the 360-point setting yielded the lowest (0.828). The 240-point setting produced a result comparable to 480 points (0.834); however, visual inspection by the two clinical specialists and the professor of rehabilitation medicine indicated that the 240-point contour exhibited insufficient resolution in regions of high spinal curvature, where boundary detail is clinically meaningful. The 480-point setting was therefore selected as it provided the most consistent balance between contour fidelity and annotation complexity. Accordingly, each spinal instance was converted into a polygon annotation consisting of 480 contour points, and each coordinate was normalized by the image width and height before being saved in YOLO segmentation label format.
The final converted annotations were stored in the format <class_id> <x1> <y1> <x2> <y2>… <xn> <yn>, where class_id represents the scoliosis severity class and (xi, yi) represents the normalized polygon contour coordinates.
Figure 2 shows an example of a polygon mask applied to the original mask.
Because each image contained a single spinal instance, each YOLO segmentation label consisted of one polygon object and one severity class. The class IDs were assigned as 0, 1, and 2 for normal, mild, and severe cases, respectively. In other words, the YOLO segmentation labels used in this study were designed not only to predict the location and mask of the spine, but also to simultaneously predict the severity class of the spinal instance. During training, whenever augmentations such as image resizing, rotation, cropping, and scaling were applied, the same geometric transformations were also applied to the masks and polygon coordinates to maintain image–label consistency.
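The conversion from an extracted contour to a 480-point label line can be sketched as follows. Note the assumptions: the contour is taken as an already extracted list of pixel coordinates, the function names are hypothetical, and linear interpolation along the arc length is used here as a simpler stand-in for the spline-based interpolation described above.

```python
import math


def resample_closed_contour(points, n_points=480):
    """Return n_points spaced evenly by arc length along the closed polyline."""
    closed = list(points) + [points[0]]
    seg_lengths = [math.dist(a, b) for a, b in zip(closed, closed[1:])]
    step = sum(seg_lengths) / n_points
    resampled = []
    seg_idx, seg_start = 0, 0.0  # current segment and the arc length at its start
    for k in range(n_points):
        target = k * step
        # Advance to the segment that contains the target arc length.
        while seg_idx < len(seg_lengths) - 1 and seg_start + seg_lengths[seg_idx] < target:
            seg_start += seg_lengths[seg_idx]
            seg_idx += 1
        (x0, y0), (x1, y1) = closed[seg_idx], closed[seg_idx + 1]
        length = seg_lengths[seg_idx]
        t = (target - seg_start) / length if length > 0 else 0.0
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


def to_yolo_seg_line(class_id, points, img_w, img_h):
    """Format '<class_id> <x1> <y1> ... <xn> <yn>' with coordinates normalized to [0, 1]."""
    coords = " ".join(f"{x / img_w:.6f} {y / img_h:.6f}" for x, y in points)
    return f"{class_id} {coords}"


# Hypothetical rectangular contour in a 640 x 640 image, labeled as mild (class 1).
rect = [(100, 50), (200, 50), (200, 450), (100, 450)]
label_line = to_yolo_seg_line(1, resample_closed_contour(rect), 640, 640)
```

Each label line therefore contains one class ID followed by 960 normalized coordinate values.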
3.3. Class-Balanced Offline Augmentation
The scoliosis dataset used in this study exhibited an imbalance in the number of samples among the normal, mild, and severe classes. Such class imbalance may bias the model toward the majority class and, particularly in screening settings, may increase false negatives, in which actual scoliosis cases are misclassified as normal. To address this issue, offline data augmentation was applied to balance the distribution of the training data across classes.
As described in
Section 2.5, the augmentation strategy was restricted to physical transformations simulating real-world radiographic acquisition conditions. Rotation was limited to ±5° and scaling to 0.8–1.2, ensuring that neither the overall spinal contour nor the Cobb angle interpretation was materially affected. Cropping was applied within a range that did not result in excessive loss of the spinal contour. Consequently, misclassifications near the 10° and 25° severity thresholds are attributable to the inherent difficulty of borderline cases in real radiographs, rather than to augmentation-induced label distortion. Representative examples of the applied augmentations are shown in
Figure 3.
During all augmentation processes, parameter values were randomly assigned to avoid overrepresentation of specific ranges. The same geometric transformations were applied to both the original images and the segmentation masks to prevent label mismatch and maintain image-label consistency.
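The following minimal sketch illustrates how rotation and scaling parameters might be drawn from the stated ranges and applied identically to the polygon coordinates. The function names are hypothetical, uniform sampling is an assumption, and the same transform would also have to be applied to the image and mask by an image library, which is not shown here.

```python
import math
import random


def sample_params(rng):
    """Draw a rotation within +/-5 degrees and a scale within 0.8-1.2 (uniform sampling assumed)."""
    return {"angle_deg": rng.uniform(-5.0, 5.0), "scale": rng.uniform(0.8, 1.2)}


def transform_points(points, angle_deg, scale, center):
    """Rotate by angle_deg and scale about center, in pixel coordinates."""
    cx, cy = center
    c = math.cos(math.radians(angle_deg)) * scale
    s = math.sin(math.radians(angle_deg)) * scale
    return [
        (cx + c * (x - cx) - s * (y - cy), cy + s * (x - cx) + c * (y - cy))
        for x, y in points
    ]


rng = random.Random(0)
params = sample_params(rng)
# Hypothetical polygon vertices transformed about the center of a 640 x 640 image.
polygon = [(300.0, 100.0), (340.0, 100.0), (340.0, 540.0), (300.0, 540.0)]
augmented = transform_points(polygon, params["angle_deg"], params["scale"], (320.0, 320.0))
```

Because image rows increase downward, the visual direction of a positive angle depends on the image library's convention; what matters for label consistency is that the image, mask, and polygon share one transform.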
After offline augmentation, the number of training images in the normal class remained unchanged, whereas the mild and severe classes were each increased to 2400 images. This strategy enabled the model to learn the differences in spinal contours among normal, mild, and severe cases in a more balanced manner while minimizing bias toward any specific class. It should be noted that, while overall accuracy improved following augmentation, an increase in false negatives was also observed in some experimental settings. This is interpreted not as augmentation-induced label distortion, but as a consequence of the model encountering a more balanced class distribution: as the model was exposed to a greater proportion of mild and severe cases, the decision boundary near the severity thresholds became more sensitive, which may have increased misclassifications in borderline cases. This observation motivated the subsequent introduction of background-only negative samples to further stabilize screening performance.
3.4. Preliminary Experiment with Semantic Segmentation
In the initial stage of this study, a preliminary semantic segmentation-based approach was explored to examine whether scoliosis severity could be represented as pixel-level classes. The semantic segmentation model was trained to predict the normal, mild, and severe classes at the pixel level. However, scoliosis severity is determined not by the attributes of individual pixels or local regions, but by the degree of curvature and structural alignment of the overall spinal contour.
In this preliminary observation, the semantic segmentation-based approach showed a class fragmentation problem, whereby different severity classes were predicted in different regions of a single spinal structure, as illustrated in
Figure 4. This finding was used as a methodological motivation for reformulating scoliosis severity screening as an instance-level structural classification problem.
However, this preliminary comparison was qualitative in nature and was not intended as a systematic quantitative comparison between semantic segmentation and instance segmentation baselines. A comprehensive quantitative evaluation of semantic segmentation models, including class fragmentation frequency, class-wise recall, and mask-based metrics, would require additional experiments with multiple semantic segmentation baselines and a clearly defined pixel-level error criterion.
Accordingly, this study redefined scoliosis assessment not as a pixel-level classification problem, but as a whole-spine, instance-level structural classification problem. Based on this reformulation, subsequent experiments focused on a YOLO-based instance segmentation framework in which the entire spine was treated as a single anatomical instance.
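To make the class fragmentation problem concrete, the sketch below checks whether a per-pixel prediction assigns more than one severity class to the spine. The pixel encoding (0 for background, 1 to 3 for normal, mild, and severe) and the function names are assumptions for illustration; no such fragmentation metric was computed in this study.

```python
from collections import Counter


def spine_class_histogram(label_map, background=0):
    """Count the pixels assigned to each severity class in a 2D per-pixel label map."""
    return Counter(v for row in label_map for v in row if v != background)


def is_fragmented(label_map, background=0):
    """True if more than one severity class appears within the predicted spine."""
    return len(spine_class_histogram(label_map, background)) > 1


def majority_class(label_map, background=0):
    """Most frequent severity class, or None if no spine pixel was predicted."""
    hist = spine_class_histogram(label_map, background)
    return hist.most_common(1)[0][0] if hist else None


# Hypothetical prediction: the upper spine is labeled mild (2), the lower part severe (3).
fragmented_pred = [
    [0, 2, 2, 0],
    [0, 2, 2, 0],
    [0, 3, 2, 0],
    [0, 3, 3, 0],
]
```

A majority vote such as `majority_class` could collapse a fragmented map into one label after the fact, whereas the instance-level formulation adopted here produces a single class per spine by construction.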
3.5. YOLO-Based Instance Segmentation
In this study, a YOLO-based segmentation model was employed to detect the entire spinal contour as a single object instance while simultaneously predicting the normal, mild, and severe classes. YOLO is a single-stage detection framework that predicts bounding boxes, class probabilities, and objectness scores directly from the input image. The basic architecture is illustrated in
Figure 5.
For the segmentation task, an additional mask prediction branch was incorporated to generate an instance mask for each object. The baseline models evaluated in this study were YOLOv8, YOLOv9, YOLO11, and YOLOv12.
The YOLO segmentation model consists of a backbone, neck, detection head, and segmentation head. The backbone extracts multi-scale features from the input radiographs, whereas the neck fuses features at different resolutions. The detection and segmentation heads use feature maps at each scale to predict bounding boxes, class labels, confidence scores, and segmentation masks.
Because standing whole-spine radiographs generally contain a single major spinal structure, a panoptic segmentation framework designed to simultaneously handle multiple object categories and background stuff classes was considered unnecessary in this study. Instead, the entire spine was defined as a single anatomical instance, and an instance segmentation approach was adopted to simultaneously predict both the mask and the severity class of that instance.
3.6. Construction of Background-Only Negative Samples
After the application of class-balanced augmentation, the overall accuracy of the model improved; however, an increase in false negatives was observed in some experiments. This finding suggests that the model became more sensitive to positional information or background patterns than to the global structural contour of the spine, likely owing to variations in spinal position, image centering, and background distribution introduced by cropping.
The primary objective of this study was three-level scoliosis screening, in which false negatives (cases in which actual scoliosis patients are incorrectly classified as normal) represent a clinically critical error. To address this issue, background-only negative samples were constructed by extracting regions without the spine from standing whole-spine radiographs and incorporating them into the training process as empty-label samples. This approach was intended to mitigate positional bias and background-dependent feature learning that may arise during augmentation. Using this strategy, we investigated whether the model could more reliably distinguish non-spinal regions from true spinal instances and thereby reduce clinically important false negatives in screening.
To generate background-only negative samples, background regions that did not overlap with the spinal object were automatically extracted from the entire training dataset. First, the bounding region of the spinal object was obtained from the label file of each image. A prohibited region was then defined by applying padding around both the image center and the spinal bounding box. Random crops were repeatedly generated, and any crop overlapping the spinal region by even a single pixel was excluded. In addition, excessively dark regions or regions with extremely low contrast were removed so that only background samples containing a minimal amount of image information were retained.
The selected crops were then saved as empty-label images and added to the original training dataset. This process enabled the model to more clearly distinguish background regions without the spine from true spinal objects. Background-only negative samples were added only to the training set and were not included in the validation or test sets. Ultimately, 1200 background-only negative samples were generated, corresponding to approximately 15% of the non-background augmented training set. The number of background-only negative samples was empirically selected during model development to provide sufficient negative-context exposure without excessively increasing empty-label samples. A systematic ratio-ablation study of background-only samples was not conducted in the present revision and is therefore acknowledged as an important topic for future work. These samples, shown in
Figure 6, were used together with the original and augmented images during training.
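The crop selection procedure described above can be sketched as rejection sampling. This is a simplified illustration under stated assumptions: the crop size, padding, and intensity thresholds are hypothetical values, the function names are invented for this example, and the additional prohibited region around the image center is omitted.

```python
import random
import statistics


def boxes_overlap(a, b):
    """True if axis-aligned boxes (x0, y0, x1, y1) share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def pad_box(box, pad, img_w, img_h):
    """Expand a box by pad pixels on every side, clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(x0 - pad, 0), max(y0 - pad, 0), min(x1 + pad, img_w), min(y1 + pad, img_h))


def sample_background_crops(img_w, img_h, spine_box, crop_size, n_crops,
                            pad=32, max_tries=1000, seed=0):
    """Draw random square crops and keep only those outside the padded spine box."""
    rng = random.Random(seed)
    forbidden = pad_box(spine_box, pad, img_w, img_h)
    crops = []
    for _ in range(max_tries):
        if len(crops) == n_crops:
            break
        x0 = rng.randint(0, img_w - crop_size)
        y0 = rng.randint(0, img_h - crop_size)
        crop = (x0, y0, x0 + crop_size, y0 + crop_size)
        if not boxes_overlap(crop, forbidden):
            crops.append(crop)
    return crops


def has_enough_signal(pixels, min_mean=20.0, min_std=5.0):
    """Reject crops that are too dark or too flat (thresholds are hypothetical)."""
    flat = [v for row in pixels for v in row]
    return statistics.fmean(flat) >= min_mean and statistics.pstdev(flat) >= min_std


# Hypothetical 1000 x 2000 radiograph with the spine in the central column.
spine_box = (400, 100, 600, 1900)
crops = sample_background_crops(1000, 2000, spine_box, crop_size=200, n_crops=5)
```

Each accepted crop would then be saved with an empty label file so that it enters training as a background-only sample.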
This dataset-level reconstruction was designed to complement class-balanced augmentation. While augmentation increased the diversity of scoliosis-related samples, background-only negative samples provided negative contextual information from the same radiographic domain. Therefore, the model was encouraged not only to learn class-balanced spinal contour variations, but also to distinguish true spinal instances from non-spinal anatomical or background regions. This complementary data strategy was particularly important for reducing screening false negatives, which are clinically more critical than false positives in scoliosis screening.
The final dataset, including the background-only negative samples, is presented in
Table 3.
3.7. Integration of Attention Modules
To improve the performance of YOLO segmentation, attention modules were incorporated into the backbone in this study. Attention modules enhance the representational capacity of the model by emphasizing diagnostically relevant channels or spatial regions within feature maps. In particular, they allow the model to focus more effectively on spinal contours and curvature patterns that are critical for diagnosis.
In scoliosis diagnosis, overall spinal curvature, lateral deviation, and contour morphology are important factors. Therefore, it is necessary not only to capture local features but also to stably represent the global structural characteristics of the entire spine. In this study, two representative attention modules, CBAM and the Swin Transformer block, both of which have demonstrated effectiveness in YOLO-based object detection, were incorporated and comparatively evaluated.
3.7.1. CBAM
CBAM is a lightweight attention module that sequentially applies channel attention and spatial attention. Channel attention learns the relative importance of each channel within the feature maps, whereas spatial attention identifies the regions within the image on which the model should focus. In this study, CBAM was inserted after the major feature extraction blocks of the YOLO backbone to emphasize features associated with the spinal contour.
CBAM was adopted from its original formulation without structural modification [
28]. The CBAM-integrated architecture was implemented by inserting the attention blocks into the YOLO backbone while preserving the original backbone, neck, and head structure. The contribution of incorporating CBAM in this study lies in the experimental confirmation that it can improve spinal contour representation and three-level severity screening performance within a whole-spine instance segmentation framework applied to scoliosis radiographs.
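For reference, the sequential channel and spatial attention of the original CBAM formulation can be written as follows, where F is the input feature map, σ is the sigmoid function, MLP is a shared multilayer perceptron, f^{7×7} is a 7 × 7 convolution, and ⊗ denotes element-wise multiplication. The pooling in the channel attention map is taken over the spatial dimensions, whereas the pooling in the spatial attention map is taken along the channel axis.

```latex
\begin{aligned}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \\
M_s(F) &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F);\,\mathrm{MaxPool}(F)])\big), \\
F' &= M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'.
\end{aligned}
```

The refined feature map F'' has the same shape as F, which is why the module can be inserted after backbone blocks without altering the surrounding architecture.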
3.7.2. Swin Transformer Block
To explore the potential of Transformer-based attention mechanisms, the Swin Transformer block was incorporated into the YOLO backbone. The Swin Transformer, originally proposed by Liu et al. [
29], partitions input features into local windows and performs self-attention within each window, while the shifted-window mechanism enables interaction across regions. This module was not newly proposed in this study; it was included solely to investigate whether a Transformer-based approach could offer additional performance gains over CNN-based attention in the scoliosis screening task.
However, Transformer-based architectures generally require greater computational cost and memory consumption than CNN-based attention modules. Therefore, in practical clinical screening environments, the trade-off between performance improvement and computational efficiency must be carefully considered.
3.8. Comparison of Segmentation Head Architectures
Segmentation head architectures previously established in semantic segmentation were incorporated into the YOLO segmentation framework to improve restoration of spinal contour boundaries. Specifically, U-Net- and DeepLabV3+-based heads were adopted from prior studies and compared without claiming them as newly designed modules [
23]. The effects of each architecture on spinal mask reconstruction and severity classification performance were analyzed in the specific context of whole-spine radiographic scoliosis screening.
The U-Net-based head facilitates boundary refinement through skip connections. The DeepLabV3+-based head combines multi-scale contextual feature extraction using atrous convolution with decoder-based boundary refinement, which may be particularly advantageous for preserving the morphological characteristics of the long, continuous spinal contour.
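The multi-scale context provided by atrous convolution can be illustrated by its effective kernel size, which grows with the dilation rate while the number of weights stays fixed. In the sketch below, the rates 6, 12, and 18 are those of the original DeepLabV3+ design and are an assumption here, since the rates used in this study are not stated; the function name is hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# A 3 x 3 kernel at dilation rates 1, 6, 12, and 18 (rates assumed, see above).
extents = [effective_kernel_size(3, r) for r in (1, 6, 12, 18)]
```

Parallel branches with different rates therefore sample context at several spatial extents at the same parameter cost per branch, which is the property expected to help with a long, continuous structure such as the spine.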
These architectures were compared within the same YOLO-based segmentation framework to identify the most suitable configuration for the scoliosis dataset. The detection and classification branches of the baseline YOLO segmentation model were preserved, whereas only the segmentation head responsible for mask prediction was modified according to each architecture. This design allowed the model to retain severity class prediction for each spinal instance while improving boundary representation during mask decoding.
The two segmentation heads were compared under identical YOLO backbone conditions, and not only segmentation performance but also classification performance for the normal, mild, and severe classes, as well as the reduction in false negatives, were comprehensively evaluated.
3.9. Proposed Network Architecture
The final network configuration used in this study was constructed by integrating CBAM [
28] and a DeepLabV3+-based segmentation head [
25] into the YOLOv12 segmentation baseline, which showed the best baseline segmentation performance among the evaluated YOLO models. These components were not newly designed in this study; rather, they were selected and combined to adapt the YOLO-based instance segmentation framework to the characteristics of standing whole-spine radiographs and three-level scoliosis severity screening.
The resulting model preserves the single-stage segmentation structure of YOLO while incorporating CBAM to enhance features related to the spinal contour and a DeepLabV3+-based head to strengthen multi-scale contextual representation and boundary restoration. The purpose of this configuration is not to introduce a fundamentally new network module, but to evaluate whether these established components can improve contour extraction and severity screening performance when applied to whole-spine radiographs. The overall architecture consists of the following stages. First, the input standing whole-spine radiograph is fed into the YOLO backbone. Second, CBAM is applied to the major feature extraction blocks of the backbone to recalibrate channel and spatial features. Third, multi-scale feature fusion is performed in the neck. Fourth, the detection and classification branch predicts the bounding box, confidence score, and severity class of the spinal instance, whereas the DeepLabV3+-based segmentation head reconstructs the spinal instance mask. Finally, the model simultaneously outputs the spinal mask and the severity class (normal, mild, or severe).
Figure 7 illustrates the architecture of the proposed model, in which CBAM and the DeepLabV3+-based head are integrated into the baseline YOLO segmentation framework.
Figure 8 shows the differences between the basic YOLO and the proposed model.
In this study, the term “boundary-enhanced” refers specifically to the enhancement of spinal contour boundary representation. It does not refer to Cobb angle threshold boundaries between severity classes. The DeepLabV3+-based head was incorporated to improve mask quality and boundary restoration of the elongated spinal contour, whereas CBAM was used to emphasize features related to the spinal contour and spatial structure.
3.10. Training Design
The standing whole-spine radiographs used for training varied in size and contrast depending on the imaging equipment, patient body habitus, degree of spinal curvature, and image resolution. Therefore, to improve training stability, all images were resized to a uniform input resolution of 640 × 640. During the resizing process, a simple resizing method was applied while minimizing distortion of the spinal aspect ratio.
To evaluate the effects of each component on model performance, all experiments were conducted under identical training conditions. The same training, validation, and test split was used throughout the study to ensure a fair comparison.
The hardware environment consisted of an Intel Core Ultra 9 285K processor, 128 GB of RAM, and an NVIDIA GeForce RTX 5090 GPU with 32 GB of memory. The software environment included Python 3.10, PyTorch 2.11.0, and CUDA 12.8. The initial learning rate was set to 0.01 and decayed linearly to 0.0001 over the course of training (lrf = 0.01). The batch size was 32, and the maximum number of training epochs was 300. The stochastic gradient descent (SGD) optimizer was used with a momentum of 0.937 and a weight decay of 0.0005. A linear warmup was applied over the first 3 epochs.
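The learning rate schedule described above can be sketched as a function of the epoch index. This is a simplified, per-epoch approximation under stated assumptions: the function name is hypothetical, and the warmup is modeled as a linear ramp over whole epochs, whereas training frameworks typically apply warmup per iteration.

```python
def learning_rate(epoch, epochs=300, lr0=0.01, lrf=0.01, warmup_epochs=3):
    """Linear decay from lr0 to lr0 * lrf, with a simplified linear warmup."""
    decay = (1.0 - epoch / epochs) * (1.0 - lrf) + lrf
    lr = lr0 * decay
    if epoch < warmup_epochs:
        lr *= (epoch + 1) / warmup_epochs  # simplified ramp over the first epochs
    return lr


final_lr = learning_rate(300)
```

With lr0 = 0.01 and lrf = 0.01, the final learning rate is lr0 × lrf = 0.0001, which matches the value stated in the text.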
For YOLO-based instance segmentation training, the total loss consisted of four terms: a bounding-box localization loss, a classification loss, a distribution focal loss (DFL), and a segmentation mask loss.
The box loss and DFL were used for bounding-box localization, the classification loss was used for severity class prediction, and the segmentation mask loss was used to optimize the predicted spinal instance mask. The box, classification, and DFL loss weights were set to 7.5, 0.5, and 1.5, respectively, following the YOLOv12 configuration. The segmentation mask loss was computed by the segmentation branch during mask prediction. Automatic mixed precision (AMP) training was enabled to reduce memory usage. Mosaic augmentation was disabled during the final 10 epochs (close_mosaic = 10) to stabilize training convergence. All models were trained from scratch without pre-trained weights, and early stopping was applied with a patience of 100 epochs.
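Written out, the weighted sum described above takes the following form. The weights of the box, classification, and DFL terms are those stated in the text; the weight of the mask term is not reported, so it is left as an unspecified symbol $\lambda_{\text{mask}}$, which is an assumption of this sketch.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{box}}\,\mathcal{L}_{\text{box}}
  + \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}
  + \lambda_{\text{dfl}}\,\mathcal{L}_{\text{dfl}}
  + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}},
\qquad
\lambda_{\text{box}} = 7.5,\quad
\lambda_{\text{cls}} = 0.5,\quad
\lambda_{\text{dfl}} = 1.5 .
```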
The computational efficiency of the proposed model was evaluated on the test set using the NVIDIA GeForce RTX 5090 GPU. The end-to-end inference speed, including preprocessing, model inference, and postprocessing, was approximately 83.3 images per second (approximately 12.0 ms per image), indicating that the proposed framework is capable of near-real-time processing suitable for clinical screening workflows. The number of trainable parameters for each model is summarized in Table 4. Because FLOPs and peak GPU memory usage were not measured in this study, the computational analysis should be interpreted as partial profiling rather than a complete efficiency benchmark, and we avoid claiming comprehensive computational superiority over the baseline models.
For reference, according to the official Ultralytics benchmarks, the baseline YOLOv12m-seg architecture requires approximately 87.3 GFLOPs. Detailed FLOPs and GPU memory profiling, including multi-GPU server-level memory profiling and sample-specific FLOPs variation, were reserved for future work owing to institutional runtime constraints. The proposed network nevertheless retained a parameter count (27.9 M) comparable to that of standard mid-sized models, which supports its practical feasibility without excessive hardware overhead. The random seed was set to 0 to improve the reproducibility of the experiments.
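The relation between per-image latency and throughput quoted above is simple arithmetic, and an end-to-end timing loop can be sketched as follows. The helper `measure_latency_ms` and the `pipeline` callable are hypothetical stand-ins for the full preprocessing, inference, and postprocessing path; the measurement code actually used in this study is not described.

```python
import time
from typing import Callable, Iterable


def throughput_from_latency(latency_ms: float) -> float:
    """Convert mean per-image latency (ms) into images per second."""
    return 1000.0 / latency_ms


def measure_latency_ms(pipeline: Callable[[object], object],
                       images: Iterable[object],
                       warmup: int = 0) -> float:
    """Mean wall-clock time per image for an end-to-end callable.

    `pipeline` is a hypothetical function covering preprocessing,
    inference, and postprocessing for a single image.
    """
    images = list(images)
    for image in images[:warmup]:      # optional warmup passes, not timed
        pipeline(image)
    start = time.perf_counter()
    for image in images:
        pipeline(image)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)


if __name__ == "__main__":
    # 12.0 ms per image corresponds to roughly 83.3 images per second.
    print(f"{throughput_from_latency(12.0):.1f} images/s")
```

When timing on a GPU, kernels are launched asynchronously, so the device generally needs to be synchronized before the clock is read; that step is omitted here to keep the sketch framework-independent.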
3.11. Evaluation Metrics
The primary objective of this study was to accurately extract the spinal contour from standing whole-spine radiographs and to perform three-level scoliosis screening for normal, mild, and severe cases. Accordingly, model evaluation was conducted with consideration of not only detection performance but also false-negative reduction, particularly the clinically significant misclassification of mild and severe cases as normal.
3.11.1. Detection and Segmentation Evaluation
To evaluate the object detection and mask prediction performance of the YOLO segmentation model, mean Average Precision (mAP) was used as the primary evaluation metric [38].
Recall denotes the proportion of actual spinal objects correctly detected by the model, whereas precision denotes the proportion of predicted spinal objects that corresponded to true spinal objects.
Average Precision (AP) is defined as the area under the precision–recall curve, and mAP is computed as the mean AP across all classes.
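The definitions above correspond to the standard expressions below, where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, $p(r)$ is precision as a function of recall, and $N$ is the number of classes (three in this study). The notation is conventional and is introduced here only for illustration.

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} p(r)\,dr, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i .
```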
Intersection over Union (IoU) quantifies the degree of overlap between the predicted mask and the ground truth mask and is defined as the area of their intersection divided by the area of their union. In this study, both bounding box-based mAP and mask-based mAP were calculated.
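As a minimal illustration of the IoU definition, the sketch below computes IoU for two binary masks and for their tight bounding boxes. The function names and the 3 × 3 toy masks are illustrative assumptions and do not reproduce the evaluation code used in this study.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def mask_iou(pred: Mask, truth: Mask) -> float:
    """IoU between two binary masks of identical shape."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tight inclusive box (x_min, y_min, x_max, y_max) around a mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return min(xs), min(ys), max(xs), max(ys)


def box_iou(a: Tuple[int, int, int, int], b: Tuple[int, int, int, int]) -> float:
    """IoU between two inclusive pixel boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = ix * iy
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


if __name__ == "__main__":
    # Toy example: a thin diagonal contour and a thicker prediction of it.
    truth = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    pred = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
    print("mask IoU:", mask_iou(pred, truth))
    print("box IoU: ", box_iou(bounding_box(pred), bounding_box(truth)))
```

In this toy case the two tight boxes coincide (box IoU of 1.0) while the mask IoU is 0.6, which illustrates that box-level and mask-level overlap need not agree for a thin, elongated structure.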
3.11.2. Severity Classification Evaluation
Given the screening-oriented purpose of this study, the number of false negatives was adopted as a major evaluation metric alongside the detection and segmentation metrics. Specifically, cases in which patients with actual mild or severe scoliosis were classified as normal or were not detected at all were defined as screening false negatives.
This metric made it possible to quantify clinically important missed detections and underestimation errors that arise during actual screening, in addition to overall average performance.
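The definition of a screening false negative can be expressed as a simple counting rule. In the sketch below, a missing detection is represented by `None`; this encoding, the function name, and the example labels are illustrative assumptions.

```python
from typing import Optional, Sequence

ABNORMAL = {"mild", "severe"}


def count_screening_false_negatives(
    truths: Sequence[str],
    predictions: Sequence[Optional[str]],
) -> int:
    """Count abnormal cases predicted as normal or not detected (None)."""
    if len(truths) != len(predictions):
        raise ValueError("truths and predictions must have the same length")
    return sum(
        1
        for truth, pred in zip(truths, predictions)
        if truth in ABNORMAL and (pred is None or pred == "normal")
    )


if __name__ == "__main__":
    # Hypothetical labels, not data from this study.
    truths = ["normal", "mild", "severe", "mild", "severe"]
    preds = ["normal", "normal", None, "severe", "severe"]
    print(count_screening_false_negatives(truths, preds))
```

Note that a mild case predicted as severe is a classification error but not a screening false negative under this definition, because the case would still be flagged as abnormal.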
4. Results
In this study, the primary objective was not simply to localize the spine in standing whole-spine radiographs, but to stably extract the entire spinal contour and perform three-level scoliosis screening (normal, mild, and severe) based on this structural information. In scoliosis screening, false negatives, in which actual mild or severe cases are incorrectly classified as normal, constitute a clinically important error. Accordingly, model evaluation included not only mask mAP and overall accuracy but also the number of false negatives. Representative detection results are presented in Figure 9 and Figure 10.
First, semantic segmentation was compared with YOLO-based instance segmentation to evaluate the need for an instance-level approach in scoliosis severity classification. The results showed that semantic segmentation frequently caused class fragmentation, in which different severity classes were assigned to different regions of a single spine. In contrast, the YOLO-based instance segmentation framework consistently predicted a single severity label for the entire spinal structure.
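Class fragmentation can be detected mechanically in a pixel-wise class map. The sketch below assumes one spine per image, a background label of 0, and integer ids 1 to 3 for normal, mild, and severe; these encodings and the function names are illustrative and are not taken from this study.

```python
from typing import List, Set

BACKGROUND = 0  # assumed label for non-spinal pixels


def foreground_classes(class_map: List[List[int]]) -> Set[int]:
    """Distinct non-background class ids present in a pixel-wise class map."""
    return {label for row in class_map for label in row if label != BACKGROUND}


def is_fragmented(class_map: List[List[int]]) -> bool:
    """True when more than one severity class appears within one spine."""
    return len(foreground_classes(class_map)) > 1


if __name__ == "__main__":
    # Toy maps: 2 = mild, 3 = severe (hypothetical encoding).
    semantic_output = [[0, 2, 0], [0, 2, 0], [0, 3, 0]]  # mixed labels
    instance_output = [[0, 2, 0], [0, 2, 0], [0, 2, 0]]  # one label per spine
    print(is_fragmented(semantic_output), is_fragmented(instance_output))
```

An instance-level prediction cannot be fragmented in this sense, because a single class label is attached to the whole mask rather than to each pixel.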
Subsequently, using the baseline YOLO segmentation model as the reference, class-balanced offline augmentation, background-only negative samples, attention modules, and segmentation head architectures were sequentially evaluated. Finally, the proposed model integrating CBAM with a DeepLabV3+-based segmentation head was compared with the baseline model and with variants incorporating individual modules.
The proposed model stably extracted the entire spinal contour from standing whole-spine radiographs and classified each spinal instance into one of three severity categories: normal, mild, or severe. To reduce the influence of random initialization, each experiment was repeated three times using a fixed random seed. The model achieving the best validation performance among the three runs was selected, and its test-set performance is reported in Table 5. In particular, the class fragmentation observed in semantic segmentation was substantially reduced in the YOLO-based instance segmentation approach.
It should be noted that, in all experimental settings, the mask-based mAP@0.5 and the bounding-box-based mAP@0.5 were numerically identical under the adopted evaluation setting. This agreement is likely related to the single-instance nature of the dataset, in which each standing whole-spine radiograph contains exactly one spinal object: detection success or failure is determined at the instance level, and at an IoU threshold of 0.5 the mask criterion produced the same outcome as the bounding-box criterion for each prediction. Accordingly, the mAP@0.5 values reported in Table 5 represent both box-level and mask-level performance. However, this agreement should not be interpreted as a general property of single-instance datasets, because mask-level overlap and bounding-box overlap may differ depending on contour quality.
The detailed training process of the proposed model is presented in Figure 11.
Training and validation losses decreased steadily with increasing epochs, and validation performance exhibited convergence after a certain number of epochs. These results suggest that the proposed model was able to learn spinal contour features and severity classes from standing whole-spine radiographs in a stable and consistent manner.
The confusion matrix is presented in Figure 12.
Class-wise analysis showed that the proposed model achieved stable recall not only for the normal class but also for the mild and severe classes. In particular, errors in which actual mild or severe cases were misclassified as normal were reduced compared with the baseline model. This improvement is likely attributable to the proposed model more effectively reflecting the overall spinal contour structure and curvature information.
Analysis of the confusion matrix showed that misclassifications occurred primarily between adjacent severity classes. For example, some mild cases were classified as normal or severe, whereas some severe cases were classified as mild. Such errors are likely to arise in cases located near the class boundaries defined by the Cobb angle criteria. Therefore, future studies should consider the continuous relationship between actual Cobb angle values and predicted classes.
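The class-wise recall and the share of errors falling on adjacent severity classes can both be read from a confusion matrix. The sketch below uses hypothetical labels rather than the data of this study, and the function names are illustrative.

```python
from typing import List, Sequence

CLASSES = ["normal", "mild", "severe"]  # ordered by severity


def confusion_matrix(truths: Sequence[str], preds: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for truth, pred in zip(truths, preds):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_recall(matrix: List[List[int]]) -> List[float]:
    """Recall for each true class (diagonal entry divided by the row total)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def adjacent_error_fraction(matrix: List[List[int]]) -> float:
    """Share of misclassifications that fall on a neighbouring severity class."""
    n = len(matrix)
    errors = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    adjacent = sum(matrix[i][j] for i in range(n) for j in range(n)
                   if abs(i - j) == 1)
    return adjacent / errors if errors else 0.0


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    truths = ["normal", "normal", "mild", "mild", "mild", "severe", "severe"]
    preds = ["normal", "mild", "mild", "normal", "severe", "severe", "mild"]
    cm = confusion_matrix(truths, preds)
    print(cm, per_class_recall(cm), adjacent_error_fraction(cm))
```

A fraction close to 1 indicates that nearly all errors occur between neighbouring classes, which is the pattern expected when misclassified cases lie near the Cobb angle thresholds.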
The precision–recall curve is presented in Figure 13, and the loss curves for each training run are shown in Figure 14.
Overall, the proposed model maintained high precision and recall, with recall showing particular improvement for the mild and severe classes compared with the baseline model. The class-wise AP@0.5 of the proposed model was as follows: normal = 0.902, mild = 0.756, severe = 0.854. These values indicate that detection performance was maintained across all three severity categories, although the mild class remained the most difficult to detect.
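As a worked example of the mAP definition, the unweighted mean of the three class-wise AP@0.5 values quoted above can be computed directly. This is only an arithmetic illustration; the values reported in Table 5 remain authoritative, and the function name is an assumption.

```python
# Class-wise AP@0.5 values quoted in the text.
class_ap = {"normal": 0.902, "mild": 0.756, "severe": 0.854}


def mean_ap(ap_by_class: dict) -> float:
    """Unweighted mean of class-wise AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


if __name__ == "__main__":
    # (0.902 + 0.756 + 0.854) / 3 = 2.512 / 3, approximately 0.837
    print(f"mean AP@0.5 = {mean_ap(class_ap):.3f}")
```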
To assess training stability, each experiment was repeated three times using fixed random seeds, and the best-performing run on the validation set was selected for test-set evaluation. Across all three runs, the maximum observed difference in mAP@0.5 did not exceed 0.002, indicating that the reported results are stable and reproducible under the given training conditions.
5. Discussion
This study aimed to develop a screening-oriented AI model for assisting three-level scoliosis assessment (normal, mild, and severe) based on standing whole-spine radiographs.
Building upon clinician-assigned severity labels derived from established Cobb angle criteria, which are clinically used to guide management decisions such as observation, follow-up, bracing, and surgical referral, the proposed framework treats the entire spinal contour as a single anatomical instance. This approach is not intended to replace precise Cobb angle measurement or vertebra-level analysis, which remain essential for definitive diagnosis, curve monitoring, and treatment planning. Rather, it is designed to serve as a complementary screening aid for rapid case triage and missed-detection prevention. Accordingly, this formulation supports screening-oriented severity classification while aiming to reduce clinically important false negatives.
Initial experiments confirmed that a semantic segmentation-based approach was not well suited to scoliosis severity classification. Although semantic segmentation showed some ability to extract the spinal contour itself, it produced a class fragmentation problem in which different severity classes were simultaneously predicted within a single spinal structure. This limitation arises because scoliosis severity is determined not by the features of specific pixels or local regions, but by the overall curvature pattern and structural deformity of the spine. In other words, semantic segmentation operates at the pixel level, whereas actual clinical judgment is made at the whole-spine level. Owing to this fundamental mismatch, semantic segmentation may be useful for contour extraction but remains limited for stable severity classification. In contrast, the proposed YOLO-based instance segmentation framework alleviated this inconsistency by defining the entire spine as a single anatomical instance and assigning one severity class to it.
Two-stage instance segmentation models, such as Mask R-CNN [39], can provide high-quality masks; however, their structural complexity and relatively higher computational cost may limit their suitability for rapid screening workflows. By contrast, YOLO-based one-stage segmentation performs detection, classification, and segmentation within a unified network, providing a favorable balance between accuracy and computational efficiency. This structural advantage has been highlighted in recent reviews of YOLO-based medical imaging applications, which emphasize its potential suitability for clinical decision-support tasks requiring rapid inference. In addition, because standing whole-spine radiographs typically contain one principal spinal structure, treating the entire spine as a single anatomical instance is more consistent with the clinical decision unit of this study than using a panoptic segmentation framework designed to jointly address instance-level objects and background regions [40]. The findings of this study also suggest that YOLO-based instance segmentation can preserve the overall spinal contour while performing three-level severity screening.
From a data perspective, class imbalance had an important influence on model reliability. The collected dataset was not evenly distributed across the normal, mild, and severe classes, and such imbalance can cause the model to become biased toward majority classes. This issue is particularly important in a screening-oriented model because decreased recall for disease-related classes directly increases false negatives, in which actual patients are incorrectly judged to be normal. To alleviate this problem, class-balanced offline augmentation based on physical transformations, including cropping, rotation, scaling and Mosaic augmentation, was applied. These strategies increase training diversity while preserving the diagnostic meaning of medical images. In practice, overall accuracy and several classification metrics improved after augmentation. However, some experiments also showed a simultaneous increase in false negatives.
These findings indicate that data augmentation does not always translate into clinically meaningful performance improvement. In scoliosis radiographs, spinal position, acquisition range, patient posture, and surrounding background structures all influence feature learning. Excessive geometric augmentation may cause the model to become more sensitive to positional information or background patterns than to the structural curvature of the spine itself. In such cases, overall accuracy may improve, yet clinically important abnormal cases may still be missed, thereby increasing false negatives. Thus, improvements in classification metrics do not necessarily imply improved clinical reliability.
To address this limitation, background-only negative samples were introduced as empty-label negative-context samples. Unlike conventional hard negative mining, which mainly focuses on suppressing false positives, these samples were specifically designed to reduce false negatives caused by augmentation-induced positional bias in screening-oriented radiographs. They were generated by extracting non-spinal regions from standing whole-spine radiographs, in a manner tailored to the characteristics of these images and informed by prior studies on negative or hard-negative sample strategies in object detection. They were not treated as an additional class alongside normal, mild, and severe, but as training samples without object labels. This strategy was intended to help the model distinguish more clearly between regions that contain the spine and those that do not, thereby reducing excessive dependence on positional information or background patterns and directing attention to the spinal contour itself. In practice, screening false negatives decreased after the introduction of background-only negative samples, suggesting that negative-context learning can be an effective strategy for screening-oriented medical image AI and may help preserve recall. From a clinical perspective, because overlooking actual mild or severe cases is a critical error, reducing false negatives is more meaningful than merely increasing overall accuracy.
However, a detailed analysis of how these samples contribute to individual loss components (such as objectness-related loss, classification loss, and mask loss) was not conducted in this study, as it was beyond the primary scope of the present work. This loss-level mechanistic analysis should be addressed in future research.
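One possible way to construct background-only samples is sketched below. How non-spinal regions were actually selected is not described in this study, so the strategy shown here (full-height strips to the left and right of the spine bounding box), the `min_width` parameter, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def background_regions(image_w: int, image_h: int, spine: Box,
                       min_width: int = 32) -> List[Box]:
    """Full-height strips to the left and right of the spine box.

    Strips narrower than `min_width` pixels are discarded.
    """
    x_min, _, x_max, _ = spine
    candidates = [(0, 0, x_min, image_h), (x_max, 0, image_w, image_h)]
    return [box for box in candidates if box[2] - box[0] >= min_width]


def overlaps(a: Box, b: Box) -> bool:
    """True when two boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def empty_label() -> str:
    """Label content for a background-only sample: no object lines at all."""
    return ""


if __name__ == "__main__":
    spine_box = (250, 40, 390, 600)  # hypothetical spine location
    for region in background_regions(640, 640, spine_box):
        print(region, "overlaps spine:", overlaps(region, spine_box))
```

Each cropped region would be saved as an image paired with an empty label, so that it contributes no foreground target during training.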
In the comparison of attention modules, CBAM showed greater practical utility than the Swin Transformer block. Although Swin Transformer can effectively model long-range dependencies through self-attention, it substantially increases computational cost and memory requirements. CBAM, by contrast, is a considerably lighter module and was nevertheless effective in improving accuracy and reducing false negatives. This may be because, in standing whole-spine radiographs, the spine already dominates the image as the principal anatomical structure; consequently, more precise emphasis on channels and spatial regions relevant to the spinal contour may be more beneficial than additional global long-range dependency modeling. In other words, for scoliosis screening, boundary preservation and spatial refinement appear to be more important than further expansion of global contextual modeling.
A similar interpretation was supported by the comparison of segmentation heads. Among the evaluated segmentation heads, the U-Net-based head achieved slightly higher mAP@0.5, whereas the DeepLabV3+-based head showed a lower number of screening false negatives and provided stable contour-oriented representation. Therefore, DeepLabV3+ was selected as the final segmentation head because the present study prioritized screening false-negative reduction and boundary restoration rather than mAP alone. This result is likely attributable to its ability to provide both multi-scale contextual information and boundary refinement simultaneously. Scoliosis diagnosis depends not simply on the morphology of individual vertebrae, but on the continuous curvature pattern of the entire spine, including both thoracic and lumbar segments. Atrous spatial pyramid pooling can effectively capture such multi-scale structural information, while the decoder facilitates fine restoration of spinal contour boundaries. Although U-Net also improved segmentation performance to some extent, DeepLabV3+ best preserved both contour continuity and the stability of severity classification.
Ultimately, the proposed model combining CBAM and a DeepLabV3+-based segmentation head showed the best overall performance. CBAM mainly contributed to accuracy improvement and false-negative reduction, whereas DeepLabV3+ contributed to improved mask quality and boundary restoration. By integrating both components, spinal contour reconstruction and screening stability were simultaneously improved, and the model achieved the lowest screening false-negative rate among all experimental settings. These findings emphasize that a clinically useful scoliosis screening system should not merely aim for high overall accuracy, but should also reliably detect abnormal cases.
This study has several limitations. First, it performed three-level classification based on clinician-assigned severity labels rather than direct regression of the Cobb angle. Although this approach is appropriate for screening, it cannot fully replace the quantitative measurements required for actual surgical planning or treatment decisions. Direct Cobb angle regression requires precise vertebral landmark localization, end-vertebra selection, and endplate orientation analysis, all of which remain highly sensitive to image quality, anatomical overlap, and observer variability. In contrast, the present study prioritized robust screening performance and false-negative reduction in practical clinical workflows.
Second, because the dataset was collected from a limited number of institutions, generalizability to diverse imaging protocols and patient populations may be restricted. Additional external validation using multi-center datasets will therefore be necessary to confirm the robustness and clinical applicability of the proposed framework.
Third, misclassifications still occurred in some cases located near the boundaries between normal and mild or between mild and severe. Although Cobb angle measurements were performed by clinicians according to standard clinical criteria, continuous Cobb angle values were converted into discrete severity categories in this study. Therefore, cases located close to the 10° and 25° thresholds may still involve inherent label uncertainty. In addition, low radiographic contrast and anatomical overlap may have contributed to false detections or misclassifications by making spinal contour delineation more difficult.
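The conversion from a continuous Cobb angle to a discrete class, and the identification of borderline cases, can be sketched as follows. The 10° and 25° thresholds are taken from the text, but the treatment of angles exactly at a threshold and the 4° margin used to flag borderline cases are assumptions; the margin was chosen here only because it matches the lower end of the observer variability reported in the literature.

```python
from typing import Sequence


def severity_from_cobb(angle_deg: float,
                       mild_threshold: float = 10.0,
                       severe_threshold: float = 25.0) -> str:
    """Map a Cobb angle to a severity class.

    Assumes mild covers [10, 25) and severe covers >= 25 degrees; how the
    study treats angles exactly at a threshold is not stated.
    """
    if angle_deg < mild_threshold:
        return "normal"
    if angle_deg < severe_threshold:
        return "mild"
    return "severe"


def is_borderline(angle_deg: float, margin_deg: float = 4.0,
                  thresholds: Sequence[float] = (10.0, 25.0)) -> bool:
    """True when the angle lies within `margin_deg` of any class threshold."""
    return any(abs(angle_deg - t) <= margin_deg for t in thresholds)


if __name__ == "__main__":
    for angle in (8.0, 12.0, 17.5, 24.0, 31.0):
        print(angle, severity_from_cobb(angle), is_borderline(angle))
```

Cases flagged as borderline are those for which a measurement difference of a few degrees would change the assigned label, which is the source of the label uncertainty noted above.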
Fourth, the comparison between semantic segmentation and instance segmentation in this study was qualitative rather than quantitative. A systematic quantitative evaluation involving multiple semantic segmentation baselines with class-wise recall and mask-based metrics is recommended in future work.
Future work should extend the present whole-spine instance segmentation framework to vertebra-level instance segmentation, in which individual vertebrae are separated as distinct anatomical instances. Because the Cobb angle is measured using the endplate orientation of the most tilted superior and inferior end vertebrae, accurate automated measurement requires vertebral separation, vertebral endplate detection, end-vertebra selection, and angle calculation to be performed together. Accordingly, future studies should expand the framework toward a multi-task learning structure that integrates vertebra-level instance segmentation with endplate-based angle estimation, thereby enabling both screening and quantitative Cobb angle regression in an end-to-end manner. Such an approach could evolve into a more comprehensive scoliosis diagnosis system that supports not only early screening but also practical clinical treatment planning.
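The final angle-calculation step mentioned above can be illustrated with a simplified sketch. It assumes that endplate endpoints are already available for each vertebra, ordered from left to right in image coordinates, and that a single curve is being measured; landmark detection and end-vertebra selection, which are the difficult parts, are not addressed. All names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def endplate_tilt_deg(left: Point, right: Point) -> float:
    """Signed tilt of an endplate line relative to the horizontal, in degrees."""
    return math.degrees(math.atan2(right[1] - left[1], right[0] - left[0]))


def cobb_angle_deg(tilts_deg: Sequence[float]) -> float:
    """Angle between the two most oppositely tilted endplates."""
    return max(tilts_deg) - min(tilts_deg)


if __name__ == "__main__":
    # Hypothetical endplate tilts (degrees) along one curve.
    tilts = [-20.0, -5.0, 3.0, 15.0]
    print(cobb_angle_deg(tilts))
```

Taking the difference between the maximum and minimum tilt implicitly selects the two most tilted end vertebrae, which mirrors the manual definition for a single curve but does not handle double or triple curves.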
Regarding the clinical interpretation of model confidence scores, the present study reports confidence scores alongside predicted classes in the detection results; however, a systematic investigation of the relationship between confidence score ranges and clinical outcomes was beyond the scope of this work. The degree to which confidence scores correlate with diagnostic certainty, borderline case identification, or referral decision thresholds may vary depending on clinical context. A dedicated study examining the clinical significance of confidence score distributions across severity classes is currently underway by the research group and will be addressed in future work.
Mechanistically, the background-only negative samples were used as empty-label images that provided negative contextual exposure from the same radiographic domain. Because these samples did not contain foreground spine annotations, they were intended to help the model distinguish non-spinal radiographic regions from true spinal instances and suppress erroneous foreground predictions in background regions. The observed reduction in false negatives suggests that this negative-context learning strategy may have reduced augmentation-induced positional or background bias and encouraged the model to rely more strongly on spinal contour information. However, because a detailed loss-level decomposition was not performed in this study, the contribution of background-only negative samples to individual loss components should be interpreted qualitatively rather than as direct evidence from separate loss-branch analyses.
6. Conclusions
This study presents a YOLO-based instance segmentation framework for three-level scoliosis screening (normal, mild, and severe) using standing whole-spine radiographs. In conventional semantic segmentation-based approaches, scoliosis severity labels are learned as pixel-level classes, which can lead to class fragmentation, whereby different severity classes are predicted within the same spine. In the present study, the entire spinal contour was defined as a single anatomical instance, and severity was predicted on the basis of its global structural information, thereby alleviating this limitation.
To address class imbalance, class-balanced offline augmentation was applied. In addition, background-only negative samples constructed from non-spinal regions in radiographs were incorporated during training to mitigate the increase in false negatives observed after augmentation. Experimental results demonstrated that these background-only negative samples were effective in reducing false negatives, which is particularly important in screening-oriented applications.
At the network level, CBAM and the Swin Transformer block were compared to adapt YOLO-based segmentation to the characteristics of scoliosis radiographs, and U-Net and DeepLabV3+-based segmentation heads were also evaluated. The results showed that CBAM effectively improved accuracy and reduced false negatives with low computational overhead, whereas the DeepLabV3+-based head improved spinal contour restoration through multi-scale contextual modeling and boundary refinement. Ultimately, the proposed model combining CBAM with a DeepLabV3+-based head improved both spinal contour extraction and three-level severity screening performance relative to the baseline YOLO segmentation model.
Although the present study did not directly estimate the Cobb angle, it provides preliminary evidence that whole-spine instance segmentation may be useful for assisting three-level scoliosis screening. Future work should include external validation with multi-center datasets, prospective workflow evaluation, quantitative inter-observer agreement analysis, systematic comparison with semantic segmentation baselines, and extension toward vertebra-level segmentation and Cobb angle estimation.