Localization and Edge-Based Segmentation of Lumbar Spine Vertebrae to Identify the Deformities Using Deep Learning Models

The lumbar spine plays an important role in load transfer and mobility. Vertebrae localization and segmentation are useful for detecting spinal deformities and fractures. Automated understanding of medical images is important for helping doctors with time-consuming manual or semi-manual diagnosis. This paper presents methods that help clinicians grade the severity of the disease with confidence, since manual diagnoses by different doctors show dissimilarity and variation. We discuss lumbar spine localization and segmentation, which support the analysis of lumbar spine deformities. The lumbar spine is localized using YOLOv5, the fifth version of the YOLO family and a fast, lightweight object detector, which achieves a mean average precision (mAP) of 0.975. To diagnose lumbar lordosis, we correlated the angles with the region area computed from the YOLOv5 centroids and obtained 74.5% accuracy. Images cropped from the YOLOv5 bounding boxes are passed through HED U-Net, which combines segmentation and edge detection frameworks, to obtain the segmented vertebrae and their edges. Lumbar lordotic angles (LLAs) and lumbosacral angles (LSAs) are computed after detecting the corners of the vertebrae with a Harris corner detector, with small mean errors of 0.29° and 0.38°, respectively. This paper compares the different object detectors used to localize the vertebrae, the results of the two methods used to diagnose lumbar deformity, and our results with those of other researchers.


Introduction
Spine deformity can be present from birth or can result from aging, injury, or spine surgery. Road accidents are the main cause of spinal injuries because of the increasing number of motor vehicles. In 2013, the World Health Organization (WHO) presented key facts on spinal injuries and deformities, showing that every year almost 250,000 to 500,000 people suffer from spine issues [1]. According to the 2016 American Journal of Public Health [2], spine issues are the second leading cause of paralysis after stroke. The human spine consists of 26 vertebrae: the first seven, in the neck, are called cervical; the twelve in the torso are called thoracic; and the five in the lower back are called lumbar vertebrae, as shown in Figure 1. The other two are the sacrum and coccyx. The low back, also known as the lumbar region, starts below the rib cage [3]. The lumbar vertebrae, numbered L1-L5, are the largest and are more prone to deformity because they carry the weight of the body [4]. About 80% of the population experiences lower back pain at some point in their lives [5], and it is the third most common reason for doctor visits, costing Americans more than USD 50 billion each year [6]. Imaging tests can help doctors diagnose lumbar spine deformity and correlate it with pain symptoms. Diagnosing the deformity is a laborious task: clinicians rely on manual methods or on computer-assisted diagnosis tools, which support their decision-making, have improved clinical identification, and are less prone to errors. Conventional manual diagnosis is time-consuming and can vary between observers [7,8]. Automated systems based on artificial intelligence (AI) can reduce the diagnostic errors that arise in human clinical practice [9][10][11] and can assist clinicians in diagnosing spinal disorders.
Magnetic resonance imaging (MRI) and computed tomography (CT) scans are analyzed with machine learning (ML) techniques to detect various spinal disorders, helping surgeons and physicians diagnose disease without time-consuming manual methods. Timely diagnosis of spine deformities can spare the patient serious consequences and allows the disease to be treated at an early stage. MRI is good at detecting small disc herniations, compressed nerves, and soft-tissue issues, while CT is more useful for detecting moderate- and high-risk spinal fractures and injuries because it depicts bone structure clearly [13].
AI has become very popular in medical imaging in the past few years, and it helps clinicians and doctors diagnose various diseases. The growing number of imaging modalities supports clinicians in diagnosis, but processing these images efficiently and accurately remains difficult; AI has changed the way large amounts of such data are processed [14]. Our objective is to develop a diagnostic system, based on an object detector framework, that detects lumbar spine deformities using machine learning tools. Manual labeling is time-consuming, and automated methods can save doctors' valuable time.
The paper is organized as follows: Section 2 describes related work on the models used for vertebrae identification and localization; Section 3 describes the dataset and the detailed methodology of our proposed technique; Section 4 analyzes the results; and Section 5 presents the conclusion.

Related Work
AI is growing rapidly in medical imaging, and many researchers have developed automated systems that diagnose different diseases and help doctors choose less invasive surgical procedures. Many studies have focused on the lumbar spine because it is a common source of lower back pain. Due to heavy mechanical stress, slip often occurs at L4-L5 or L5-S1. Many approaches have been applied to detect and segment vertebrae and to identify various diseases, but researchers are still working on new and better techniques to diagnose these diseases more efficiently.
In [15], researchers detected lumbar spinal stenosis in MRI images. They worked on axial views and applied SegNet with different training ratios. Mabarki et al. [16] used convolutional neural networks (CNNs) based on the Visual Geometry Group 19 (VGG19) architecture to detect lumbar disc herniation and tested the system successfully on more than 200 patients. Ala et al. [17] developed a system to detect disc herniation using the centroid distance function as a shape feature and concluded that this feature is the best indicator of disc herniation in axial MRI scans. In [18], the authors worked on mid-sagittal MRI views using two segmentation techniques, a customized algorithm and semantic segmentation, and obtained good results in classifying spondylolisthesis and lumbar lordosis.
A cascaded fully convolutional network (FCN) was developed in [19]. The authors trained a 3D FCN, called the localization net, to obtain the lumbar shape, and then trained another 3D FCN, called the segmentation net, to segment the cropped lumbar region. The localization net helped the segmentation net segment the lumbar region correctly, and they reported a Dice coefficient of 95%. Liao et al. [20] worked on arbitrary CT images, which is a demanding task: because the images differ in shape and appearance, it is very difficult to segment and localize the vertebrae. They addressed the problem using both short-range and long-range contextual information: a 3D FCN extracts the short-range contextual features, and a bidirectional recurrent neural network encodes the long-range contextual information. Their method extracted better feature representations than previous methods by a notable margin on the Medical Image Computing and Computer-Assisted Intervention (MICCAI) Challenge dataset. In [21], a deep network was deployed to predict the centroid coordinates of vertebrae; using a public dataset of volumetric CT images, the authors obtained an accuracy of up to 90%.
Glocker et al. [22] developed a novel approach based on a regression tree. They used two datasets with a total of 424 CT scans covering different pathologies. Each classification forest consists of 20 trees trained to a maximum depth of 24. Their approach performs better than a regression forest + hidden Markov model (HMM) on pathological spine CT. Pisov et al. [23] used a publicly available chest dataset to detect early-stage osteoporosis. They proposed a two-step algorithm: the first step localizes the vertebral column in 3D CT images, and the second detects each vertebra and looks for fractures in 2D. They trained neural networks for both steps on a GPU using a simple six-keypoint annotation scheme, reporting errors of up to 1 mm and accuracy of up to 0.99. In [24], the authors presented their work at the Large Scale Vertebrae Segmentation Challenge (VerSe) 2019, where they used a human-machine hybrid algorithm with a high vertebrae identification rate of 95% and a Dice coefficient of 90%. They used three steps: the first detects the vertebrae, the second labels them using btrfly-Net [25], and the third segments them using U-Net. In [26], vertebrae segmentation and labeling were performed using an FCN combined with a memory component that retains information about already-segmented vertebrae. After segmenting one vertebra, the method searches for the next vertebra adjacent to it and predicts whether it is visible enough for further analysis. The method attained an accuracy of 93%, with only one mislabeled vertebra. Lecron et al. [27] developed an automatic approach to detect vertebrae without human involvement.
They obtained points of interest in radiographs by edge polygonal approximation and used scale-invariant feature transform (SIFT) descriptors to train a support vector machine (SVM) model. They reported promising results, with corner and vertebra detection accuracy rates of up to 90% and 86%, respectively. James et al. [28] proposed a system that detects vertebrae using 3D samples and identifies specific vertebrae using 2D slices; their results show accurate identification and localization of vertebrae.
Friska et al. [29] developed an automated system that measures the foraminal widths and anteroposterior diameter to diagnose lumbar spinal stenosis. They used SegNet to obtain six regions of interest in composite axial MRI images, and their results showed 97% agreement with specialists' opinions in identifying the severity of intervertebral disc herniation. A boundary detection method using dynamic programming was developed in [30]; the authors calculated the Euclidean distance between the boundaries detected by their method and manual labels of the lumbar spine and achieved a mean Euclidean distance of 3 mm. Ghosh et al. [31] proposed a system that uses two methods to detect and localize the intervertebral disc (IVD): it detects the IVD using different machine learning algorithms and segments all the tissues in lumbar sagittal MRI using different features and robust classifiers. Both methods achieved promising results. Gang et al. [32] proposed a novel approach that adds three CNN layers to You Only Look Once (YOLO)-tiny; their system detected spinal fractures with an accuracy of 85.63%. Zuzanna et al. [33] used YOLOv3 to detect different regions in the pelvic area. A modified YOLOv3 was developed in [34] to locate the IVD and detect disc herniation.
Bagus Adhi Kusuma [35] addressed the detection of scoliosis using X-ray images. The author converted the X-ray images to grayscale and marked seed locations that divide each image into 12 sub-images. Median filtering and Canny edge detection were then applied to obtain the vertebra boundaries. After the center points were computed, polynomial curve fitting was performed and the Cobb angle was estimated using a gradient equation. K-means clustering played a significant role in determining the scoliosis curve. The average deviation of the procedure is less than 6°. Yaling Pan et al. [36] used two separate mask region-based convolutional neural network (Mask R-CNN) models to segment and detect the spinal curve and all vertebral bones on 248 X-rays, and the Cobb angle was measured from the outputs of these models. By measuring the angle between every inferior and superior perpendicular of the cranial and caudal vertebrae, a set of all possible angles is obtained, and the maximum angle is taken as the Cobb angle. To assess reliability and accuracy, two experienced radiologists measured the Cobb angle separately. Comparing the model outputs with the manual measurements yielded intraclass and interclass correlation coefficients of 0.941 and 0.887, respectively.
In [37], Safari et al. developed a semi-manual approach for estimating the Cobb angle. Contrast stretching is used to extract the region of interest (ROI) in an input X-ray image. The curvature of the spine is determined by manually landmarking at least one point on each vertebra and fitting a fifth-order polynomial curve. After the morphologic curve is determined, the final phase estimates the Cobb angle using a tangent equation evaluated at the inflection points; the angle lies between two lines perpendicular to the spinal curve. The paper reports a correlation coefficient of 0.81 between the angle values. In [38], a new high-precision regression technique, the adaptive error correction net (AEC-Net), is introduced to evaluate the Cobb angle from spinal X-ray images. It has two modules: the first is a landmark regression net that extracts boundary features, which indirectly aid the Cobb angle calculation; the second is an angle net that calculates the Cobb angle directly from curve features. In the final stage, an error correction net combines the outputs of both modules, using extrapolation to identify the difference between the Cobb angles from the two networks. Evaluated on 581 spinal anterior-posterior X-ray images, the method attained a mean absolute error of 4.90° in the Cobb angle.
Kang Cheol Kim et al. [39] presented an approach to identify scoliosis from X-ray images and explained the drawbacks of manual measurements, which are laborious and time-consuming. The method has three parts: first, a confidence map is used for localization; second, a vertebral-tilt field estimates the slope of each vertebra; and third, the Cobb angle is measured from the vertebral centroids combined with the calculated vertebral-tilt field. The method achieved a circular mean absolute error (CMAE) of 3.51° and a symmetric mean absolute percentage error (SMAPE) of 7.84% for the Cobb angle. The main purpose of these works is to aid clinicians with the time-consuming task of manual image labeling.
Researchers have used different image processing and machine learning techniques to analyze the spine and identify different lumbar deformities, and deep learning has recently been used for this purpose as well. Automated analysis of lumbar deformities relies on accurate localization of vertebrae, since even a small variation in the centers can lead to false grading of deformities. In current state-of-the-art approaches, almost no research has been carried out on the localization of vertebrae; most treat the problem as segmentation, which generally struggles in the presence of noise and illumination changes. Recent advances in deep learning have produced more robust object localization techniques that are invariant to these changes, so these techniques can be used to localize vertebrae and further analyze spinal deformities. Keeping these gaps and challenges in mind, the contributions of this research work are as follows:

1. This paper presents an object detection framework for lumbar deformities and provides the research community with an annotated dataset in the sagittal plane with labels in YOLO format.
2. A major contribution of this research work is the use of an object detection/localization module for vertebrae localization, in contrast to current state-of-the-art methods, which are based on semantic segmentation.
3. Edge-based segmentation is applied to the localized vertebrae to diagnose the disease.
4. Furthermore, we provide automated methods to calculate the angles used to diagnose lumbar deformities, such as lumbar lordosis, and to grade them. These methods can serve as a decision support system that helps young radiologists grade the severity of lumbar deformities.

Materials
Several datasets have been developed by hospitals and challenges for diagnosing spinal deformities. The MRI Dataset [40] was collected from patients who reported symptomatic back pain between September 2015 and July 2016 at the Irbid Speciality Hospital in Jordan. Ref. [15] collected the MRI scans of 575 subjects but removed 60 scans due to noise and distortion. They kept only scans of patients aged at least 17, so that the lumbar spines have similar physiology. Ref. [40] provides annotations of axial views, contains both T1-weighted and T2-weighted scans, and has data from 515 subjects. This research was performed on the Lumbar Spine Composite Dataset [41], which is posted on Mendeley Data and contains sagittal MRI views of 514 subjects. The Lumbar Spine Composite Dataset is derived from the MRI Dataset: [18] annotated it in mid-sagittal views for more effective results, and [41] discarded one subject because of a noisy image, leaving 514 subjects. According to [42], bone shows the same contrast in T1- and T2-weighted scans, so [18] applied an empirical evaluation and selected either type for spinal measurement. The dataset includes spinal measurements, ground truth labels in which L1, L2, L3, L4, L5, and S are marked pixelwise, and pseudo-colored ground truth images for segmentation. Spinal measurements help surgeons suggest and select the appropriate surgical procedure. The images and labels have a resolution of 320 × 320 pixels. Figure 2 shows images from the composite dataset. The composite dataset was prepared as follows:

1. A DICOM viewer was used to read the images, and the mid-sagittal view was exported from the MRI Dataset.
2. After the sagittal view was obtained, the images were manually labeled from the lumbar spine to the first sacral vertebra.
3. The labeled regions were reviewed with radiologists and validated by expert surgeons.
4. Labels were assigned and pseudo-coloring was applied, with each vertebra represented by a different color.
5. Fully automated spinal measurements were performed, covering the lumbar lordotic angle (LLA), the lumbosacral angle (LSA), the dimensions, identification, and labeling of the lumbar spine and sacrum, the height of the lumbar spine, the dimensions of the discs, and spinal curve estimation.

Methods
Spine deformity can arise for many reasons, and severe cases can lead to lifelong paralysis; localization and segmentation techniques can be used to analyze such deformities. In this research, an automated lumbar spine analysis was performed to diagnose lumbar spine deformities. Our proposed approach has two stages: first, the vertebrae are localized using an object detector; second, the localized vertebrae are passed through an edge-based segmentation model to diagnose the severity of the lumbar spine disease. Figure 4 shows the flow of the proposed methodology. The dataset is augmented, annotated, and passed through YOLOv5 to obtain the localized vertebrae. L1 and S (for LLA) and L5 and S (for LSA) are extracted from the localized vertebrae and passed through holistically-nested edge detection (HED) U-Net [43] to obtain the edge-based segmentation. The images are then smoothed using a Gaussian smoothing filter with a sigma of 1, and the Harris corner detector [44] is applied to obtain the corners of the desired vertebrae, from which LLA and LSA are computed.

Localization
Localization identifies the location of objects in an image and draws a bounding box around each object [45]. The annotated dataset is passed through YOLOv5 to obtain the localized vertebrae, and the bounding box around each vertebra is used to crop the image.
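A minimal sketch of the cropping step, assuming the detected boxes have already been converted to pixel corners (x_min, y_min, x_max, y_max) and each slice is held as a NumPy array. The function name `crop_box` and its `margin` parameter are illustrative, not taken from the authors' code.

```python
import numpy as np

def crop_box(image, box, margin=0):
    """Crop a pixel box (x_min, y_min, x_max, y_max) from an H x W (x C) image array."""
    h, w = image.shape[:2]
    x_min, y_min, x_max, y_max = box
    # Round outward so the crop fully contains the box, then clamp to the image bounds.
    x0 = max(int(np.floor(x_min)) - margin, 0)
    y0 = max(int(np.floor(y_min)) - margin, 0)
    x1 = min(int(np.ceil(x_max)) + margin, w)
    y1 = min(int(np.ceil(y_max)) + margin, h)
    # NumPy indexes rows (y) first, then columns (x).
    return image[y0:y1, x0:x1]

image = np.zeros((320, 320), dtype=np.uint8)     # placeholder for one MRI slice
boxes = {"L1": (100.2, 50.0, 140.7, 80.0)}        # hypothetical detection
crops = {name: crop_box(image, b) for name, b in boxes.items()}
print(crops["L1"].shape)   # (30, 41)
```

Clamping matters mainly for the sacrum and L1, whose boxes can touch the image border after augmentation.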

Preprocessing
Preprocessing is the first step before training the model. Images are auto-oriented and resized to 416 × 416. The other preprocessing steps are data augmentation, data labeling, and annotation.

Data Augmentation
Augmentation is necessary to reduce overfitting and increase the variability of the dataset [46]. The augmentation steps are noise addition, image flipping, 90° rotation, image cropping, rotation, and image shearing:
Noise: salt-and-pepper noise is added to 5% of the image pixels.
Flip: images are flipped horizontally.
90° Rotate: images are rotated 90° clockwise, 90° counterclockwise, or upside down.
Crop: images are cropped with a minimum zoom of 0% and a maximum zoom of 20%.
Rotate: images are rotated between −23° and +23°.
Shear: images are sheared by ±15° horizontally and vertically.
After augmentation, the number of images increased from 514 to 1028, and each augmented image receives different parameter values for each step.
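The noise, flip, and 90° rotation steps can be sketched with NumPy as follows. This is an illustrative reimplementation, assuming 8-bit grayscale images; the function names are hypothetical, and the text does not state which tool applied the augmentations.

```python
import numpy as np

def salt_and_pepper(image, amount=0.05, rng=None):
    """Set a fraction `amount` of distinct pixels to white (salt) or black (pepper)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    h, w = image.shape[:2]
    n = int(round(amount * h * w))
    idx = rng.choice(h * w, size=n, replace=False)     # distinct pixel positions
    rows, cols = np.unravel_index(idx, (h, w))
    salt = rng.random(n) < 0.5                          # roughly half white, half black
    noisy[rows[salt], cols[salt]] = 255                 # assumes 8-bit images
    noisy[rows[~salt], cols[~salt]] = 0
    return noisy

def flip_horizontal(image):
    return image[:, ::-1]

def rotate90(image, direction):
    # np.rot90 rotates counterclockwise for k=1, clockwise for k=-1, upside down for k=2.
    k = {"counterclockwise": 1, "clockwise": -1, "upside_down": 2}[direction]
    return np.rot90(image, k=k)
```

Geometric augmentations must also be applied to the bounding-box labels; for a horizontal flip, for instance, the normalized YOLO center x becomes 1 − x, while y, width, and height are unchanged.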

Data Labeling and Annotations
To label the dataset, we used LabelMe and Roboflow Annotate; the latter can be used to label a whole dataset or to correct existing annotations [47]. Each image contains six annotations of a single class in YOLO format. A YOLO-format label is a .txt file with the same name as the image, containing one line per object with the class, the x and y coordinates of the object's center, and the object's width and height, with coordinates normalized by the image dimensions. The class name is V (for vertebra), and the six labels cover the five lumbar vertebrae and the sacrum in the sagittal view, as shown in Figure 5.
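Reading such a label file back into pixel coordinates is a small conversion. The sketch below assumes the single class V is stored as class index 0; the helper names are hypothetical.

```python
def parse_yolo_line(line, img_w, img_h):
    """Convert 'class x_center y_center width height' (normalized) to pixel corners."""
    cls, xc, yc, bw, bh = line.split()
    xc, bw = float(xc) * img_w, float(bw) * img_w     # x values scale with image width
    yc, bh = float(yc) * img_h, float(bh) * img_h     # y values scale with image height
    return int(cls), xc - bw / 2, yc - bh / 2, xc + bw / 2, yc + bh / 2

def read_yolo_labels(text, img_w, img_h):
    """Parse every non-empty line of a YOLO .txt label file."""
    return [parse_yolo_line(line, img_w, img_h) for line in text.splitlines() if line.strip()]

label_text = "0 0.5 0.25 0.125 0.0625\n0 0.5 0.75 0.125 0.0625\n"   # made-up example
print(read_yolo_labels(label_text, 320, 320))
# [(0, 140.0, 70.0, 180.0, 90.0), (0, 140.0, 230.0, 180.0, 250.0)]
```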

Training
The annotated dataset was trained on YOLOv5 with a batch size of 32 for 90 epochs. We started training with 5 epochs and a batch size of 2 and adjusted these values until we obtained the best results. The data are divided into 85% training, 10% validation, and 5% testing. YOLOv5 detected the lumbar vertebrae and sacrum with high confidence scores. YOLO (You Only Look Once) was developed in 2015 as an object detection system that uses a single convolutional neural network. The YOLO algorithm became very popular because of its high speed and accuracy. YOLO reframes object detection as a single regression problem, predicting bounding boxes and class probabilities directly from image pixels. Because of its good results in detecting object coordinates, it stood out from other object detection algorithms at the time of its release [48].
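An 85/10/5 split can be sketched with the standard library. The rounding rule and the fixed seed are assumptions; the text gives only the percentages.

```python
import random

def split_dataset(items, train=0.85, val=0.10, seed=0):
    """Shuffle and split into train/val/test; the test share is whatever remains (5% here)."""
    items = list(items)
    random.Random(seed).shuffle(items)        # fixed seed for a reproducible split
    n_train = int(round(train * len(items)))
    n_val = int(round(val * len(items)))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train_set, val_set, test_set = split_dataset(range(1028))
print(len(train_set), len(val_set), len(test_set))   # 874 103 51
```

Augmented copies of one subject are near-duplicates. Splitting by subject before augmentation therefore keeps them from appearing in both training and test sets.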
Average precision (AP) is used to evaluate the accuracy of YOLOv5; it averages the precision values over recall values from 0 to 1. Precision is defined as the ratio of true positive cases to the total number of positive predictions:

$$P = \frac{\text{True Positive Cases}}{\text{Total Positive Predictions}} \quad (1)$$

Recall is defined as the ratio of true positive cases to the total number of actual positive cases:

$$R = \frac{\text{True Positive Cases}}{\text{Total Actual Positive Cases}}$$
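The following sketch shows how precision, recall, and AP relate, using the common all-point interpolation and assuming each detection has already been matched against the ground truth (e.g. at IoU ≥ 0.5). It is not YOLOv5's internal implementation, and the function name is illustrative. With a single class (V), the mean over classes reduces to this AP at the chosen IoU threshold.

```python
import numpy as np

def precision_recall_ap(is_tp, scores, n_ground_truth):
    """Precision/recall curve and all-point interpolated AP for a single class.

    is_tp[i] is 1 if detection i matched a not-yet-matched ground-truth box, otherwise 0;
    scores[i] is its confidence.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))     # highest confidence first
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)    # Eq. (1): TP / all positive predictions
    recall = cum_tp / n_ground_truth          # TP / all actual positives
    # Pad the curve, then replace each precision by the maximum precision at any higher recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum the rectangle areas wherever recall changes.
    steps = np.where(mrec[1:] != mrec[:-1])[0]
    ap = float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))
    return precision, recall, ap
```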
YOLOv5 has four models: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. We used the smallest and fastest model, YOLOv5s, which has a size of 14 MB and an inference time of 2.2 ms. The YOLOv5 network architecture is shown in Figure 6. It consists of three stages:

1. Backbone (CSPNet): YOLOv5 incorporates the cross-stage partial network (CSPNet) [49] into Darknet, making CSPDarknet its backbone. CSPNet reduces the model's floating-point operations (FLOPs) and parameters by solving the problem of repeated gradient information in large-scale backbones and integrating the gradient changes into the feature map. This reduces the size of the model while preserving inference accuracy and speed.
2. Neck (PANet [50]): PANet improves the flow of information through the network. Low-level features are propagated efficiently by augmenting the feature pyramid network (FPN) with a bottom-up path. Adaptive feature pooling links the feature grid to all feature levels, so that useful information from each level is passed on and the network decides which features are useful. Using the accurate localization signals in lower layers increases the accuracy of object localization [51].
3. Head (YOLO layer): the last layer of YOLOv5 produces the detections in the form of bounding boxes, confidence scores, and class probabilities. It uses feature maps of three sizes, 72 × 72, 36 × 36, and 18 × 18, to detect small, medium, and large objects, respectively [51].

Edge-Based Segmentation and Identification
Binary masks were created so that the localized vertebrae could be passed through HED U-Net to obtain the segmented images and their detected edges. The extracted images are smoothed, corners are found, and lines are drawn to find the angles. We cropped L1, L5, and S using the bounding boxes, and the cropped images with their binary masks were passed through HED U-Net. The best results were obtained with a batch size of 8, 10 epochs, a learning rate of 0.001, and the Adam optimizer. Heidler et al. [43] developed HED U-Net, which combines segmentation and edge detection frameworks by unifying U-Net [52] for semantic segmentation and HED [53] for edge detection in a natural way. In our case, segmentation has two class labels, vertebra and background, and edge detection has two classes, "no edge" and "edge". Figure 7 shows a high-level representation of HED U-Net: the image passes through the encoder, which downsamples it to accumulate contextual information at low resolution, and the decoder then upsamples it, distributing this information to individual pixels. Element-wise addition merges the data flows in the decoder. The model uses a feature pyramid with six resolution levels, from the finest feature map at full image resolution to the coarsest at 1/32 resolution [43]. The model is deeply supervised, i.e., trained to predict the ground truth at every level of the feature pyramid. Deep supervision increases the generalization capacity and learning efficiency of a model: the network obtains better results and encodes meaningful features in the deep layers by making use of the available receptive field. After the encoder-decoder computes the feature pyramid, two task-specific merging heads combine this information using a hierarchical attention mechanism. Rather than fusing the features with fixed weights, this mechanism focuses on the most useful features for each pixel [43].
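The idea of attention-based merging, as opposed to fixed-weight fusion, can be illustrated with a simplified NumPy sketch. Per-level predictions, already upsampled to full resolution, are combined with per-pixel softmax weights over the levels. In HED U-Net the attention logits are predicted by the network; here they are plain inputs, the function names are hypothetical, and this is not the reference HED U-Net code [43].

```python
import numpy as np

def softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_merge(level_preds, attention_logits):
    """Merge pyramid-level predictions with per-pixel attention weights.

    level_preds:      (L, C, H, W) predictions from L levels, upsampled to full resolution
    attention_logits: (L, H, W) one score per level and pixel
    Returns (C, H, W): at every pixel, a convex combination of the level predictions.
    """
    weights = softmax(attention_logits, axis=0)           # (L, H, W), sums to 1 over levels
    return (weights[:, None] * level_preds).sum(axis=0)   # broadcast over channels C
```

With all logits equal this reduces to a plain average of the levels, i.e. fixed equal weights; the attention version lets each pixel favor the level that is most useful there.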

Figure 7. High-level structure of HED U-Net. The encoder and decoder compute the feature map pyramid, and task-specific merging heads combine this information using hierarchical attention maps [43].

Image Smoothing and Corner Calculation
After the images are passed through HED U-Net, we obtain the segmented image and the detected edges; the images were cropped using the bounding boxes received from YOLOv5. We then compute the LLA and LSA. LLA is the angle between the L1 superior endplate and the S superior endplate, while LSA is the angle between the L5 inferior endplate and the S superior endplate, as shown in Figure 8; we therefore used the bounding boxes of L1 and S for LLA and of L5 and S for LSA to crop the desired vertebrae. The vertebra images are passed through a Gaussian smoothing filter with a sigma of 1, which helps the Harris corner detector [44] find the corners accurately. The Harris corner detector is applied to the smoothed images to obtain the corners of L1 and S for LLA and of L5 and S for LSA.
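A self-contained NumPy sketch of Gaussian smoothing (σ = 1) followed by the Harris response and 3 × 3 peak picking. The Harris constant k = 0.05, the integration scale, and the relative threshold are assumptions, because the text does not give them. The authors' actual implementation is not stated either; libraries such as OpenCV or scikit-image provide equivalent functions.

```python
import numpy as np

def gaussian_smooth(img, sigma=1.0):
    """Separable Gaussian filter with reflect padding (kernel radius = ceil(3 * sigma))."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=float), radius, mode="reflect")
    # Filter along rows, then along columns; "valid" mode removes the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def harris_response(img, sigma=1.0, k=0.05):
    """Harris measure R = det(M) - k * trace(M)^2 of the Gaussian-weighted structure tensor M."""
    smoothed = gaussian_smooth(img, sigma)      # pre-smoothing with sigma = 1, as in the text
    iy, ix = np.gradient(smoothed)              # derivatives along rows (y) and columns (x)
    sxx = gaussian_smooth(ix * ix, sigma)
    syy = gaussian_smooth(iy * iy, sigma)
    sxy = gaussian_smooth(ix * iy, sigma)
    return sxx * syy - sxy**2 - k * (sxx + syy) ** 2

def harris_corners(img, sigma=1.0, k=0.05, rel_threshold=0.1):
    """Return (row, col) positions that are 3x3 local maxima of R above rel_threshold * max(R)."""
    r = harris_response(img, sigma, k)
    h, w = r.shape
    padded = np.pad(r, 1, mode="constant", constant_values=-np.inf)
    neighbors = np.stack([padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])
    peaks = (r >= neighbors.max(axis=0)) & (r > rel_threshold * r.max())
    return np.argwhere(peaks)
```

To obtain the two superior-endplate corners of L1 or S, one could, for example, keep the two strongest corners in the upper part of the crop. This selection rule is an assumption, since the text does not specify how the endplate corners are chosen.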

Angles Computation
For LLA, one line is drawn through the superior corners of L1 and another through the superior corners of S; for LSA, one line is drawn through the inferior corners of L5 and another through the superior corners of S, as shown in Figure 9. The slope of each line is found from the following formula:

$$m = \frac{p_{2y} - p_{1y}}{p_{2x} - p_{1x}}$$

where $m$ is the slope of the line, $p_{1y}$ and $p_{2y}$ are the y-coordinates, and $p_{1x}$ and $p_{2x}$ are the x-coordinates of the two points. The angle between the lines is found from the following expression:

$$\theta = \tan^{-1}\left|\frac{m_{L1} - m_{S}}{1 + m_{L1}\, m_{S}}\right|$$

where $m_{L1}$ is the slope of the L1 superior endplate line and $m_{S}$ is the slope of the S superior endplate line. The same equations are used to find the slopes and the angle for LSA.
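A direct transcription of the slope and angle formulas, plus an equivalent `atan2` variant that also copes with vertical lines (where the slope is undefined). The corner coordinates in the usage example are made up, and the function names are illustrative.

```python
import math

def slope(p1, p2):
    """m = (p2y - p1y) / (p2x - p1x); undefined (ZeroDivisionError) for vertical lines."""
    return (p2[1] - p1[1]) / (p2[0] - p1[0])

def angle_from_slopes(m1, m2):
    """Acute angle in degrees between two lines: arctan|(m1 - m2) / (1 + m1 * m2)|."""
    denom = 1 + m1 * m2
    if denom == 0:
        return 90.0                      # perpendicular lines
    return math.degrees(math.atan(abs((m1 - m2) / denom)))

def angle_between_lines(a1, a2, b1, b2):
    """Same acute angle from line orientations via atan2 (works for vertical lines too)."""
    ta = math.degrees(math.atan2(a2[1] - a1[1], a2[0] - a1[0])) % 180.0
    tb = math.degrees(math.atan2(b2[1] - b1[1], b2[0] - b1[0])) % 180.0
    d = abs(ta - tb)
    return min(d, 180.0 - d)

# Hypothetical superior-endplate corners (x, y) of L1 and S in image coordinates.
l1_corners = ((10, 20), (40, 26))
s_corners = ((5, 50), (45, 30))
lla = angle_from_slopes(slope(*l1_corners), slope(*s_corners))
print(round(lla, 2))   # 37.87
```

Flipping the y axis (image rows grow downward) changes the sign of both slopes but not the magnitude of the angle between the lines.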

Evaluation Metrics
The proposed approaches were trained on an RTX 2070 GPU with 16 GB RAM and implemented in Python using PyCharm and Anaconda, with Python libraries such as torch, torchvision, PyYAML, and scipy. To measure the localization performance, we used the Euclidean distance (EU) and intersection over union (IOU) metrics. EU measures the distance between the centroids:

$$EU = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

where $x_1$ and $y_1$ are the centroid coordinates of the ground truth box, and $x_2$ and $y_2$ are those of the predicted box. Each image yields six EU values, one for each labeled vertebra. A smaller EU indicates a smaller distance between the centroids of the predicted and ground truth boxes. All measurements taken on the images are in mm. IOU measures the overlap of the ground truth box and the predicted box:

$$IOU = \frac{\operatorname{area}(B_{gt} \cap B_{pred})}{\operatorname{area}(B_{gt} \cup B_{pred})} \quad (3)$$

IOU is the ratio between the area of overlap of the two boxes and the area of their union, as shown in Equation (3). The larger the IOU, the greater the overlap of the two boxes; for example, an IOU of 0.95 means that the overlapping area is 95% of the area of the union. The mean error (ME) between the angles is found as follows:

$$ME = \frac{1}{K}\sum_{i=1}^{K}\left|a_{pred,i} - a_{orig,i}\right|$$

ME is the mean absolute error of the angles in degrees, computed between the predicted and ground truth angles; a smaller ME indicates less error. $K$ is the total number of images, $a_{pred}$ is the predicted angle, and $a_{orig}$ is the ground truth value of the angle.
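The three evaluation metrics, together with the box centroid used for EU, fit in a few plain-Python functions. Boxes are assumed to be (x_min, y_min, x_max, y_max). Because the text reports EU in millimeters, the sketch multiplies by a hypothetical `mm_per_pixel` factor (assumed isotropic, e.g. read from the image metadata).

```python
import math

def centroid(box):
    """Center (x_c, y_c) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def euclidean(c1, c2, mm_per_pixel=1.0):
    """Centroid distance; mm_per_pixel (assumed isotropic) converts pixels to millimeters."""
    return math.hypot(c2[0] - c1[0], c2[1] - c1[1]) * mm_per_pixel

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_error(pred_angles, true_angles):
    """ME = (1/K) * sum |a_pred - a_orig| over K images, in degrees."""
    return sum(abs(p - t) for p, t in zip(pred_angles, true_angles)) / len(pred_angles)
```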

Results and Discussion
Various deep learning techniques have been applied to this dataset; we used an object detector to localize the lumbar spine and sacrum, and the LabelMe Python package to annotate the data, which we saved in YOLO format. The vertebrae were localized with high confidence scores. An empirical threshold of 0.65 was applied to the confidence score to eliminate boxes with lower scores, and the non-maximum suppression (NMS) intersection over union (IOU) threshold was set to 0.3 on the test data.
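The confidence threshold and the NMS step can be sketched with the textbook greedy NMS procedure. This is not YOLOv5's own implementation, and the function name is illustrative; boxes whose IOU with an already-kept, higher-scoring box exceeds 0.3 are suppressed.

```python
import numpy as np

def filter_and_nms(boxes, scores, conf_thresh=0.65, iou_thresh=0.3):
    """Drop low-confidence boxes, then apply greedy non-maximum suppression.

    boxes: (N, 4) array of (x_min, y_min, x_max, y_max); returns indices of the kept boxes.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    candidates = np.where(scores >= conf_thresh)[0]
    order = candidates[np.argsort(-scores[candidates])]    # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_thresh]   # discard boxes that overlap the kept one too much
    return keep
```

A low IOU threshold such as 0.3 suits this task because neighboring vertebrae barely overlap, so two heavily overlapping boxes almost certainly describe the same vertebra.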
Bounding boxes are drawn around each vertebra to find the centroids of the boxes. The ground truth and predicted centroids are computed as follows:

$$x_c = \frac{x_{min} + x_{max}}{2}, \qquad y_c = \frac{y_{min} + y_{max}}{2}$$

where $x_{min}$, $x_{max}$, $y_{min}$, and $y_{max}$ are the x and y coordinates of the bounding box edges. The ground truth and predicted centroids are shown in green and red, respectively, in Figure 10.
After training and testing the model, the ground truth and predicted bounding boxes were compared using the EU between their centroids. For each vertebra, the mean EU and mean IOU over the test images are computed as

$$\overline{EU} = \frac{1}{K}\sum_{k=1}^{K} EU_k, \qquad \overline{IOU} = \frac{1}{K}\sum_{k=1}^{K} IOU_k$$

where $EU_k$ and $IOU_k$ are the values for that vertebra in the k-th image and K is the total number of test images. Table 1 shows the mean and standard deviation of the EU for the six vertebrae. The distance between the centroids is in millimeters; the sacrum has the highest mean EU due to its tilted structure, while L1 has the lowest mean EU of 1.6 mm. A low mean indicates a small distance between the centroids. Table 2 shows the mean and standard deviation of the IOU of each vertebra. Every vertebra has a high mean IOU and a low standard deviation; a high mean indicates greater overlap between the bounding boxes.

Figure 11 shows the precision and recall of the YOLOv5 object detector as the number of epochs increases from 0 to 90. Mean average precision (mAP) is the mean of the per-class average precision (AP), and we obtained an mAP of 0.975 using YOLOv5s. The predictions of the trained YOLOv5 weights can be visualized as heat maps before non-maximum suppression is applied, as shown in Figure 12.

Figure 13 shows boxplots of the calculated and ground truth angles for the three classes, where the calculated angles are obtained from the corners of the vertebrae segmented by HED U-Net. Normal LLA ranges from 39° to 53° [54], so angles below 39° fall in the range of hypolordosis, labeled class 0. Normal lordosis is class 1, and class 2 is hyperlordosis, with angles above 53°. The boxplot shows very small errors between the calculated and ground truth angles. According to [55][56][57], a measurement error of up to ±3° to ±5° is clinically acceptable. The mean error and its standard deviation for LLA and LSA are shown in Table 3. The LLA and LSA calculated by our method have very small mean errors of 0.29° and 0.38°, respectively. The confusion matrix for this technique is shown in Table 4; consistent with the low mean errors, there are no failure cases in any class.
The method classified all 51 test subjects correctly.
Table 3. Mean error and standard deviation (std) of LLA and LSA. LLA has a lower mean error and std than LSA.
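The three-class labelling by LLA and the confusion-matrix evaluation can be sketched as below. The 39° and 53° bounds come from the text, and angles exactly at 39° or 53° count as normal, following the "less than" and "above" wording; the function names and the example angles are hypothetical and are not data from this study.

```python
from typing import List, Sequence

HYPO, NORMAL, HYPER = 0, 1, 2  # class 0, class 1, class 2 as defined in the text


def lordosis_class(lla_deg: float, low: float = 39.0, high: float = 53.0) -> int:
    """Map a lumbar lordotic angle in degrees to class 0/1/2 using the 39-53 degree normal range."""
    if lla_deg < low:
        return HYPO
    if lla_deg > high:
        return HYPER
    return NORMAL


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int = 3) -> List[List[int]]:
    """Rows are ground truth classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of cases on the diagonal of the confusion matrix."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))


# Hypothetical ground truth and predicted angles, for illustration only.
true_angles = [35.0, 45.0, 60.0, 52.5]
pred_angles = [35.4, 44.8, 59.7, 53.3]
cm = confusion_matrix([lordosis_class(a) for a in true_angles],
                      [lordosis_class(a) for a in pred_angles])
print(cm, accuracy(cm))  # [[1, 0, 0], [0, 1, 1], [0, 0, 1]] 0.75
```

The made-up example shows how a 0.8° error near the 53° boundary is enough to flip a normal case to hyperlordosis, so small angle errors matter most for subjects close to the class boundaries.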

Comparison of Models
We compared YOLOv5 with other object detection models in Table 5. The region-based fully convolutional network (R-FCN) [58] uses a residual network (ResNet) backbone to detect objects; with R-FCN, we achieved an mAP of 0.894 in detecting the lumbar vertebrae and the sacrum. SSD513 [59], the single shot detector with 513 × 513 inputs, detects the vertebrae with an mAP of 0.925. Lin et al. [60] used a feature pyramid network (FPN) architecture in Faster R-CNN to detect objects more accurately; we obtained an mAP of 0.942 using FPN FRCN, the second highest value in Table 5. YOLOv3, the third variant of the YOLO family, is reported to be three times faster than SSD [61]; it achieved an mAP of 0.917 in detecting the vertebrae. YOLOv5 surpassed the other state-of-the-art methods, with an mAP 0.033 higher than that of FPN FRCN.
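Since all detectors in Table 5 are ranked by mAP, the sketch below shows one common way to compute per-class average precision (the all-point interpolated area under the precision-recall curve, as in PASCAL VOC) and average it over classes. It assumes each detection has already been marked as a true or false positive by IOU matching against the ground truth; the evaluation code behind the reported values (for example YOLOv5's validation script) may interpolate the curve differently, so this is illustrative only, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(ranked_tp: List[bool], n_gt: int) -> float:
    """All-point interpolated AP for one class.

    ranked_tp: one flag per detection, sorted by descending confidence,
               True if the detection matched a not-yet-matched ground truth box.
    n_gt: number of ground truth boxes of this class.
    """
    if n_gt <= 0:
        raise ValueError("n_gt must be positive")
    tp = fp = 0
    recall: List[float] = []
    precision: List[float] = []
    for is_tp in ranked_tp:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall.append(tp / n_gt)
        precision.append(tp / (tp + fp))
    # Sentinel values at both ends of the curve.
    mrec = [0.0] + recall + [1.0]
    mpre = [0.0] + precision + [0.0]
    # Precision envelope: make precision non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Area under the step-shaped curve (terms vanish where recall does not change).
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


def mean_average_precision(per_class: Dict[str, Tuple[List[bool], int]]) -> float:
    """mAP: the mean of the per-class AP values."""
    aps = [average_precision(flags, n_gt) for flags, n_gt in per_class.values()]
    return sum(aps) / len(aps)
```

For instance, detections ranked as [hit, miss, hit] against two ground truth boxes give an AP of 5/6: recall reaches 0.5 at precision 1 and then 1.0 at precision 2/3.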

Comparison of Approaches for Lumbar Lordosis Assessment
We used two approaches to diagnose lumbar lordosis. The first assesses lumbar lordosis by finding the angle from the corners of the vertebrae, while the second assesses it through region area, where the area is correlated with the angle values to classify the disease. A block diagram of the region-area-based method is shown in Figure 14. The technique consists of preprocessing, training of the YOLOv5 model, vertebrae localization, centroid calculation, and area computation from the centroids. The centroids of the localized vertebrae are connected to form an enclosed region whose area is used to diagnose the severity of the disease.
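The region-area computation can be sketched as follows, assuming the six centroids are ordered L1, L2, L3, L4, L5, S so that the closing edge runs from S back to L1, and assuming that summing the non-zero pixels means counting the pixels of a filled polygon mask, with each pixel contributing one unit of area. Pixels are tested with an even-odd ray-casting rule at their centres; the shoelace formula is included only as a cross-check, and all names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) centroid in pixel coordinates


def _inside(px: float, py: float, poly: List[Point]) -> bool:
    """Even-odd ray casting: is (px, py) inside the closed polygon?"""
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]  # last edge closes S -> L1
        if (y1 > py) != (y2 > py):  # edge straddles the horizontal ray (so y1 != y2)
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def region_area_pixels(centroids: List[Point]) -> int:
    """Count pixels whose centres lie inside the polygon L1 -> ... -> L5 -> S -> L1."""
    xs = [x for x, _ in centroids]
    ys = [y for _, y in centroids]
    count = 0
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):
        for x in range(math.floor(min(xs)), math.ceil(max(xs))):
            # Each non-zero pixel contributes one unit of area (one-pixel spacing).
            if _inside(x + 0.5, y + 0.5, centroids):
                count += 1
    return count


def shoelace_area(centroids: List[Point]) -> float:
    """Exact area of the same polygon, as a sanity check on the pixel count."""
    n = len(centroids)
    s = sum(centroids[i][0] * centroids[(i + 1) % n][1]
            - centroids[(i + 1) % n][0] * centroids[i][1] for i in range(n))
    return abs(s) / 2
```

Because the area depends on every centroid at once, a small error in any single centroid shifts the whole polygon, which is one reason this approach is more fragile than measuring angles directly.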
Ref. [18] proposed a method to diagnose lordosis on the basis of the area of the enclosed region, combining the techniques of [62,63] to obtain the area under the curve from the centroids. We computed the area from the centroids of the YOLOv5 bounding boxes. The centroid of L1 is connected to L2, L2 to L3, and so on, and finally the centroid of S is connected back to L1 to form an enclosed region. The normal range for lordosis is 39° to 53° [54]; angles below 39° are termed hypolordosis, while angles above 53° are termed hyperlordosis. The area of the enclosed region is correlated with the angles to diagnose the disease. The area is found by summing all the non-zero pixels in the image:

$$Area_{region} = \sum_{i} zz_i \, \Delta zz$$

where $Area_{region}$ is the area of the region, $zz_i$ are the non-zero pixels in the image, and $\Delta zz$ is the interval between pixels, equivalent to one pixel. This method was proposed by [18] to diagnose lordosis from the centroids instead of the corner points. Figure 15 shows three images, one for each lordosis type. Hypolordosis includes straight back and flat back cases, while hyperlordosis includes sway back. The region area of normal lordosis is larger than that of hypolordosis and smaller than that of hyperlordosis [18]. We obtained 74.5% accuracy using this technique. Table 6 shows the confusion matrix for this technique; of the 51 test cases, 6 subjects have hypolordosis, 19 have normal lordosis, and 26 have hyperlordosis. Hypolordosis is classified correctly, while some normal and hyperlordosis cases are misclassified.
Figure 14. Region-area-based lumbar lordosis identification block diagram. The preprocessed dataset is passed through the localization network with its labels. Vertebrae are localized with high confidence scores, and their centroids are calculated and joined to form a region whose area indicates the severity of lumbar lordosis.
The summary of all the results is shown in Table 7. The results show that lumbar lordosis assessment through corners classifies the classes more accurately than assessment through region area: our proposed approach outperforms the region-area-based approach by 25.5 percentage points. The first method uses both localization and edge-based segmentation to diagnose the disease, while the region-area-based method uses only the localization part and then calculates the centroids of the vertebrae to compute the area of the enclosed region. The accuracies of both techniques are shown in Table 7.
Table 7. Comparison of the two methods used to diagnose lumbar lordosis. Lumbar lordosis assessment through corners outperforms assessment through region area. The same localization part was used for both methods; therefore, the mAP is the same for both.

Comparison with Other Researchers
We also compared our proposed model with those of other researchers. Our methodology uses YOLOv5 for localization and HED U-Net for edge-based segmentation. Masood et al. [41] used ResNet-UNet to segment the lumbar vertebrae, Suri et al. [64] used three neural networks, one for each imaging modality, to compute the LLA, and Cho et al. [65] used semantic segmentation with a UNet model. We obtained very small mean errors of LLA and LSA compared with [41,64,65]. Our method and that of Masood et al. were both applied to the Composite Lumbar Spine MRI Dataset [41]. As the results show, our object detection and edge-based segmentation model outperforms other state-of-the-art models that use semantic segmentation; this system can help clinicians and support their decisions in diagnosing the disease. Table 8 compares our results with those of other researchers. Our technique achieves MEs of 0.29° and 0.38° for LLA and LSA, which are 2.32° and 1.63° lower than the corresponding MEs of the approach of Masood et al. Clinicians often use various time-consuming manual or semi-manual methods to calculate Cobb angles and diagnose vertebral diseases. Our method computes Cobb angles automatically with a very small mean error. According to [66], clinicians take around 18.96 s to calculate the Cobb angles, while our method returns the Cobb angles within 5 s on a 10th gen Intel Core i7 with an Nvidia GeForce RTX 3060. To our knowledge, an object detector has not previously been used to localize the vertebrae for this purpose; we use one for localization and then edge-based segmentation to diagnose disease severity. Because no object detection framework for diagnosing spinal disorders exists, a direct comparison is difficult. We therefore also compared YOLOv5 with other object detectors, i.e., R-FCN, SSD513, FPN FRCN, and YOLOv3, and achieved the highest mAP with YOLOv5.
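Once the corner detector has returned the corner points of the relevant endplates, the Cobb-style angle between two endplate lines can be computed as sketched below. Which endplates define the LLA and LSA (for instance, the superior endplates of L1 and S1 for the LLA in one common convention) is an assumption left to the caller here, as are the function names; the sketch also folds the result into 0° to 90°, which covers typical lordotic angles.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) corner in pixel coordinates


def line_angle_deg(p: Point, q: Point) -> float:
    """Orientation of the line through p and q in degrees, folded into [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def cobb_angle(endplate_a: Tuple[Point, Point], endplate_b: Tuple[Point, Point]) -> float:
    """Angle between two endplate lines, each given by its two corner points (0 to 90 degrees)."""
    d = abs(line_angle_deg(*endplate_a) - line_angle_deg(*endplate_b))
    # Lines have no direction, so an angle d and 180 - d describe the same pair of lines.
    return min(d, 180.0 - d)


print(round(cobb_angle(((0, 0), (10, 0)), ((0, 0), (10, 10))), 6))  # 45.0
```

Folding each line's orientation into [0, 180) makes the result independent of which corner (anterior or posterior) is listed first.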
The region-area-based method tends to fail because it depends on the centroids of multiple vertebrae; a small error in any centroid affects the whole area, which can lead to misclassification. The Cobb angle depends on the endplate structures of the vertebrae, and its computation often fails when an abnormally shaped vertebra appears in the test phase, as the model was trained on a dataset containing mostly healthy vertebrae. Our primary focus was to develop a fully automated system to diagnose the disease, and little attention was given to abnormally deformed vertebrae.

Conclusions
In this paper, we created YOLO annotations for the Lumbar Spine Composite Dataset [41], which consists of mid-sagittal MRI scans. We proposed a method to localize the lumbar spine and the sacrum using YOLOv5 and computed the centroids of the bounding boxes. Compared with the ground truth centroids, the predicted centroids have a very low mean error, and the boxes have a high IOU for each lumbar vertebra and the sacrum. The region area computed from the centroids determines the lumbar lordosis severity, i.e., sway back, normal, or flat back, which can help young doctors as a decision support system. The localized vertebrae are passed through HED U-Net to obtain the vertebrae and their edges. The LLA and LSA are found by detecting the corners of the vertebrae with a Harris corner detector, with very small mean errors and standard deviations. The experiments show that images should be smoothed before the corners of the vertebrae are found, since corner detection on unsmoothed images is difficult; for this purpose, a Gaussian smoothing filter with a sigma of 1 was applied. The LLA has a mean error of 0.29°, which means that lumbar lordosis (hypo, normal, and hyper) is detected very efficiently. In the future, this work can be extended to diagnose deformities of the cervical and thoracic spine and the pelvic region. Another direction is to develop a fully automated machine learning toolkit for spinal deformities to help avoid invasive surgery.
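The smoothing-then-corner-detection step can be sketched in plain Python as follows: a Gaussian filter with sigma = 1 (as in the text) is applied, the Harris response R = det(M) - k * trace(M)^2 is computed from the Gaussian-weighted structure tensor M, and the strongest response is returned. The value k = 0.05, the 3-sigma kernel truncation, and the helper names are assumptions; in practice a library routine such as OpenCV's `cv2.cornerHarris` or scikit-image's `skimage.feature.corner_harris` would be used, and several local maxima (e.g., the four corners of each vertebra) would be kept rather than a single peak.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def gaussian_kernel(sigma: float = 1.0) -> List[float]:
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    r = max(1, math.ceil(3 * sigma))
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-r, r + 1)]
    total = sum(k)
    return [v / total for v in k]


def smooth(img: Image, k: List[float]) -> Image:
    """Separable Gaussian filtering (rows, then columns) with edge replication."""
    h, w, r = len(img), len(img[0]), len(k) // 2

    def clamp(v: int, size: int) -> int:
        return min(max(v, 0), size - 1)

    rows = [[sum(k[j] * img[y][clamp(x + j - r, w)] for j in range(len(k)))
             for x in range(w)] for y in range(h)]
    return [[sum(k[j] * rows[clamp(y + j - r, h)][x] for j in range(len(k)))
             for x in range(w)] for y in range(h)]


def harris_response(img: Image, sigma: float = 1.0, k: float = 0.05) -> Image:
    """Harris corner response on a Gaussian-smoothed image."""
    g = gaussian_kernel(sigma)
    s = smooth(img, g)  # pre-smoothing with sigma = 1, as described in the text
    h, w = len(s), len(s[0])
    # Central-difference image gradients (set to zero on the border).
    ix = [[(s[y][x + 1] - s[y][x - 1]) / 2 if 0 < x < w - 1 else 0.0
           for x in range(w)] for y in range(h)]
    iy = [[(s[y + 1][x] - s[y - 1][x]) / 2 if 0 < y < h - 1 else 0.0
           for x in range(w)] for y in range(h)]
    # Structure-tensor entries, averaged over a Gaussian window.
    sxx = smooth([[a * a for a in row] for row in ix], g)
    syy = smooth([[b * b for b in row] for row in iy], g)
    sxy = smooth([[a * b for a, b in zip(ra, rb)] for ra, rb in zip(ix, iy)], g)
    # R = det(M) - k * trace(M)^2: positive near corners, negative along straight edges.
    return [[sxx[y][x] * syy[y][x] - sxy[y][x] ** 2 - k * (sxx[y][x] + syy[y][x]) ** 2
             for x in range(w)] for y in range(h)]


def strongest_corner(resp: Image) -> Tuple[int, int]:
    """(row, col) of the maximum Harris response."""
    return max(((y, x) for y in range(len(resp)) for x in range(len(resp[0]))),
               key=lambda p: resp[p[0]][p[1]])


# Synthetic 20 x 20 image with a bright square spanning rows/cols 5..14.
img = [[1.0 if 5 <= y <= 14 and 5 <= x <= 14 else 0.0 for x in range(20)] for y in range(20)]
resp = harris_response(img)
print(strongest_corner(resp))
```

On this synthetic image the peak lands next to one of the square's corners, while points along its straight edges receive negative responses.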

Abbreviations
Abbreviations used in the manuscript are described below.